OpenAI’s Voice Engine promises the ability to clone voices with just 15 seconds of audio

Companies such as Age of Learning, HeyGen, and Dimagi already have access to the tool. Creating and cloning synthetic voices was once restricted to major studios; now these technologies are becoming more accessible and widely used.

OpenAI recently announced Voice Engine, a new AI tool designed to create personalized voices. Although the model is still in a preliminary preview stage, the samples shown are impressively high quality. With just 15 seconds of audio and a simple text input, Voice Engine can generate expressive, realistic speech that closely resembles the original voice.
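To make that input/output contract concrete, here is a minimal, purely hypothetical sketch of the kind of request the article describes: a short reference clip plus a text prompt, yielding synthesized speech. Voice Engine has no public API, so every class and function name below is an illustrative assumption, not OpenAI's interface.

```python
# Purely illustrative sketch of the workflow described above; Voice Engine
# has no public API, so every name here is a hypothetical stand-in.
from dataclasses import dataclass


@dataclass
class VoiceCloneRequest:
    reference_audio_path: str  # ~15 seconds of the speaker to imitate
    text: str                  # what the cloned voice should say
    language: str = "en"       # target language of the generated speech


def synthesize(request: VoiceCloneRequest) -> bytes:
    """Hypothetical call that would return synthesized audio bytes."""
    raise NotImplementedError("Illustrative only; no public endpoint exists.")


# Building a request mirrors the article's example of re-voicing the same
# text in another language while keeping the original speaker's timbre.
request = VoiceCloneRequest(
    reference_audio_path="speaker_sample.wav",
    text="Hola, gracias por escuchar este ejemplo.",
    language="es",
)
print(request)
```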

This capability has significant implications, allowing anyone to use the voice of famous individuals for different purposes, such as creating humorous content, falsifying recordings, or even committing fraud. Considering the potential misuse of this technology, the tool is initially being tested by a restricted group of users to ensure its security and integrity.

Although the voice synthesizer associated with Voice Engine was previously used to power ChatGPT’s audio features, it is now presented as a standalone tool, offering new possibilities for voice creation and customization.

Among the companies with access to Voice Engine are Age of Learning, specializing in educational technology; the visual storytelling platform HeyGen; healthcare software manufacturer Dimagi; the creator of the AI communication app Livox; and the healthcare system Lifespan. These companies, already working with synthetic voices, now have the opportunity to explore new possibilities with this advanced technology.

OpenAI’s blog post showcases several samples of Voice Engine in action. In one of them, a single recorded reading was used to generate versions of the same text in other languages, such as Spanish, Mandarin, German, French, and Japanese. Notably, each AI-generated sample preserved the tone and accent of the original speaker, demonstrating the system’s accuracy.

This demonstration reveals the diverse potential of the voice generator. In the field of accessibility, for example, a person who lost their ability to speak due to an accident could have their voice cloned and used in devices, allowing for more natural communication. While this usage already existed, it was generally associated with generic voices. In entertainment and content production, the ability to have videos in multiple languages could transform local influencers into global figures with minimal effort.

However, the potential of this technology also raises significant concerns, especially regarding misinformation, crime, fraud, and scams. OpenAI says it is aware of these concerns and hopes that this preview, together with feedback from its initial users, will open a dialogue on the responsible use of synthetic voices. For this reason, the public release of Voice Engine will only happen after security measures are in place to prevent audio forgery. The prospect of such a tool being released in a year with elections across multiple countries underlines the challenges that need to be considered and addressed.

Additionally, collaboration across various sectors—including government, media, entertainment, education, civil society, and others—is crucial to testing the tool and providing feedback that can contribute to building a safer platform, although there is some skepticism about this possibility.

As highlighted in the company’s statement, several security measures have already been implemented. The terms of use prohibit using anyone’s voice without their consent or a legal right to do so; users must disclose that the voices were generated by Voice Engine; each file carries a watermark so its origin can be traced; and usage of the tool is monitored.
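As a rough illustration of what per-file watermarking can mean in practice, the sketch below embeds and recovers a short payload in the least significant bits of 16-bit PCM samples. This is a toy example of the general concept only; OpenAI has not disclosed how its watermark works, and production schemes are designed to survive compression, resampling, and editing.

```python
import numpy as np


def embed_watermark(samples: np.ndarray, payload: bytes) -> np.ndarray:
    """Hide a byte payload in the least significant bit of int16 PCM samples.

    Toy illustration of audio watermarking in general, not OpenAI's
    (undisclosed) scheme.
    """
    bits = np.unpackbits(np.frombuffer(payload, dtype=np.uint8)).astype(np.int16)
    if bits.size > samples.size:
        raise ValueError("payload too large for this clip")
    marked = samples.copy()
    marked[: bits.size] = (marked[: bits.size] & ~1) | bits
    return marked


def extract_watermark(samples: np.ndarray, n_bytes: int) -> bytes:
    """Read back the first n_bytes embedded by embed_watermark."""
    bits = (samples[: n_bytes * 8] & 1).astype(np.uint8)
    return np.packbits(bits).tobytes()


# Round trip on a fake clip: one second of silence at 16 kHz, int16 PCM.
clip = np.zeros(16_000, dtype=np.int16)
tagged = embed_watermark(clip, b"origin:org-1234")
assert extract_watermark(tagged, len(b"origin:org-1234")) == b"origin:org-1234"
```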

OpenAI acknowledges the need for significant changes as AI-generated audio becomes more widely available. For example, phasing out voice-based authentication for bank accounts is being considered. The company emphasizes that any large-scale deployment of synthetic voice technology must be accompanied by voice authentication experiences that ensure the original speaker is consciously adding their voice to the service. Furthermore, it is essential to have a prohibited voices list that detects and prevents the creation of voices too similar to those of prominent figures.
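A blocklist of prohibited voices could, for instance, compare a speaker embedding of the uploaded sample against embeddings of prominent figures and reject anything too similar. The sketch below shows that comparison using cosine similarity; how OpenAI actually computes embeddings or sets thresholds is not public, so the random vectors and the 0.85 cutoff are illustrative assumptions.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def matches_blocked_voice(sample_embedding: np.ndarray,
                          blocklist: dict[str, np.ndarray],
                          threshold: float = 0.85) -> str | None:
    """Return the name of the matched protected voice, or None if allowed.

    The threshold is an illustrative assumption, not a published value.
    """
    for name, protected in blocklist.items():
        if cosine_similarity(sample_embedding, protected) >= threshold:
            return name
    return None


# Toy usage: random vectors stand in for real speaker embeddings produced
# by a speaker-verification model.
rng = np.random.default_rng(0)
blocklist = {"public-figure-a": rng.normal(size=192)}
upload = blocklist["public-figure-a"] + rng.normal(scale=0.05, size=192)
print(matches_blocked_voice(upload, blocklist))  # -> public-figure-a
```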

These considerations highlight the uncertainty around when the tool will reach the general public, and the importance of developing technical and ethical-legal safeguards in parallel to ensure the integrity of any content. How the model was trained also remains undisclosed.

Text-to-speech generation is a field of generative AI that continues to evolve, with companies such as Podcastle and ElevenLabs using similar techniques. One tool that gained significant attention early last year was Microsoft’s VALL-E, which needs just 3 seconds of audio to capture the nuances of a voice, preserving the speaker’s emotional tone and acoustic environment while synthesizing entirely new speech, even when conditions and emotional tone change slightly.

All of this reinforces the idea that, in the near future, people will need to develop the ability to question and investigate whether something is “real” or not. It is likely that children will soon have school subjects teaching verification techniques, including through coding, so they are not deceived by manipulated metadata.

Where Spotify once needed to partner with AI companies to produce songs by deceased singers, as happened in 2016 when a new track was created for Brazilian rapper Sabotage, who died in 2003, now anyone can create songs by famous singers, living or not. This was demonstrated by “Heart On My Sleeve,” which mimics the voices of Drake and The Weeknd and caused a significant stir last year.

It is undeniable that Generative Artificial Intelligence (GenAI) can bring a revolution, especially for the audiovisual industry and, more specifically, for the music industry. Its influence on music will be significant, not only with computers writing songs but also by stimulating new forms of audio synthesis, track mastering, creation of previously impossible instruments, and voice replication.

Setting the creative aspect aside, however, the risks involved are considerable. It is therefore crucial to demand that developers disclose the data on which their models were trained, ensuring transparency in the process.

At the same time, we need ethical and legal mechanisms to protect ourselves, as even recording a meeting could be used for malicious purposes. While GenAI can open new avenues for creation and previously unimaginable outreach possibilities in the creative industry, in our daily lives, we face more risks than advantages. The challenge is to understand where this evolution will take us.

Aarushi Sharma, an editor at TK since 2024.
