AI voice actors sound more human than ever – and they're ready to hire

The company’s blog post drips with the enthusiasm of a 1990s US infomercial. WellSaid Labs describes what customers can expect from its “eight new digital voice actors!” Tobin is “energetic and insightful.” Paige is “poised and expressive.” Ava is “polished, confident, and professional.”

Each is based on a real voice actor, whose likeness has been preserved (with consent) using AI. Companies can now license these voices to say whatever they need. They simply feed some text into the voice engine, and out comes a crisp audio clip of a natural-sounding voice.

WellSaid Labs, a Seattle-based startup spun out of the nonprofit Allen Institute for Artificial Intelligence, is the latest company to offer AI voices to customers. For now, it specializes in voices for corporate e-learning videos. Other startups make voices for digital assistants, call center operators, and even video game characters.

Not too long ago, such deepfake voices had a bad reputation for their use in scam calls and internet trickery. But their improving quality has since piqued the interest of a growing number of companies. Recent breakthroughs in deep learning have made it possible to replicate many of the subtleties of human speech. These voices pause and breathe in all the right places. They can change style or emotion. You can spot the trick if they speak for too long, but in short audio clips, some have become indistinguishable from humans.

AI voices are also cheap, scalable, and easy to work with. Unlike a recording of a human voice actor, synthetic voices can also update their script in real time, opening up new opportunities to personalize advertising.

But the rise of hyperrealistic fake voices is not without consequences. Human voice actors, in particular, have been left wondering what this means for their livelihoods.

How to fake a voice

Synthetic voices have been around for a while. But the old ones, including the voices of the original Siri and Alexa, simply glued words and sounds together, achieving a clunky, robotic effect. Making them sound more natural was a laborious manual task.

Deep learning changed that. Voice developers no longer need to dictate the exact pacing, pronunciation, or intonation of the generated speech. Instead, they can feed a few hours of audio into an algorithm and have the algorithm learn those patterns on its own.

“If I’m Pizza Hut, I certainly can’t sound like Domino’s, and I certainly can’t sound like Papa John’s.”

Rupal Patel, founder and CEO of VocaliD

Over the years, researchers have used this basic idea to build voice engines that are more and more sophisticated. The one WellSaid Labs built, for example, uses two primary deep-learning models. The first predicts, from a passage of text, the broad strokes of what a speaker will sound like, including accent, pitch, and timbre. The second fills in the details, including breaths and the way the voice resonates in its environment.
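The coarse-then-fine split described above can be illustrated with a toy sketch. To be clear, this is not WellSaid’s actual system: it is an assumed structural analogy in which a first stage maps text to a coarse pitch contour (standing in for the broad-strokes model) and a second stage layers fine-grained variation on top (standing in for the detail model).

```python
import numpy as np

def coarse_prosody(text: str, frames_per_char: int = 5) -> np.ndarray:
    """Stage 1 (toy): map text to a coarse pitch contour in Hz.
    A crude stand-in for the model that predicts broad strokes
    such as accent, pitch, and timbre."""
    base = 120.0  # assumed base pitch for this illustration
    contour = []
    for ch in text.lower():
        # vowels raised slightly: a crude stand-in for intonation patterns
        pitch = base + (15.0 if ch in "aeiou" else 0.0)
        contour.extend([pitch] * frames_per_char)
    contour = np.array(contour)
    # gentle declination across the utterance, like a sentence-final fall
    contour -= np.linspace(0.0, 10.0, len(contour))
    return contour

def fill_in_detail(contour: np.ndarray, seed: int = 0) -> np.ndarray:
    """Stage 2 (toy): add fine-grained jitter on top of the coarse
    contour, standing in for the model that fills in details such as
    breaths and room acoustics."""
    rng = np.random.default_rng(seed)
    return contour + rng.normal(0.0, 1.5, size=contour.shape)

# Run the two stages in sequence, one pitch value per frame
frames = fill_in_detail(coarse_prosody("hello world"))
```

Real systems predict far richer intermediate representations (typically spectrograms rather than pitch contours), but the division of labor is the same: one model decides the overall shape, another makes it sound alive.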

Making a convincing synthetic voice takes more than just pressing a button, however. Part of what makes a human voice so human is its inconsistency, expressiveness, and ability to deliver the same lines in completely different styles, depending on the context.

Capturing these nuances involves finding the right voice actors to supply the right training data and fine-tuning the deep-learning models. WellSaid says the process requires at least an hour or two of audio and a few weeks of work to develop a realistic synthetic replica of a voice.

AI voices have become particularly popular among brands looking to maintain a consistent sound across millions of interactions with customers. With the ubiquity of smart speakers today, and the rise of automated customer service agents and digital assistants embedded in smart cars and devices, brands may need to produce upwards of a hundred hours of audio a month. But they no longer want to use the generic voices offered by traditional text-to-speech technology, a trend that has accelerated during the pandemic as more and more customers skip in-store interactions to engage with companies virtually.

“If I’m Pizza Hut, I certainly can’t sound like Domino’s, and I certainly can’t sound like Papa John’s,” says Rupal Patel, a professor at Northeastern University and the founder and CEO of VocaliD, which promises to build custom voices that match a company’s brand identity. “These brands have thought about their colors. They’ve thought about their fonts. Now they’ve got to start thinking about the way their voice sounds as well.”

Whereas companies once had to employ different voice actors for different markets (the Northeast versus the Southern United States, say, or France versus Mexico), some AI voice firms can instead manipulate the accent or switch the language of a single voice in different ways. This opens up the possibility of adapting ads on streaming platforms depending on who is listening, changing not just the characteristics of the voice but also the words being spoken. A beer ad could tell a listener to stop by a different pub depending on whether it’s playing in New York or Toronto, for example. One company that designs voices for advertising and smart assistants says it is already working with clients to launch such personalized audio ads on Spotify and Pandora.

The gaming and entertainment industries also see the benefits. Sonantic, a firm that specializes in emotive voices that can laugh and cry or whisper and shout, works with video-game makers and animation studios to supply the voice-overs for their characters. Many of its clients use the synthesized voices only in pre-production and switch to real voice actors for the final production, but Sonantic says a few have begun using them throughout the process, perhaps for characters with fewer lines. The company says it has also worked with films and TV shows to patch up actors’ performances when words get garbled or mispronounced.
