Meta AI researchers have recently achieved a major breakthrough in generative AI for speech. They have developed Voicebox, an advanced AI model that delivers state-of-the-art performance and can generalize to speech-generation tasks it was not specifically trained on.
Unlike previous speech-generation models, Voicebox uses a novel approach called Flow Matching, which outperforms diffusion models. Voicebox surpasses existing models in both intelligibility and audio similarity while also being up to 20 times faster. Moreover, it can synthesize speech in six languages and perform noise removal, content editing, style conversion, and diverse sample generation.
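As a rough illustration of what a flow-matching training objective looks like, here is a minimal sketch in plain Python. It follows one common formulation from the flow-matching literature (a straight-line path from noise to data with a regression loss on the velocity); this article does not describe Voicebox's exact objective, so the function names, the `SIGMA_MIN` value, and the toy `model` callable are all assumptions made for illustration.

```python
import random

SIGMA_MIN = 1e-4  # assumed small constant; illustrative value only


def interpolate(x0, x1, t, sigma_min=SIGMA_MIN):
    """Point at time t on a straight-line path from noise x0 to data x1."""
    return [(1 - (1 - sigma_min) * t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1, sigma_min=SIGMA_MIN):
    """Time derivative of the path above; this is what the model regresses to."""
    return [b - (1 - sigma_min) * a for a, b in zip(x0, x1)]


def flow_matching_loss(model, x1, rng):
    """Mean squared error between predicted and target velocity for one sample.

    `model` is a hypothetical callable (x_t, t) -> list of floats.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # noise sample
    t = rng.random()                        # time in [0, 1)
    xt = interpolate(x0, x1, t)
    u = target_velocity(x0, x1)
    v = model(xt, t)
    return sum((vi - ui) ** 2 for vi, ui in zip(v, u)) / len(x1)


if __name__ == "__main__":
    # Toy stand-in for a neural network: always predicts zero velocity.
    zero_model = lambda xt, t: [0.0] * len(xt)
    print(flow_matching_loss(zero_model, [0.5, -0.5], random.Random(0)))
```

In a real system the model would be a neural network conditioned on text and audio context, and the loss would be averaged over many samples and minimized with gradient descent.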
Traditionally, generative AI for speech required dedicated training for each specific task using carefully curated data. Voicebox breaks this barrier by learning from raw audio and its accompanying transcription. This allows the model to modify any part of a given sample rather than being limited to changing only the end of an audio clip.
The researchers trained Voicebox using over 50,000 hours of recorded speech and transcripts from public-domain audiobooks in English, French, Spanish, German, Polish, and Portuguese. The model was trained to predict speech segments based on surrounding speech and corresponding transcripts. By learning to infill speech from context, Voicebox can generate portions of speech in the middle of an audio recording without recreating the entire input.
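The infilling setup described above can be sketched as follows: hide a contiguous span of audio frames, give the model the remaining context, and score its prediction only on the hidden span. This is a simplified illustration under stated assumptions, not Voicebox's actual data pipeline; frames are represented as plain floats, and the helper names `make_infilling_example` and `masked_loss` are hypothetical.

```python
def make_infilling_example(frames, start, end, fill=0.0):
    """Hide frames[start:end] and return (context, mask, target).

    context: frames with the hidden span replaced by `fill`
    mask:    True where a frame was hidden
    target:  the original values of the hidden frames
    """
    if not 0 <= start < end <= len(frames):
        raise ValueError("span must satisfy 0 <= start < end <= len(frames)")
    mask = [start <= i < end for i in range(len(frames))]
    context = [fill if m else f for f, m in zip(frames, mask)]
    target = [f for f, m in zip(frames, mask) if m]
    return context, mask, target


def masked_loss(predicted, frames, mask):
    """Mean squared error computed over the hidden frames only."""
    errors = [(p - f) ** 2 for p, f, m in zip(predicted, frames, mask) if m]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    frames = [0.1, 0.2, 0.3, 0.4, 0.5]
    context, mask, target = make_infilling_example(frames, 1, 3)
    print(context)  # hidden span replaced by the fill value
    print(target)   # what the model is asked to reconstruct
```

Because the loss ignores the unmasked frames, the model is free to copy the surrounding context and only needs to learn to fill the gap, which is what makes mid-recording edits possible without regenerating the whole clip.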
Voicebox’s versatility enables it to excel in various speech-generation tasks. It can perform in-context text-to-speech synthesis, cross-lingual style transfer, speech denoising and editing, and diverse speech sampling. For example, given a two-second input audio sample, Voicebox can match the audio style and use it for text-to-speech generation. This capability has potential applications in helping people who are unable to speak and in customizing voices for virtual assistants and non-player characters.
Another impressive feature of Voicebox is its ability to perform cross-lingual style transfer. Given a speech sample and a text passage in one of the supported languages, Voicebox can generate a reading of the text in that language. This capability could facilitate natural and authentic communication among people who speak different languages.
Moreover, Voicebox’s in-context learning makes it proficient at seamlessly editing segments within audio recordings. It can resynthesize speech segments corrupted by short-duration noise or replace misspoken words without re-recording the entire speech. This capability simplifies the process of cleaning up and editing audio, potentially revolutionizing audio editing tools.
Furthermore, Voicebox’s training on diverse real-world data enables it to generate speech that better represents how people naturally talk across different languages. This ability could be used to generate synthetic data for training speech assistant models. Remarkably, speech recognition models trained on Voicebox-generated synthetic speech achieve near-parity with models trained on real speech, with minimal accuracy degradation.
While the researchers acknowledge the importance of openness and sharing research with the AI community, they are withholding public access to the Voicebox model and code because of the potential risks of misuse. In their research paper, they describe the development of a highly effective classifier that distinguishes between authentic speech and audio generated with Voicebox, aiming to mitigate possible future risks.
Voicebox represents a significant advancement in generative AI for speech, offering a flexible and efficient model that exhibits task generalization capabilities. With the potential for various applications, Voicebox opens up new possibilities for speech synthesis, cross-lingual communication, audio editing, and training speech recognition models. As the research community builds upon this breakthrough, the field of generative AI for speech is poised for exciting advancements and discoveries.
Check out the Paper and Meta Article.
Niharika is a technical consulting intern at Marktechpost. She is a third-year undergraduate, currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Kharagpur. She is a highly enthusiastic individual with a keen interest in machine learning, data science, and AI, and an avid reader of the latest developments in these fields.