The goal of text-to-speech (TTS) is to generate high-quality, diverse speech that sounds as if it were spoken by real people. Prosody, speaker identity (such as gender, accent, and timbre), speaking and singing styles, and more all contribute to the richness of human speech. TTS systems have improved greatly in intelligibility and naturalness as neural networks and deep learning have progressed; some systems (such as NaturalSpeech) have even reached human-level voice quality on single-speaker recording-studio benchmark datasets.
Due to a lack of diversity in the data, previous speaker-limited recording-studio datasets were insufficient to capture the wide variety of speaker identities, prosodies, and styles in human speech. However, using few-shot or zero-shot techniques, TTS models can be trained on a large corpus to learn this variation and then generalize to the virtually unlimited unseen scenarios. Quantizing the continuous speech waveform into discrete tokens and modeling these tokens with autoregressive language models is common in today's large-scale TTS systems.
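To see why discrete tokenization burdens an autoregressive model, consider the sequence lengths involved. The numbers below (a 75 Hz codec frame rate and 8 residual quantizers, roughly in line with popular neural codecs) are illustrative assumptions, not figures from the paper:

```python
# Illustrative arithmetic: sequence lengths for discrete-token vs. continuous-latent TTS.
# All constants are assumptions for the sake of the example, not values from the paper.

FRAME_RATE = 75       # codec latent frames per second (hypothetical)
NUM_QUANTIZERS = 8    # residual vector quantizers in a discrete codec (hypothetical)

def discrete_token_count(seconds: float) -> int:
    """Tokens an autoregressive LM must model: one per quantizer per frame."""
    return int(seconds * FRAME_RATE * NUM_QUANTIZERS)

def continuous_vector_count(seconds: float) -> int:
    """Continuous latent vectors: just one per frame."""
    return int(seconds * FRAME_RATE)

for secs in (1, 10):
    print(f"{secs:>2}s audio -> {discrete_token_count(secs):>5} discrete tokens "
          f"vs. {continuous_vector_count(secs):>4} continuous vectors")
```

Under these assumptions, 10 seconds of audio becomes 6,000 tokens that an autoregressive model must predict one at a time, whereas a frame-level latent sequence is only 750 vectors long.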
Recent research by Microsoft introduces NaturalSpeech 2, a TTS system that uses latent diffusion models to deliver expressive prosody, good robustness, and, most crucially, strong zero-shot capability for voice synthesis. The researchers began by training a neural audio codec that uses a codec encoder to transform a speech waveform into a sequence of latent vectors and a codec decoder to reconstruct the original waveform. They then use a diffusion model to generate these latent vectors, conditioned on prior vectors obtained from a phoneme encoder, a duration predictor, and a pitch predictor.
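At a high level, that pipeline can be sketched as follows. This is a minimal, hypothetical outline in PyTorch: the module internals, dimensions, and the crude iterative sampler are placeholder assumptions (the pitch predictor is omitted for brevity), not the authors' implementation:

```python
import torch
import torch.nn as nn

LATENT_DIM = 128  # assumed latent width for this sketch

class PhonemeEncoder(nn.Module):
    def __init__(self, vocab_size=100, dim=LATENT_DIM):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)

    def forward(self, phonemes):                      # (B, T_ph) -> (B, T_ph, D)
        return self.embed(phonemes)

class DurationPredictor(nn.Module):
    def __init__(self, dim=LATENT_DIM):
        super().__init__()
        self.proj = nn.Linear(dim, 1)

    def forward(self, h):                             # frames per phoneme, in [1, 20]
        return self.proj(h).squeeze(-1).abs().round().clamp(1, 20).long()

class LatentDiffusion(nn.Module):
    """Stub denoiser: iteratively refines noisy latents toward the prior."""
    def __init__(self, dim=LATENT_DIM):
        super().__init__()
        self.net = nn.Linear(dim * 2, dim)

    def sample(self, prior, steps=50):
        z = torch.randn_like(prior)                   # start from pure noise
        for _ in range(steps):
            z = z + 0.1 * (self.net(torch.cat([z, prior], dim=-1)) - z)
        return z

class CodecDecoder(nn.Module):
    """Stub: maps each latent frame to a window of waveform samples."""
    def __init__(self, dim=LATENT_DIM, hop=320):
        super().__init__()
        self.proj = nn.Linear(dim, hop)

    def forward(self, z):                             # (B, T, D) -> (B, T * hop)
        return self.proj(z).flatten(1)

def synthesize(phonemes):
    h = PhonemeEncoder()(phonemes)                    # phoneme hidden states
    durations = DurationPredictor()(h)                # predicted frames per phoneme
    # Expand phoneme states to frame level according to predicted durations.
    prior = torch.repeat_interleave(h, durations[0], dim=1)
    latents = LatentDiffusion().sample(prior)         # continuous latent vectors
    return CodecDecoder()(latents)                    # latents -> waveform

wav = synthesize(torch.randint(0, 100, (1, 12)))      # e.g. 12 phonemes -> waveform
```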
The following are examples of design decisions discussed in their paper:
- Using continuous vectors instead of discrete tokens. In prior work, speech is usually quantized with numerous residual quantizers to preserve the quality of the neural codec's speech reconstruction. This burdens the acoustic model (an autoregressive language model) heavily, since the resulting discrete token sequence is quite long (see the sequence-length arithmetic above). The team instead employs continuous vectors, which shorten the sequence and provide more information for accurate speech reconstruction at a fine-grained level.
- Replacing autoregressive models with diffusion models.
- In-context learning through speech prompting mechanisms. The team developed speech prompting mechanisms to promote in-context learning in the diffusion model and the pitch/duration predictors, improving zero-shot capability by encouraging the diffusion model to adhere to the characteristics of the speech prompt (see the sketch after this list).
- NaturalSpeech 2 is more reliable and stable than its autoregressive predecessors because it requires only a single acoustic model (the diffusion model) instead of two-stage token prediction. Moreover, its duration/pitch prediction and non-autoregressive generation allow it to extend to styles beyond speech (such as singing voices).
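One plausible way to wire a speech prompt into the diffusion model is cross-attention: the latent frames being denoised attend to the prompt's latent frames, picking up timbre and prosody cues. The block below is an illustrative assumption of how such in-context conditioning might look, not the paper's exact mechanism:

```python
import torch
import torch.nn as nn

class PromptedDenoiserBlock(nn.Module):
    """Illustrative denoiser block: noisy latents attend to prompt latents,
    so generation can copy timbre/prosody cues from the prompt (assumed design)."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, z, prompt):
        # z: (B, T, D) noisy latents being denoised; prompt: (B, T_p, D) prompt latents
        z = z + self.self_attn(z, z, z)[0]
        z = z + self.cross_attn(z, prompt, prompt)[0]   # in-context conditioning
        return z + self.ff(z)

# Usage: condition 200 frames of generation on a ~3-second prompt (225 frames at 75 Hz).
block = PromptedDenoiserBlock()
out = block(torch.randn(1, 200, 128), torch.randn(1, 225, 128))   # -> (1, 200, 128)
```

Because conditioning happens through attention over prompt frames rather than a fixed speaker embedding, an unseen voice can be mimicked from a few seconds of prompt audio at inference time, which is what enables the zero-shot behavior described above.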
To demonstrate the efficacy of these architectures, the researchers trained NaturalSpeech 2 with 400M model parameters on 44K hours of speech data. They then used it to create speech in zero-shot scenarios (with only a few seconds of speech prompt) with various speaker identities, prosody, and styles (e.g., singing). The findings show that NaturalSpeech 2 outperforms prior powerful TTS systems in experiments and generates natural speech in zero-shot conditions. Its prosody is more similar to that of the speech prompt and the ground-truth speech, and it achieves comparable or better naturalness (in terms of CMOS) than the ground-truth speech on the LibriTTS and VCTK test sets. The experimental results also show that it can generate singing voices in a novel timbre from a short singing prompt or, interestingly, from only a speech prompt, unlocking truly zero-shot singing synthesis.
In the future, the team plans to investigate efficient methods, such as consistency models, to speed up the diffusion model, and to explore large-scale speaking and singing voice training to enable stronger mixed speaking/singing capabilities.
Check out the Paper and Project Page. Don't forget to join our 20k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more. If you have any questions regarding the above article or if we missed anything, feel free to email us at Asif@marktechpost.com
🚀 Check Out 100s of AI Tools in AI Tools Club
Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech from the Indian Institute of Technology (IIT), Bhubaneswar. She is a Data Science enthusiast and has a keen interest in the scope of application of artificial intelligence in various fields. She is passionate about exploring new advancements in technologies and their real-life applications.