Revolutionizing Text-to-Speech Synthesis: Introducing NaturalSpeech-3 with Factorized Diffusion Models
Recent advancements in text-to-speech (TTS) synthesis have struggled to achieve high-quality results due to the complexity of speech, which involves multiple attributes such as content, prosody, timbre, and acoustic details. While scaling up dataset size and model complexity has shown promise for zero-shot TTS, issues with voice quality, speaker similarity, and prosody persist. One line of attack is to decompose speech into distinct subspaces representing different attributes and generate each individually. Nonetheless, effectively disentangling these attributes remains difficult, even with approaches such as neural audio codecs based on residual vector quantization.

Researchers from Microsoft Research Asia and Microsoft Azure Speech, the University of Science and Technology of China, The Chinese University of Hong Kong, Zhejiang University, The University of Tokyo, and Peking University have developed a TTS system called NaturalSpeech 3. This method employs factorized diffusion models to generate high-quality speech in a zero-shot manner. The approach uses a neural codec with factorized vector quantization (FVQ) to disentangle the speech waveform into distinct subspaces of content, prosody, timbre, and acoustic details. A factorized diffusion model then generates the attributes in each subspace based on corresponding prompts. This factorization simplifies the speech representation, enabling efficient learning and improved attribute control.
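The core idea of factorized vector quantization can be illustrated with a minimal sketch: a frame-level speech latent is split into attribute-specific subspaces, and each subspace is quantized against its own codebook, yielding one discrete token stream per attribute. The dimensions, codebook sizes, and random codebooks below are toy assumptions for illustration only; the actual FACodec uses learned encoders, much larger codebooks, and additional training objectives.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: one subspace (and one codebook) per speech attribute.
# All sizes here are illustrative, not the paper's.
ATTRIBUTES = ["content", "prosody", "timbre", "acoustic_details"]
SUBSPACE_DIM = 8       # per-attribute latent dimension (assumed)
CODEBOOK_SIZE = 16     # entries per codebook (assumed)

# Random stand-in codebooks; in FACodec these would be learned.
codebooks = {
    name: rng.normal(size=(CODEBOOK_SIZE, SUBSPACE_DIM))
    for name in ATTRIBUTES
}

def factorized_quantize(latent):
    """Split a frame-level latent into attribute subspaces and quantize
    each against its own codebook (nearest neighbor in L2 distance)."""
    tokens, recon = {}, []
    for i, name in enumerate(ATTRIBUTES):
        sub = latent[i * SUBSPACE_DIM:(i + 1) * SUBSPACE_DIM]
        dists = np.linalg.norm(codebooks[name] - sub, axis=1)
        idx = int(np.argmin(dists))           # discrete token for this attribute
        tokens[name] = idx
        recon.append(codebooks[name][idx])    # quantized subspace vector
    return tokens, np.concatenate(recon)

frame = rng.normal(size=SUBSPACE_DIM * len(ATTRIBUTES))
tokens, quantized = factorized_quantize(frame)
print(tokens)             # one discrete code per attribute
print(quantized.shape)    # (32,)
```

Because each attribute lives in its own quantized subspace, downstream models can generate or swap one token stream (e.g., prosody) without disturbing the others, which is what enables the attribute control described above.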

Recent advancements in TTS research have focused on four key areas: zero-shot synthesis, speech representations, generation methods, and attribute disentanglement. Zero-shot TTS aims to generate speech for unseen speakers using various data representations and modeling techniques. Speech representations have evolved from traditional waveform- and mel-spectrogram-based approaches to more data-driven methods such as discrete tokens and continuous vectors. Generation methods fall into autoregressive (AR) and non-autoregressive (NAR) models, with NAR models offering advantages in robustness and speed, while AR models offer higher diversity and expressiveness. Attribute disentanglement techniques, such as those built on neural speech codecs, aim to separate speech attributes like content, prosody, and timbre for improved synthesis quality.

NaturalSpeech 3 is an advanced text-to-speech system prioritizing high quality, similarity, and control. It utilizes a neural speech codec (FACodec) and a factorized diffusion model to individually handle speech attributes such as duration, prosody, content, acoustic details, and timbre. This design yields superior synthesis quality and controllability. Building on previous versions, it emphasizes diverse synthesis across various scenarios, leveraging large datasets for zero-shot synthesis. The FACodec employs factorized vector quantizers for efficient attribute representation, taming the complexity of speech. As a result, NaturalSpeech 3 offers efficient and effective synthesis with enhanced speech quality and controllability.
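The factorized diffusion model can be pictured as a discrete-diffusion loop run per attribute: start from fully masked token sequences and iteratively predict and re-mask tokens, conditioned on the prompt, until every position is filled. The sketch below is a toy illustration under assumed sizes, with a random stand-in for the learned denoiser; it shows the generation schedule, not the paper's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)
MASK, VOCAB, LENGTH, STEPS = -1, 16, 8, 4  # all toy values, not the paper's

def toy_denoiser(tokens, prompt_tokens, attribute):
    """Stand-in for the attribute-specific diffusion model: it would score
    the prompt and partial sequence; here it fills masks at random."""
    pred = tokens.copy()
    n_masked = int((pred == MASK).sum())
    pred[pred == MASK] = rng.integers(VOCAB, size=n_masked)
    return pred

def generate_attribute(prompt_tokens, attribute, steps=STEPS):
    """Discrete-diffusion-style generation: start fully masked, predict all
    tokens each step, then re-mask a shrinking fraction for refinement."""
    tokens = np.full(LENGTH, MASK)
    for step in range(steps):
        tokens = toy_denoiser(tokens, prompt_tokens, attribute)
        keep_frac = (step + 1) / steps               # unmask progressively
        if step < steps - 1:
            remask = rng.random(LENGTH) > keep_frac  # positions to redo
            tokens[remask] = MASK
    return tokens

# Each attribute is generated with its own (toy) diffusion run,
# conditioned on a prompt, mirroring the factorized design.
prompt = rng.integers(VOCAB, size=LENGTH)
outputs = {attr: generate_attribute(prompt, attr)
           for attr in ["duration", "prosody", "content", "acoustic_details"]}
```

In the real system each attribute's tokens come from the corresponding FACodec subspace and the denoiser is a learned model, but the per-attribute, prompt-conditioned generation loop follows this shape.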

NaturalSpeech 3 demonstrates superior performance in speech quality, similarity, and robustness. Extensive evaluation on the LibriSpeech and RAVDESS datasets shows significant advancements, particularly in generation quality, speaker similarity, and prosody similarity. Ablation studies validate the effectiveness of factorization, classifier-free guidance, and the prosody representation. Furthermore, the scalability evaluation shows that the system improves with larger datasets and model sizes, underscoring its potential for further gains.

In conclusion, NaturalSpeech 3 is a groundbreaking TTS system incorporating a neural speech codec, FACodec, and factorized diffusion models. By disentangling speech attributes into distinct subspaces and synthesizing them with discrete diffusion, NaturalSpeech 3 achieves remarkable advancements in speech quality, similarity, prosody, and intelligibility, and it enables the manipulation of fine-grained speech attributes. Scaling the model to 1B parameters and 200K hours of data further enhances its performance. Nonetheless, the system's reliance on English data from LibriVox limits its voice diversity and multilingual capabilities, which the researchers aim to address through expanded data collection.


Check out the Paper. All credit for this research goes to the researchers of this project.


Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.


