Meet CoMoSpeech: A Consistency Model-Based Method For Speech Synthesis That Achieves Fast And High-Quality Audio Generation


With the growth of human-machine interaction and entertainment applications, text-to-speech (TTS) and singing voice synthesis (SVS) have become central tasks in speech synthesis, which aims to generate realistic audio of human voices. Deep neural network (DNN)-based methods have largely taken over the field. Typically, a two-stage pipeline is used: an acoustic model first converts text and other controlling information into acoustic features (such as mel-spectrograms), and a vocoder then converts those acoustic features into audible waveforms.
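The two-stage pipeline above can be sketched in terms of shapes alone. This is a minimal illustration, not any particular system: the function names, the frames-per-token ratio, and the mel/hop sizes are assumed placeholders standing in for trained acoustic-model and vocoder networks.

```python
import numpy as np

# Hypothetical two-stage pipeline sketch: the acoustic model maps text
# (here, a token-id sequence) to a mel-spectrogram, and the vocoder
# expands each mel frame into many waveform samples. Shapes only --
# real systems use trained neural networks for both stages.

N_MELS = 80          # mel-spectrogram frequency bins (a common choice)
HOP_LENGTH = 256     # waveform samples produced per mel frame (assumed)

def acoustic_model(token_ids, frames_per_token=5):
    """Toy stand-in: predict one block of mel frames per input token."""
    n_frames = len(token_ids) * frames_per_token
    rng = np.random.default_rng(0)
    return rng.standard_normal((n_frames, N_MELS))   # (frames, mels)

def vocoder(mel):
    """Toy stand-in: each mel frame becomes HOP_LENGTH waveform samples."""
    return np.zeros(mel.shape[0] * HOP_LENGTH)

tokens = [12, 7, 42, 3]          # "text" as token ids
mel = acoustic_model(tokens)     # stage 1: short sequence -> mel frames
wav = vocoder(mel)               # stage 2: mel frames -> long waveform
```

The point of the relay is visible in the shapes: 4 tokens become 20 mel frames, which become 5,120 waveform samples, so neither stage has to bridge the full dimension gap on its own.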

The two-stage pipeline has succeeded because it acts as a "relay" that defuses the dimension-exploding problem of mapping short texts to long audio at a high sampling rate. The acoustic features the acoustic model produces, described frame by frame and typically taking the form of a mel-spectrogram, strongly determine the quality of the synthesized speech. Convolutional neural networks (CNNs) and Transformers are widely employed in industry-standard methods such as Tacotron, DurIAN, and FastSpeech to predict the mel-spectrogram from the controlling input. More recently, the ability of diffusion models to generate high-quality samples has attracted a great deal of interest. A diffusion model, also referred to as a score-based model, consists of two processes: a diffusion process that gradually perturbs data into noise, and a reverse process that gradually transforms noise back into data. The diffusion model's main drawback is that generation requires many iterations. Several diffusion-based techniques have been proposed for acoustic modeling in speech synthesis, but most of them still suffer from slow generation speed.
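The forward diffusion process described above can be written in closed form, which also makes the speed problem concrete: by the final step essentially no signal survives, and the reverse process must recover it one step at a time. The schedule and step count below are illustrative defaults, not the values used by any specific TTS paper.

```python
import numpy as np

# Sketch of the forward diffusion process: data is gradually perturbed
# toward Gaussian noise over T steps. Linear beta schedule is a common
# illustrative choice, not taken from the CoMoSpeech paper.

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)        # per-step noise schedule
alpha_bars = np.cumprod(1.0 - betas)      # cumulative signal retention

x0 = rng.standard_normal(80)              # one "mel frame" of data

def diffuse(x0, t):
    """Sample x_t ~ q(x_t | x_0) in closed form (no iteration needed)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

x_mid = diffuse(x0, T // 2)               # partially noised
x_end = diffuse(x0, T - 1)                # nearly pure noise:
# alpha_bars[-1] is tiny, so almost no signal remains -- undoing this
# requires many reverse iterations in a vanilla diffusion sampler.
```

Note the asymmetry: noising any step is a one-shot closed-form sample, while standard reverse sampling walks back through hundreds of steps. This is exactly the cost that distillation-style methods try to eliminate.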

Grad-TTS formulated the noise-to-mel-spectrogram transformation as a stochastic differential equation (SDE) and solves the corresponding reverse SDE at inference time. Despite producing high audio quality, its inference is slow because the reverse process requires many iterations (10-1000). ProDiff was later developed with progressive distillation to reduce the number of sampling steps. DiffGAN-TTS (Liu et al.) used an adversarially trained model to approximate the denoising function for efficient speech synthesis. ResGrad (Chen et al.) uses the diffusion model to estimate the residual between a pre-trained FastSpeech2's prediction and the ground truth.


From the outline above, it is clear that speech synthesis has three goals:

Excellent audio quality: The generative model should faithfully capture the subtleties of the voice that contribute to the expressiveness and naturalness of the synthesized audio. Recent research has moved beyond the ordinary speaking voice to voices with more intricate variations in pitch, timing, and emotion. DiffSinger, for example, demonstrates how a well-designed diffusion model can produce a synthesized singing voice of excellent quality after 100 iterations. It is also essential to prevent artifacts and distortions in the generated audio.

Quick inference: Fast audio synthesis is vital for real-time applications, including communication, interactive speech, and music systems. Being merely faster than real time is insufficient once the time budget must be shared with other algorithms in an integrated system.

Beyond speaking: More intricate voice modeling, such as the singing voice, is required beyond the ordinary speaking voice in terms of pitch, emotion, rhythm, breath control, and timbre.

Although many attempts have been made, the trade-off between synthesized audio quality, model capability, and inference speed persists in TTS. It is even more pronounced in SVS because of how the denoising diffusion process performs sampling. Existing approaches often aim to mitigate rather than fully resolve the slow-inference problem, and they remain slower than traditional non-diffusion approaches such as FastSpeech2.

The recently developed consistency model produces high-quality images with only one sampling step by expressing the stochastic differential equation (SDE) that describes the sampling process as an ordinary differential equation (ODE) and further enforcing the consistency property of the model along the ODE trajectory. Despite this success in image synthesis, there is currently no known speech synthesis model based on the consistency model. This suggests that it is feasible to develop a consistency model-based speech synthesis technique that combines high-quality synthesis with fast inference speed.
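The one-step property rests on a particular parameterization of the consistency function. The sketch below follows the standard skip/output scalings from the consistency-models literature (the constants are the usual EDM-style choices, assumed here rather than taken from the CoMoSpeech paper): they enforce the boundary condition f(x, eps) = x, so a single network evaluation at any noise level maps a noisy sample straight to a clean-data estimate.

```python
import numpy as np

SIGMA_DATA = 0.5      # assumed data-scale constant (common default)
EPS = 0.002           # smallest noise level on the ODE trajectory

def c_skip(t):
    """Skip scaling: equals 1 at t = EPS."""
    return SIGMA_DATA**2 / ((t - EPS)**2 + SIGMA_DATA**2)

def c_out(t):
    """Output scaling: equals 0 at t = EPS."""
    return SIGMA_DATA * (t - EPS) / np.sqrt(t**2 + SIGMA_DATA**2)

def f(x, t, network):
    """Consistency function: one evaluation yields a clean-data estimate."""
    return c_skip(t) * x + c_out(t) * network(x, t)

toy_net = lambda x, t: np.tanh(x)   # stand-in for the trained model

x = np.array([0.3, -1.2, 0.8])
# At t = EPS the scalings force f to be the identity, whatever the
# network outputs -- this is the boundary condition that makes the
# consistency property well-defined along the ODE trajectory.
identity_at_boundary = np.allclose(f(x, EPS, toy_net), x)   # True
```

Training then only has to make f agree with itself across noise levels on the same trajectory; sampling becomes a single call to f at the highest noise level.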

In this study, researchers from Hong Kong Baptist University, Hong Kong University of Science and Technology, Microsoft Research Asia, and Hong Kong Institute of Science & Innovation offer CoMoSpeech, a fast, high-quality speech synthesis approach based on consistency models. CoMoSpeech is distilled from a pre-trained teacher model. More specifically, the teacher model uses the SDE to learn the corresponding score function and smoothly map the mel-spectrogram into the Gaussian noise distribution. After training, they construct the teacher's denoiser function with the associated numerical ODE solver, which is then used for consistency distillation. The distillation yields CoMoSpeech, which satisfies the consistency property and can therefore generate high-quality audio with a single sampling step.
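One consistency-distillation step of the kind described above can be sketched as follows. Everything here is a hedged toy: the "teacher score", the Euler ODE step, and the linear "student" are illustrative stand-ins, not CoMoSpeech's actual modules (a real setup would also use an EMA copy of the student with stopped gradients for the target branch).

```python
import numpy as np

rng = np.random.default_rng(0)

def teacher_score(x, t):
    """Stand-in for the teacher's learned score/denoiser function."""
    return -x / (t**2 + 1.0)

def ode_solver_step(x, t2, t1):
    """One Euler step of a probability-flow ODE, dx/dt = -t * score."""
    return x + (t1 - t2) * (-t2 * teacher_score(x, t2))

def student(x, t, w):
    """Toy linear 'consistency model' with a single parameter w."""
    return w * x

def distill_loss(w, x0, t1, t2):
    eps = rng.standard_normal(x0.shape)
    x_t2 = x0 + t2 * eps                   # noised sample, higher level
    x_t1 = ode_solver_step(x_t2, t2, t1)   # teacher moves it one step back
    # Consistency objective: the student's clean-data estimates at two
    # adjacent points on the same teacher trajectory should agree.
    return float(np.mean((student(x_t2, t2, w) - student(x_t1, t1, w))**2))

x0 = rng.standard_normal(16)
loss = distill_loss(0.9, x0, t1=0.5, t2=0.6)
```

Minimizing this agreement loss over all adjacent noise-level pairs is what transfers the teacher's full multi-step trajectory into a model that can jump from noise to data in one step.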

The findings of their TTS and SVS experiments show that CoMoSpeech can synthesize audio in a single sampling step, more than 150 times faster than real time. The audio-quality evaluation also shows that CoMoSpeech matches or exceeds other diffusion-model techniques that need tens to hundreds of iterations, making diffusion model-based speech synthesis practical for the first time. Several audio examples are available on their project website.

Check out the Paper and Project.


Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.

