How Do Schrodinger Bridges Beat Diffusion Models On Text-To-Speech (TTS) Synthesis?

With the rapid pace of advances in Artificial Intelligence, the fields of Natural Language Processing, Natural Language Generation, and Computer Vision have gained massive popularity recently, driven in large part by the introduction of Large Language Models (LLMs). Diffusion models have proven successful in text-to-speech (TTS) synthesis and deliver high generation quality. However, their prior distribution is restricted to a noisy representation that carries little information about the generation target.

In recent research, a team of researchers from Tsinghua University and Microsoft Research Asia has introduced a new text-to-speech system called Bridge-TTS. It is the first attempt to replace the noisy Gaussian prior used in well-established diffusion-based TTS approaches with a clean, deterministic alternative. This alternative prior provides strong structural information about the target and is taken from the latent representation extracted from the text input.

The team has shared that the main contribution is the development of a fully tractable Schrodinger bridge that connects the ground-truth mel-spectrogram and the clean prior. Bridge-TTS uses a data-to-data process, which improves the information content of the prior distribution, in contrast to diffusion models, which operate through a data-to-noise process.
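To make the data-to-data idea concrete, here is a minimal sketch of sampling an intermediate state between two fixed endpoints. It assumes the simplest possible case, a Brownian bridge whose reference SDE has zero drift and a constant diffusion coefficient `g`; the schedule used in Bridge-TTS itself may differ. The function name `bridge_sample` and the random arrays standing in for the mel-spectrogram and the text-derived prior are hypothetical, chosen only for illustration.

```python
import numpy as np

def bridge_sample(x0, x1, t, g=1.0, rng=None):
    """Sample x_t from a Brownian bridge pinned at x0 (t=0) and x1 (t=1).

    Assumption: zero-drift reference SDE dx = g dw with constant g.
    """
    rng = np.random.default_rng() if rng is None else rng
    var_0t = g**2 * t          # variance accumulated from time 0 to t
    var_t1 = g**2 * (1.0 - t)  # variance remaining from time t to 1
    var_01 = g**2              # total variance from time 0 to 1
    # The mean interpolates between the two data endpoints.
    mean = (var_t1 * x0 + var_0t * x1) / var_01
    # The noise level is zero at both ends and largest in the middle.
    std = np.sqrt(var_0t * var_t1 / var_01)
    return mean + std * rng.standard_normal(np.shape(x0))

rng = np.random.default_rng(0)
# Hypothetical stand-ins: 80 mel bins by 100 frames.
mel = rng.standard_normal((80, 100))                # "ground-truth mel-spectrogram"
prior = mel + 0.5 * rng.standard_normal((80, 100))  # "clean text-derived prior"
x_mid = bridge_sample(mel, prior, t=0.5, rng=rng)
```

At `t=0` and `t=1` the sample equals `mel` and `prior` exactly, so both ends of the process are informative data. A diffusion forward process, by contrast, would end in a standard Gaussian sample that is unrelated to the target.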

Experimental validation on the LJ-Speech dataset highlights the efficacy of the proposed method. In both 50-step and 1000-step synthesis settings, Bridge-TTS outperformed its diffusion counterpart, Grad-TTS. It even performed better in few-step scenarios than strong, fast TTS models. The team emphasizes synthesis quality and sampling efficiency as the primary strengths of Bridge-TTS.

The team has summarized the primary contributions as follows.

  1. Mel-spectrograms are generated from a clean text latent representation. This representation, which serves as the condition information in diffusion models, is noise-free. A Schrodinger bridge is used to explore a data-to-data process rather than the usual data-to-noise process.
  1. For paired data, a fully tractable Schrodinger bridge has been proposed. This bridge uses a reference stochastic differential equation (SDE) in a flexible form. The approach allows empirical investigation of design spaces in addition to offering a theoretical explanation.
  1. The team studied how the sampling technique, model parameterization, and noise schedule contribute to improved TTS quality. An asymmetric noise schedule, data prediction, and first-order bridge samplers have been implemented.
  1. The fully tractable Schrodinger bridge makes a complete theoretical explanation of the underlying processes possible. Empirical investigations were carried out to understand how different elements affect TTS quality, including the effects of asymmetric noise schedules, model parameterization choices, and sampling efficiency.
  1. The method has produced strong results in both inference speed and generation quality. It significantly outperformed its diffusion-based counterpart, Grad-TTS, in both 1000-step and 50-step generation. It also outperformed FastGrad-TTS in 4-step generation, the transformer-based model FastSpeech 2, and the state-of-the-art distillation approach CoMoSpeech in 2-step generation.
  1. The method achieved these results after only one training session. This efficiency is visible at several stages of the generation process, demonstrating the reliability and effectiveness of the proposed approach.
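The three design choices named in the list (an asymmetric noise schedule, data prediction, and a first-order bridge sampler) can be illustrated with a small sketch. It makes several assumptions that do not come from the article: a zero-drift reference SDE, a squared diffusion coefficient that grows linearly in time, illustrative values for `beta0` and `beta1`, and a hypothetical `predict_x0` callable standing in for a trained network that predicts the clean mel-spectrogram. Each update draws from the exact Brownian-bridge posterior between the predicted data and the current state; it is not presented as the authors' exact sampler.

```python
import numpy as np

def sigma2(t, beta0=0.01, beta1=20.0):
    """Accumulated variance: integral of g^2(u) = beta0 + u * (beta1 - beta0) over [0, t].

    beta0 and beta1 are illustrative values, not taken from the article.
    """
    return beta0 * t + 0.5 * (beta1 - beta0) * t**2

def bridge_std(t):
    """Standard deviation of the bridge marginal at time t (zero at both endpoints)."""
    s1, st = sigma2(1.0), sigma2(t)
    return np.sqrt(np.maximum(st * (s1 - st) / s1, 0.0))

def first_order_bridge_sampler(x1, predict_x0, n_steps=4, rng=None):
    """Walk from the prior x1 (t=1) back to data (t=0) in n_steps updates.

    predict_x0(x, s) is a hypothetical data-prediction model: given the
    current state x at time s, it returns an estimate of the clean data.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    times = np.linspace(1.0, 0.0, n_steps + 1)
    x = x1
    for s, t in zip(times[:-1], times[1:]):
        ratio = sigma2(t) / sigma2(s)  # fraction of accumulated variance kept
        x0_hat = predict_x0(x, s)      # data prediction at the current step
        noise = rng.standard_normal(np.shape(x))
        # Sample from the bridge posterior p(x_t | x_s, x0_hat).
        x = ratio * x + (1.0 - ratio) * x0_hat + np.sqrt(sigma2(t) * (1.0 - ratio)) * noise
    return x

# Asymmetry: the noise peaks after the midpoint, closer to the prior end (t=1).
ts = np.linspace(0.0, 1.0, 101)
t_peak = ts[np.argmax(bridge_std(ts))]

# Usage with an oracle predictor that already knows the target (illustration only).
target = np.ones((80, 100))
sample = first_order_bridge_sampler(np.zeros((80, 100)), lambda x, s: target)
```

With a constant diffusion coefficient the noise would peak at `t = 0.5`; the linearly growing schedule shifts the peak toward the prior side. Because the final update lands at `t = 0`, where the accumulated variance is zero, the output of the last step is exactly the model's data prediction, which is one reason predicting data suits few-step sampling.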

Take a look at the Paper and Project. All credit for this research goes to the researchers of this project.

Tanya Malhotra is a final-year undergraduate at the University of Petroleum & Energy Studies, Dehradun, pursuing a BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with good analytical and critical thinking skills, along with a keen interest in acquiring new skills, leading groups, and managing work in an organized manner.
