
This AI Paper from NVIDIA Unveils ‘Incremental FastPitch’: Revolutionizing Real-Time Speech Synthesis with Lower Latency and High Quality


Parallel Text-to-Speech (TTS) models are commonly used for on-the-fly speech synthesis, offering greater control and faster synthesis than traditional autoregressive models. Despite these benefits, parallel models, particularly those based on transformer architectures, struggle with incremental synthesis because of their fully parallel structure. The growing prevalence of real-time and streaming applications has created demand for TTS systems that can generate speech incrementally. This adaptation is crucial for achieving lower response latency and a better user experience.

Researchers from NVIDIA Corporation propose Incremental FastPitch, a variant of FastPitch that can incrementally produce high-quality Mel chunks with low latency for real-time speech synthesis. The proposed model improves the architecture with chunk-based FFT blocks, training with receptive-field-constrained chunk attention masks, and inference with fixed-size past model states. This yields speech quality comparable to parallel FastPitch at significantly lower latency. The training procedure constrains the receptive field and explores both static and dynamic chunk masks, an exploration that is crucial to ensure the model aligns with the limited receptive field available during incremental inference.
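To make the idea of a receptive-field-constrained chunk attention mask concrete, here is a minimal NumPy sketch (not the authors' code): each position is allowed to attend only to its own chunk and a fixed number of past chunks, which is what lets the decoder run incrementally with bounded state. The function name and parameters are illustrative assumptions.

```python
import numpy as np

def chunk_attention_mask(seq_len: int, chunk_size: int, past_chunks: int) -> np.ndarray:
    """Build a receptive-field-constrained chunk attention mask.

    Position i may attend to positions in its own chunk and in up to
    `past_chunks` preceding chunks. True means attention is allowed.
    """
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        chunk_idx = i // chunk_size
        start = max(0, (chunk_idx - past_chunks) * chunk_size)
        end = (chunk_idx + 1) * chunk_size  # end of the current chunk
        mask[i, start:min(end, seq_len)] = True
    return mask

# 8 frames, chunks of 2, each chunk sees itself plus 1 past chunk.
mask = chunk_attention_mask(seq_len=8, chunk_size=2, past_chunks=1)
```

With `past_chunks=1`, row 4 (the first frame of chunk 2) attends only to positions 2 through 5, so the effective receptive field stays fixed no matter how long the utterance grows.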

https://arxiv.org/abs/2401.01755

A neural TTS system typically comprises two main components: an acoustic model and a vocoder. The pipeline first converts text into Mel-spectrograms using acoustic models such as Tacotron 2, FastSpeech, FastPitch, and GlowTTS. The Mel features are then transformed into waveforms using vocoders such as WaveNet, WaveRNN, WaveGlow, and HiFi-GAN. The study uses the Chinese Standard Mandarin Speech Corpus for training and evaluation, which contains 10,000 audio clips from a single female Mandarin speaker. The model parameters follow the open-source FastPitch implementation, with the decoder modified to use causal convolution in the position-wise feed-forward layers.
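The two-stage structure described above can be sketched as a pair of stand-in functions that show only the tensor shapes flowing through the system. Both functions are hypothetical placeholders (the upsampling factor of 5 frames per phoneme is an arbitrary toy value), not the paper's actual models.

```python
import numpy as np

def acoustic_model(phoneme_ids: np.ndarray, n_mels: int = 80) -> np.ndarray:
    """Stage 1: map a phoneme sequence to a Mel-spectrogram (frames x mel bins).

    A real acoustic model (e.g. FastPitch) predicts durations; here we fake a
    fixed upsampling of 5 frames per phoneme just to show the shapes.
    """
    n_frames = len(phoneme_ids) * 5
    return np.zeros((n_frames, n_mels))

def vocoder(mel: np.ndarray, hop_length: int = 256) -> np.ndarray:
    """Stage 2: map a Mel-spectrogram to a waveform (hop_length samples/frame)."""
    return np.zeros(mel.shape[0] * hop_length)

mel = acoustic_model(np.arange(12))  # 12 phonemes -> 60 Mel frames
audio = vocoder(mel)                 # 60 frames -> 60 * 256 samples
```

The key point for streaming is that the vocoder consumes Mel frames chunk by chunk, so an acoustic model that emits Mel chunks incrementally lets audio playback begin before the full utterance is synthesized.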

Incremental FastPitch adds chunk-based FFT blocks to the decoder to enable incremental synthesis of high-quality Mel chunks. The model is trained with receptive-field-constrained chunk attention masks, which help the decoder adapt to the limited receptive field of incremental inference, and it uses fixed-size past model states during inference to maintain Mel continuity across chunks. The Mel-spectrogram is computed from the normalized waveform with an FFT size of 1024, a hop length of 256, and a window length of 1024.
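The spectrogram settings quoted above (FFT size 1024, hop length 256, window length 1024) can be reproduced with a short NumPy STFT sketch. This is a minimal illustration of the front-end parameters, not the paper's feature extractor; a real system would additionally apply a Mel filterbank to the magnitudes.

```python
import numpy as np

def stft_magnitude(wav: np.ndarray, n_fft: int = 1024,
                   hop_length: int = 256, win_length: int = 1024) -> np.ndarray:
    """Magnitude STFT with the paper's settings (FFT 1024, hop 256, window 1024).

    Center-pads the signal by n_fft // 2 on each side, so the frame count is
    len(wav) // hop_length + 1. Returns an array of shape (frames, n_fft//2 + 1).
    """
    window = np.hanning(win_length)
    pad = n_fft // 2
    x = np.pad(wav, (pad, pad), mode="reflect")
    n_frames = 1 + (len(x) - n_fft) // hop_length
    frames = np.stack([x[i * hop_length : i * hop_length + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))

wav = np.random.default_rng(0).standard_normal(16000)  # 1 s of noise at 16 kHz
spec = stft_magnitude(wav)  # (63 frames, 513 frequency bins)
```

With a 256-sample hop, one second of 16 kHz audio yields roughly 62 frames, which is why the hop length directly sets the granularity at which Mel chunks can be handed to the vocoder.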


Experimental results show that Incremental FastPitch produces speech quality comparable to parallel FastPitch at significantly lower latency, making it suitable for real-time speech applications. A visualized ablation study demonstrates that Incremental FastPitch generates Mel-spectrograms with almost no observable difference from parallel FastPitch, highlighting the effectiveness of the proposed design.

In conclusion, Incremental FastPitch, a variant of FastPitch, enables incremental synthesis of high-quality Mel chunks with low latency for real-time speech applications. By combining chunk-based FFT blocks, receptive-field-constrained chunk attention masks during training, and fixed-size past model states during inference, it matches the speech quality of parallel FastPitch while responding significantly faster. Incremental FastPitch thus offers a faster, more controllable speech synthesis process and a promising approach for real-time applications.


Check out the Paper. All credit for this research goes to the researchers of this project.



Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.


