
The recent developments and progress in the capabilities of large language models have played an important role in the advancement of LLM-based frameworks for audio generation and speech synthesis tasks, especially in the zero-shot setting. Traditional speech synthesis frameworks have witnessed significant advancements as a result of integrating additional features like neural audio codecs for discrete audio and speech units. Although these speech and audio synthesis frameworks deliver satisfactory results, there is still room for improvement, as current LLM-based audio frameworks have the following three major limitations:
- They tend to generate audio output autoregressively, which ultimately causes a lack of robustness and slow inference speeds, and leads to mispronunciation, skipping, or repeating.
- They tend to over-rely on discrete speech units or pre-trained neural audio codecs.
- They often require a considerable amount of training data.
To tackle the problems mentioned above and improve the capabilities of LLM-based audio and speech synthesis models, developers have come up with HierSpeech++, a robust and efficient zero-shot speech synthesizer for voice conversion and text-to-speech (TTS). The HierSpeech++ framework builds upon the learnings of hierarchical speech synthesis frameworks, which not only boosts the robustness but also adds to the expressiveness of the synthesized speech output, while improving the naturalness and speaker similarity of artificially generated speech even in a zero-shot setting.
In this article, we will talk about the HierSpeech++ framework in detail, and take a look at the model's architecture, how it works, and its results compared against state-of-the-art text and audio generation models. So let's get started.
HierSpeech++ is a fast, robust, and efficient zero-shot speech synthesis framework that uses a hierarchical speech synthesis pipeline. By adopting this end-to-end speech synthesis approach, the HierSpeech++ model is able to maximize the potential of high-quality waveform generation and hierarchically bridge the gap between semantic and acoustic representations by adopting a self-supervised speech representation as the semantic speech representation, thus attempting to resolve the current limitations of style adaptation. The end-to-end speech synthesis framework was first introduced by the VITS model, and it adopts a VAE or Variational AutoEncoder augmented with adversarial training and normalizing flows. Moreover, VAE-based frameworks with an end-to-end training pipeline have the potential to generate high-quality waveform audio, with the perceptual speech synthesis quality being significantly higher than that of other speech synthesis frameworks.
The audio reconstruction quality of these frameworks can be enhanced further by utilizing a hierarchical conditional Variational AutoEncoder, as done in the HierSpeech framework. Despite their potential, models based on an end-to-end training pipeline have certain limitations, especially in a zero-shot setting: although they can synthesize speech samples with high audio quality, achieving speaker similarity in zero-shot voice cloning tasks is still riddled with high computational complexity. On the other hand, diffusion-based speech synthesis models perform well in terms of speaker adaptation, but they are still far from perfect, as they make use of an iterative generation process that slows down inference, they are often vulnerable to noisy data, and as a result of the train-inference mismatch of the two-stage generation process between the Mel-spectrogram and the generated ground truth, the audio quality is not up to par.
To tackle the problems faced by its predecessors, the HierSpeech++ model employs a hierarchical speech synthesizer, a speech super-resolution module, and a text-to-vec component, and introduces an improved hierarchical speech synthesizer built on a hierarchical conditional VAE or Variational AutoEncoder. In an attempt to raise the audio quality beyond perceptual quality, the HierSpeech++ framework adopts a dual-audio acoustic encoder to boost the acoustic posterior, and enhances out-of-distribution generalization by employing a hierarchical adaptive generator equipped with both conditional and unconditional generation. Moreover, to disentangle speech components and enhance speaker-related and speaker-agnostic semantic information, the HierSpeech++ framework also adopts a source-filter theory-based multi-path semantic encoder. As a result of employing a Variational AutoEncoder, the HierSpeech++ model can connect and learn representations hierarchically, and progressively adapt to the target voice style to infer the waveform audio. Furthermore, the HierSpeech++ framework deploys a bidirectional network of normalizing flow Transformers in an attempt to improve adaptation and reduce the mismatch between training and inference.
Overall, the HierSpeech++ model is a fully-parallel, novel, and robust hierarchical speech synthesis framework aimed at synthesizing speech samples in a zero-shot setting, and it attempts to make the following contributions:
- Using a hierarchical speech synthesis framework to control and transfer voice style and prosody.
- Enabling data scalability and high-resolution speech synthesis by upsampling the waveform audio from 16 kHz to 48 kHz.
- Achieving human-level quality on zero-shot voice conversion and text-to-speech tasks.
HierSpeech++: Model Components and Architecture
As discussed, HierSpeech++ is a zero-shot speech synthesis model that attempts to achieve human-level accuracy in terms of voice similarity and speech naturalness.
The HierSpeech++ model consists of several components, including a hierarchical speech synthesizer, a speech super-resolution module, and a text-to-vec or TTV module, which work in sync with each other so that each component can be trained to effectively utilize a large amount of low-resolution speech data for voice cloning. Let's break down the framework and talk about each component.
Speech Representations
Since the human voice frequency band is under 4 kHz, the HierSpeech++ framework downsamples the audio to 16 kHz for speech synthesis. Moreover, to reconstruct the voice signal, it is necessary to sample at a rate of at least double the highest frequency component of the voice. To achieve enhanced perceptual quality, the HierSpeech++ framework makes use of a speech super-resolution or SpeechSR component to upsample the audio from 16 kHz to 48 kHz, and it uses low-resolution representations for both semantic and acoustic representations.
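As a minimal, hedged sketch of the sampling rates involved (the file name and the use of torchaudio are our own choices, not part of the HierSpeech++ codebase):

```python
import torchaudio
import torchaudio.functional as F

# Load an arbitrary speech clip (the path is a placeholder).
waveform, sr = torchaudio.load("speech_sample.wav")

# Downsample to 16 kHz for synthesis: the voice band sits below 4 kHz, so
# 16 kHz comfortably satisfies the requirement of sampling at least twice
# the highest frequency component.
wave_16k = F.resample(waveform, orig_freq=sr, new_freq=16000)

# After synthesis, a super-resolution stage (SpeechSR in HierSpeech++) maps the
# 16 kHz output up to 48 kHz. A naive resample is shown here only to illustrate
# the rates involved; SpeechSR itself is a learned neural upsampler.
wave_48k = F.resample(wave_16k, orig_freq=16000, new_freq=48000)
```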
For acoustic representations, a conventional text-to-speech or TTS framework employs a Mel-spectrogram as its intermediate acoustic feature, which is transformed from the waveform with the help of an STFT or Short-Time Fourier Transform. However, it is worth noting that acoustic features are rich representations comprising various attributes including content and pronunciation, voice information, and more, which makes it difficult for the framework to infer these representations, a situation that often results in mispronunciation, lack of similarity, or over-smoothing of the speech.
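To make the conventional TTS pipeline concrete, here is a small sketch of extracting a Mel-spectrogram from a 16 kHz waveform with torchaudio; the FFT, hop, and Mel-band settings are common defaults rather than HierSpeech++'s exact configuration.

```python
import torch
import torchaudio.transforms as T

# Illustrative STFT/Mel settings for 16 kHz audio (not HierSpeech++'s actual values).
mel_transform = T.MelSpectrogram(
    sample_rate=16000,
    n_fft=1024,       # STFT window length
    hop_length=256,   # frame shift
    n_mels=80,        # number of Mel bands
)

wave_16k = torch.randn(1, 16000)   # one second of dummy audio
mel = mel_transform(wave_16k)      # shape: (1, 80, n_frames)
print(mel.shape)
```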
Moving along, to extract a continuous semantic representation from the waveform, the HierSpeech++ framework utilizes a multilingual Wav2Vec model, in contrast to the popular monolingual self-supervised speech representations used for semantic representations. Although the monolingual approach is a good choice for a rich monolingual model, it affects the zero-shot voice cloning abilities of a model in terms of both robustness and expressiveness, especially on multilingual speech synthesis tasks.
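As a rough illustration of how a continuous semantic representation can be pulled from an intermediate layer of a Wav2Vec-style model, here is a hedged sketch using the Hugging Face transformers API; the checkpoint name and the choice of layer are placeholders, not necessarily the exact ones used by HierSpeech++.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Placeholder checkpoint: HierSpeech++ relies on an MMS-style multilingual
# Wav2Vec 2.0 model, but the exact checkpoint may differ from this one.
model = Wav2Vec2Model.from_pretrained("facebook/mms-300m")
feature_extractor = Wav2Vec2FeatureExtractor()  # defaults target 16 kHz input

wave_16k = torch.randn(16000)  # one second of dummy 16 kHz audio
inputs = feature_extractor(wave_16k.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Use an intermediate transformer layer as the continuous semantic representation;
# the specific layer index here is arbitrary, chosen only to illustrate the idea.
middle = len(outputs.hidden_states) // 2
semantic = outputs.hidden_states[middle]  # shape: (batch, frames, hidden_dim)
```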
Hierarchical Speech Synthesizer
The Hierarchical Speech Synthesizer component is the cornerstone of the HierSpeech++ framework, as it allows the module to be trained without any labels like text transcripts or speaker IDs, relying solely on speech data. To increase the acoustic capacity, previous state-of-the-art speech synthesis models replaced the Mel-spectrogram with a linear spectrogram; however, this approach still falls short in terms of pitch periodicity, PESQ, voiced/unvoiced score, and even Mel-spectrogram distance. To resolve the challenges presented by using a linear spectrogram, the Hierarchical Speech Synthesizer employs a Dual-audio Acoustic Encoder designed to capture richer and more comprehensive acoustic representations. The framework also employs a waveform encoder to distill information from the raw waveform audio, concatenates it with the linear spectrogram representation, and finally projects the concatenated representation as the acoustic representation.
Moreover, to handle speaker-agnostic and speaker-related semantic representations, the HierSpeech++ framework utilizes a multi-path self-supervised speech representation, where each individual representation is used for hierarchical style adaptation, and the semantic representations are extracted from the middle layer of MMS to obtain linguistic information. The framework also utilizes the fundamental frequency to improve speech disentanglement, which makes it possible to control the pitch contour manually. It further uses a linguistic representation as conditional information to generate waveform audio hierarchically, relying on an enhanced linguistic representation of the self-supervised representation. It is also worth noting that the acoustic representations extracted during training from the waveform and the linear spectrogram are used to reconstruct the raw waveform audio, and hierarchical variational inference is used to link the acoustic representations with the multi-path linguistic representations. The framework also employs a hierarchical adaptive generator (HAG) to generate semantic-to-waveform samples, and the generated representations, comprising a style representation and an acoustic representation, are fed to the source and waveform generators.
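To make the dual-audio idea more tangible, here is a simplified, hedged sketch of an encoder that fuses a waveform branch and a linear-spectrogram branch into a single acoustic posterior; all layer sizes and module choices are illustrative and do not reproduce HierSpeech++'s actual implementation.

```python
import torch
import torch.nn as nn

class DualAudioAcousticEncoder(nn.Module):
    """Simplified sketch of a dual-audio acoustic encoder: one branch encodes the
    raw waveform, the other the linear spectrogram, and the concatenated features
    are projected to the mean and log-variance of the acoustic posterior."""

    def __init__(self, spec_bins=513, hidden=192, latent=192, hop=256):
        super().__init__()
        # Waveform branch: strided 1-D convolutions reduce the raw waveform to frame rate.
        self.wave_encoder = nn.Sequential(
            nn.Conv1d(1, hidden, kernel_size=hop * 2, stride=hop, padding=hop // 2),
            nn.LeakyReLU(0.1),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
        )
        # Spectrogram branch: frame-level convolutions over the linear-spectrogram bins.
        self.spec_encoder = nn.Sequential(
            nn.Conv1d(spec_bins, hidden, kernel_size=5, padding=2),
            nn.LeakyReLU(0.1),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
        )
        # Project the concatenated representation to posterior statistics.
        self.proj = nn.Conv1d(hidden * 2, latent * 2, kernel_size=1)

    def forward(self, waveform, linear_spec):
        # waveform: (B, 1, T_samples), linear_spec: (B, spec_bins, T_frames)
        wave_feat = self.wave_encoder(waveform)
        spec_feat = self.spec_encoder(linear_spec)
        frames = min(wave_feat.size(-1), spec_feat.size(-1))   # align frame counts
        feats = torch.cat([wave_feat[..., :frames], spec_feat[..., :frames]], dim=1)
        mean, log_var = self.proj(feats).chunk(2, dim=1)
        return mean, log_var
```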
Text to Vec
For text-to-speech synthesis, the HierSpeech++ framework employs a text-to-vec or TTV model that generates a fundamental frequency and a semantic representation from a text sequence, and it utilizes a monotonic alignment search coupled with a variational autoencoder to align the speech and text internally. The HierSpeech++ framework then replaces the linear spectrogram with a self-supervised speech representation, and reconstructs the same representation to serve as the output of the TTV.
Moreover, the HierSpeech++ framework predicts the fundamental frequency at four times the resolution of the self-supervised speech representations, and makes use of a conditional text representation as the prior information. Thanks to the semantic information of self-supervised speech representations, the framework is able to transfer prosody style within the text-to-vec model, and it feeds a latent representation to the phoneme encoder to enhance the linguistic capability of the representation, as shown in the sketch below.
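As a toy illustration of the four-times resolution gap between the semantic frames and the predicted fundamental frequency, the sketch below upsamples a frame-level representation before predicting one F0 value per upsampled frame; the module and its sizes are hypothetical, not the TTV's actual predictor.

```python
import torch
import torch.nn as nn

class F0Predictor(nn.Module):
    """Toy sketch: predict F0 at four times the temporal resolution of the
    semantic (self-supervised) representation. Sizes are illustrative only."""

    def __init__(self, hidden=256):
        super().__init__()
        self.upsample = nn.Upsample(scale_factor=4, mode="nearest")
        self.net = nn.Sequential(
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2),
            nn.LeakyReLU(0.1),
            nn.Conv1d(hidden, 1, kernel_size=1),   # one F0 value per upsampled frame
        )

    def forward(self, semantic_frames):
        # semantic_frames: (B, hidden, T) at the self-supervised frame rate
        x = self.upsample(semantic_frames)   # (B, hidden, 4T)
        return self.net(x)                   # (B, 1, 4T) predicted F0 contour
```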
SpeechSR or Speech Super Resolution
In terms of data efficiency and availability, the HierSpeech++ framework trains on a relatively low-resolution dataset and up-samples the low-resolution speech waveform to a high-resolution one, from 16 kHz to 48 kHz. The framework also replaces the transposed convolution with a nearest-neighbor upsampler, which has previously been shown to alleviate the artifacts caused by transposed convolutions.
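The sketch below contrasts the two upsampling options in simplified form: a transposed convolution versus a nearest-neighbor upsample followed by a regular convolution, both performing the 3x hop from 16 kHz to 48 kHz. It is illustrative only; the real SpeechSR uses an AMP block rather than a single convolution, and the layer sizes here are our own.

```python
import torch
import torch.nn as nn

# 3x temporal upsampling, as needed to go from 16 kHz to 48 kHz.

# Transposed convolution: learnable, but prone to checkerboard-style artifacts.
transposed_up = nn.ConvTranspose1d(in_channels=32, out_channels=32,
                                   kernel_size=7, stride=3, padding=2)

# Nearest-neighbor upsample followed by a regular convolution: the alternative
# SpeechSR adopts to alleviate those artifacts (layer sizes are illustrative).
nn_up = nn.Sequential(
    nn.Upsample(scale_factor=3, mode="nearest"),
    nn.Conv1d(32, 32, kernel_size=7, padding=3),
)

x = torch.randn(1, 32, 16000)      # one second of hidden features at 16 kHz rate
print(transposed_up(x).shape)      # -> (1, 32, 48000)
print(nn_up(x).shape)              # -> (1, 32, 48000)
```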
Architecture
The content encoder of the text-to-vec model consists of 16 non-causal WaveNet layers with a kernel size of 5 and a hidden size of 256, whereas the content decoder consists of 8 non-causal WaveNet layers with a kernel size of 5 and a hidden size of 512. The text encoder component consists of three prosody-conditional Transformer networks and three unconditional Transformer networks with a kernel size of 9, a filter size of 1024, and a hidden size of 256, with the text encoder using a dropout rate of 0.2. To encode adjacent information and to enhance prosody style adaptation, the framework adopts a CNN with a kernel size of 5 inside the Transformer blocks. The SpeechSR, on the other hand, comprises a single AMP block with 32 initial channels and no upsampling layer. The framework makes use of a nearest-neighbor upsampler to upsample the hidden representations, and utilizes an MPD as the discriminator with six different window sizes and four sub-band discriminators.
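For reference, the hyperparameters above can be collected into a configuration sketch; the dictionary keys are our own shorthand, and only the values come from the description above.

```python
# Hypothetical configuration dictionaries summarizing the hyperparameters above;
# key names are illustrative shorthand, only the values follow the description.
ttv_config = {
    "content_encoder": {"layers": 16, "type": "non-causal WaveNet", "kernel_size": 5, "hidden_size": 256},
    "content_decoder": {"layers": 8, "type": "non-causal WaveNet", "kernel_size": 5, "hidden_size": 512},
    "text_encoder": {
        "prosody_conditional_transformers": 3,
        "unconditional_transformers": 3,
        "kernel_size": 9,
        "filter_size": 1024,
        "hidden_size": 256,
        "dropout": 0.2,
        "conv_kernel_size": 5,   # CNN inside the Transformer blocks for adjacent context
    },
}

speechsr_config = {
    "amp_blocks": 1,
    "initial_channels": 32,
    "upsampling_layer": None,   # nearest-neighbor upsampling is used instead
    "discriminator": {"type": "MPD", "window_sizes": 6, "sub_band_discriminators": 4},
}
```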
The above figure demonstrates the inference pipeline of the HierSpeech++ framework, which starts by extracting the semantic representations from the audio at 16 kHz, and the fundamental frequency using the YAAPT algorithm. Before the fundamental frequency can be fed to the Hierarchical Synthesizer, it is normalized using the mean and standard deviation of the source audio, and the normalized fundamental frequency is then denormalized using the mean and standard deviation of the target audio. For text-to-speech, the HierSpeech++ framework extracts textual representations instead of speech representations, and employs the text-to-vec model to generate a semantic representation from the prosody prompt.
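Below is a minimal sketch of the fundamental-frequency style-matching step described above; the function name is ours, and the F0 extraction itself (e.g. via YAAPT) is only stubbed with placeholder arrays.

```python
import numpy as np

def match_f0_statistics(source_f0: np.ndarray, target_f0: np.ndarray) -> np.ndarray:
    """Normalize the source F0 contour with its own mean/std, then denormalize it
    with the target speaker's mean/std. Unvoiced frames (F0 == 0) are ignored when
    computing statistics and left untouched."""
    voiced_src = source_f0[source_f0 > 0]
    voiced_tgt = target_f0[target_f0 > 0]

    src_mean, src_std = voiced_src.mean(), voiced_src.std()
    tgt_mean, tgt_std = voiced_tgt.mean(), voiced_tgt.std()

    converted = source_f0.copy()
    voiced = source_f0 > 0
    converted[voiced] = (source_f0[voiced] - src_mean) / (src_std + 1e-8) * tgt_std + tgt_mean
    return converted

# source_f0 and target_f0 would come from an F0 extractor such as YAAPT applied
# to the 16 kHz source and target audio (placeholder arrays used here).
source_f0 = np.abs(np.random.randn(200)) * 40 + 120
target_f0 = np.abs(np.random.randn(200)) * 30 + 200
converted_f0 = match_f0_statistics(source_f0, target_f0)
```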
Experiment and Results
The framework utilizes the publicly available LibriTTS dataset to train the hierarchical synthesizer component, with the first step being training the model on the train-clean subsets of the dataset, and then utilizing the remaining data to enable enhanced transfer of the voice style. Moreover, to improve diversity and robustness, the framework scales up the dataset as demonstrated in the following figure.
Reconstruction, Resynthesis Tasks, and Voice Conversion
To evaluate the performance of the HierSpeech++ framework on reconstruction and resynthesis tasks, the developers used seven objective metrics, and the results are demonstrated in the following figures for the reconstruction and resynthesis tasks, respectively.
For voice conversion tasks, the framework uses two subjective metrics for evaluation, voice similarity MOS or sMOS and naturalness mean opinion score or nMOS, along with three naturalness objective metrics and two similarity objective metrics.
Moving along, the primary aim of the HierSpeech++ framework is to enable zero-shot speech synthesis, and to evaluate its performance in the zero-shot setting, it is compared against other baseline models like AutoVC, VoiceMixer, diffusion-based models, and many more, with the results demonstrated in the following figure.
The following figures demonstrate the zero-shot text-to-speech results with noisy prompts and very noisy prompts, respectively.
Final Thoughts
In this article, we have talked about the HierSpeech++ model, a novel approach to enable robust and effective speech synthesis in a zero-shot setting and overcome the limitations faced by current speech synthesis frameworks, including their need for large amounts of training data, their over-reliance on discrete speech units or pre-trained neural audio codecs, and their tendency to generate audio output autoregressively, which ultimately causes a lack of robustness and slow inference speeds and leads to mispronunciation, skipping, or repeating. The HierSpeech++ model is a fully-parallel, novel, and robust hierarchical speech synthesis framework aimed at synthesizing speech samples in a zero-shot setting, and it attempts to make the following contributions:
- Using a hierarchical speech synthesis framework to control and transfer voice style and prosody.
- Enabling data scalability and high-resolution speech synthesis by upsampling the waveform audio from 16 kHz to 48 kHz.
- Achieving human-level quality on zero-shot voice conversion and text-to-speech tasks.