In Text-to-Speech (TTS) synthesis, Instant Voice Cloning (IVC) enables a TTS model to clone the voice of any reference speaker from a brief audio sample, without requiring additional training on the reference speaker. This approach is also referred to as Zero-Shot Text-to-Speech Synthesis. Instant Voice Cloning allows for flexible customization of the generated voice and demonstrates significant value across a wide range of real-world applications, including customized chatbots, content creation, and interactions between humans and Large Language Models (LLMs).
Although present voice cloning frameworks do their job well, they face a couple of open challenges. The first is Flexible Voice Style Control: models lack the ability to flexibly manipulate voice styles after cloning the voice. The other major roadblock for current instant voice cloning frameworks is Zero-Shot Cross-Lingual Voice Cloning: for training purposes, current models require access to an extensive massive-speaker multi-lingual (MSML) dataset for every language they support.
To tackle these issues and contribute to the advancement of instant voice cloning models, developers have built OpenVoice, a versatile instant voice cloning framework that replicates the voice of any user and generates speech in multiple languages from a brief audio clip of the reference speaker. OpenVoice demonstrates that instant voice cloning models can replicate the tone color of the reference speaker while achieving granular control over voice styles including accent, rhythm, intonation, pauses, and even emotions. Even more impressive, the OpenVoice framework demonstrates remarkable zero-shot cross-lingual voice cloning for languages outside the MSML dataset, allowing OpenVoice to clone voices into new languages without extensive pre-training for those languages. OpenVoice delivers superior instant voice cloning results while remaining computationally viable, with operating costs up to 10 times lower than currently available APIs with inferior performance.
In this article, we will discuss the OpenVoice framework in depth and uncover the architecture that enables it to deliver superior performance across instant voice cloning tasks. So let's start.
As mentioned earlier, Instant Voice Cloning, also known as Zero-Shot Text-to-Speech Synthesis, allows a TTS model to clone the voice of any reference speaker from a brief audio sample without any additional training on the reference speaker. Instant voice cloning has always been a hot research topic, with existing works including the XTTS and VALL-E frameworks, which extract speaker embeddings and/or acoustic tokens from the reference audio to serve as a condition for an auto-regressive model. The auto-regressive model then generates acoustic tokens sequentially and decodes these tokens into a raw audio waveform.
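To make this autoregressive pipeline concrete, here is a minimal conceptual sketch in Python. The `encoder`, `ar_model`, and `codec_decoder` objects and their methods are hypothetical stand-ins for illustration, not the actual XTTS or VALL-E APIs.

```python
# Conceptual sketch of the autoregressive IVC pipeline (hypothetical
# component names; not the actual XTTS or VALL-E implementation).
def autoregressive_voice_clone(text, reference_audio, encoder, ar_model, codec_decoder):
    # Extract a speaker embedding / acoustic tokens from the short reference
    # clip; this conditions everything the autoregressive model generates.
    speaker_condition = encoder.encode(reference_audio)

    # Generate acoustic tokens one step at a time, conditioned on the text
    # and the reference speaker representation.
    tokens = []
    state = ar_model.init_state(text, speaker_condition)
    while not ar_model.is_finished(state):
        token, state = ar_model.step(state)
        tokens.append(token)

    # Decode the discrete acoustic tokens back into a raw audio waveform.
    return codec_decoder.decode(tokens)
```

The sequential token-by-token loop is exactly what makes this family of models slow at inference time, which is the weakness discussed next.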
Although auto-regressive instant voice cloning models clone the tone color remarkably well, they fall short in manipulating other style parameters including accent, emotion, pauses, and rhythm. Moreover, auto-regressive models suffer from low inference speed, and their operational costs are quite high. Existing approaches like the YourTTS framework employ a non-autoregressive design that achieves significantly faster inference than autoregressive frameworks, but they still cannot provide users with flexible control over style parameters. Furthermore, both autoregressive and non-autoregressive instant voice cloning frameworks need access to a large massive-speaker multi-lingual (MSML) dataset for cross-lingual voice cloning.
To tackle the challenges faced by current instant voice cloning frameworks, developers have built OpenVoice, an open-source instant voice cloning library that aims to resolve the following challenges faced by current IVC frameworks.
- The first challenge is to give IVC frameworks flexible control over style parameters in addition to tone color, including accent, rhythm, intonation, and pauses. Style parameters are crucial for generating natural, in-context conversations and speech, rather than narrating the input text monotonously.
- The second challenge is to enable IVC frameworks to clone cross-lingual voices in a zero-shot setting.
- The final challenge is to achieve high real-time inference speeds without degrading quality.
To tackle the first two hurdles, the architecture of the OpenVoice framework is designed to decouple the components of a voice as much as possible. OpenVoice generates tone color, language, and other voice features independently, enabling the framework to flexibly manipulate individual languages and voice styles. The OpenVoice framework tackles the third challenge by default, since the decoupled structure reduces computational complexity and model size requirements.
OpenVoice: Methodology and Architecture
The technical design of the OpenVoice framework is effective and surprisingly easy to implement. It is no secret that cloning the tone color of any speaker, adding new languages, and enabling flexible control over voice parameters simultaneously can be difficult, because executing these three tasks jointly requires the controlled parameters to intersect across a large combinatorial dataset. In contrast, in regular single-speaker text-to-speech synthesis, where no voice cloning is required, it is much easier to add control over other style parameters. Building on this observation, the OpenVoice framework decouples the instant voice cloning task into subtasks: it uses a base speaker text-to-speech model to control the language and style parameters, and employs a tone color converter to embed the reference tone color into the generated voice. The following figure demonstrates the architecture of the framework.
At its core, the OpenVoice framework employs two components: a base speaker text-to-speech (TTS) model and a tone color converter. The base speaker TTS model is either a single-speaker or a multi-speaker model allowing precise control over style parameters, language, and accent. The model generates a voice that is then passed to the tone color converter, which changes the base speaker's tone color to the tone color of the reference speaker.
The OpenVoice framework offers a great deal of flexibility in the choice of the base speaker TTS model: it can employ the VITS model with a slight modification allowing it to accept language and style embeddings in its duration predictor and text encoder. The framework can also employ cheap commercial models like Microsoft TTS, or deploy models like InstructTTS that are able to accept style prompts. At the moment, the OpenVoice framework employs the VITS model, although the other models are also a feasible option.
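As a concrete illustration of the two-stage pipeline, below is a minimal usage sketch in the style of the demo scripts in the OpenVoice repository (https://github.com/myshell-ai/OpenVoice). The checkpoint paths, module names, and method signatures follow the repository's first release and may differ between versions, so treat this as an assumption-laden sketch rather than a definitive API reference.

```python
import torch
import se_extractor
from api import BaseSpeakerTTS, ToneColorConverter

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Component 1: the base speaker TTS model controls language and style.
base_speaker_tts = BaseSpeakerTTS('checkpoints/base_speakers/EN/config.json', device=device)
base_speaker_tts.load_ckpt('checkpoints/base_speakers/EN/checkpoint.pth')

# Component 2: the tone color converter swaps in the reference speaker's timbre.
tone_color_converter = ToneColorConverter('checkpoints/converter/config.json', device=device)
tone_color_converter.load_ckpt('checkpoints/converter/checkpoint.pth')

# Tone color embeddings: a precomputed one for the base speaker, and one
# extracted from a short clip of the reference speaker.
source_se = torch.load('checkpoints/base_speakers/EN/en_default_se.pth').to(device)
target_se, _ = se_extractor.get_se('reference_speaker.mp3', tone_color_converter,
                                   target_dir='processed', vad=True)

# Stage 1: synthesize speech with a controllable style and language.
src_path = 'tmp.wav'
base_speaker_tts.tts('OpenVoice is a versatile instant voice cloning framework.',
                     src_path, speaker='default', language='English', speed=1.0)

# Stage 2: re-render the audio with the reference speaker's tone color.
tone_color_converter.convert(audio_src_path=src_path, src_se=source_se,
                             tgt_se=target_se, output_path='output.wav')
```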
Coming to the second component, the tone color converter is an encoder-decoder structure with an invertible normalizing flow in the middle. The encoder is a one-dimensional CNN that accepts the short-time Fourier transformed spectrum of the base speaker TTS model's output as its input and generates feature maps as output. The tone color extractor is a simple two-dimensional CNN that operates on the mel-spectrogram of the input voice and outputs a single feature vector encoding the tone color information. The normalizing flow layers take the feature maps generated by the encoder as input and produce a feature representation that preserves all style properties but eliminates the tone color information. The OpenVoice framework then applies the normalizing flow layers in the inverse direction, taking this feature representation together with the tone color embedding of the reference speaker as input and producing features that carry the reference tone color. Finally, the framework decodes these features into a raw waveform using a stack of transposed one-dimensional convolutions.
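The following PyTorch sketch illustrates the shape of this encoder-flow-decoder design. The layer sizes, the affine coupling used for the invertible flow, and the decoder stack are simplifying assumptions for illustration; they are not the exact OpenVoice implementation.

```python
import torch
import torch.nn as nn

class ToneColorExtractor(nn.Module):
    """2D CNN over a mel-spectrogram -> a single tone color vector."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # pool over frequency and time
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, mel):  # mel: (batch, 1, n_mels, frames)
        return self.proj(self.conv(mel).flatten(1))  # (batch, embed_dim)

class CouplingFlowStep(nn.Module):
    """One invertible affine coupling step, conditioned on a tone color vector."""
    def __init__(self, channels=192, embed_dim=256):
        super().__init__()
        self.net = nn.Conv1d(channels // 2 + embed_dim, channels, 1)

    def _shift_scale(self, half, se, frames):
        cond = se.unsqueeze(-1).expand(-1, -1, frames)  # broadcast over time
        return self.net(torch.cat([half, cond], dim=1)).chunk(2, dim=1)

    def forward(self, x, se):  # forward direction: strip tone color information
        xa, xb = x.chunk(2, dim=1)
        shift, log_scale = self._shift_scale(xa, se, x.size(-1))
        return torch.cat([xa, (xb - shift) * torch.exp(-log_scale)], dim=1)

    def inverse(self, z, se):  # inverse direction: inject a tone color
        za, zb = z.chunk(2, dim=1)
        shift, log_scale = self._shift_scale(za, se, z.size(-1))
        return torch.cat([za, zb * torch.exp(log_scale) + shift], dim=1)

class ToneColorConverterSketch(nn.Module):
    def __init__(self, spec_bins=513, channels=192, embed_dim=256, n_flows=4):
        super().__init__()
        self.encoder = nn.Conv1d(spec_bins, channels, 5, padding=2)  # 1D CNN on the STFT spectrum
        self.extractor = ToneColorExtractor(embed_dim)
        self.flows = nn.ModuleList(CouplingFlowStep(channels, embed_dim)
                                   for _ in range(n_flows))
        self.decoder = nn.Sequential(  # transposed 1D convolutions -> waveform
            nn.ConvTranspose1d(channels, 64, 16, stride=8, padding=4), nn.ReLU(),
            nn.ConvTranspose1d(64, 1, 16, stride=8, padding=4),
        )

    def forward(self, src_spec, src_mel, ref_mel):
        h = self.encoder(src_spec)         # feature maps of the base speaker's audio
        src_se = self.extractor(src_mel)   # tone color of the base speaker
        ref_se = self.extractor(ref_mel)   # tone color of the reference speaker
        for flow in self.flows:            # remove the base speaker's tone color
            h = flow(h, src_se)
        for flow in reversed(self.flows):  # add the reference speaker's tone color
            h = flow.inverse(h, ref_se)
        return self.decoder(h)             # raw waveform
```

Because each coupling step is exactly invertible, running the same layers forward with the source embedding and backward with the reference embedding is what lets one module both remove and re-inject tone color.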
The entire architecture of the OpenVoice framework is feed-forward, without any auto-regressive component. The tone color converter is similar to voice conversion on a conceptual level, but differs in functionality, training objectives, and the inductive bias in the model structure. The normalizing flow layers share the same structure as flow-based text-to-speech models, but differ in functionality and training objectives.
Moreover, while there exist different approaches to extracting feature representations, the method implemented by the OpenVoice framework delivers better audio quality. It is also worth noting that the OpenVoice framework has no intention of inventing new components in the model architecture; rather, both of its main components, the tone color converter and the base speaker TTS model, are sourced from existing works. The primary aim of the OpenVoice framework is to form a decoupled architecture that separates language control and voice style from tone color cloning. Although the approach is quite simple, it is remarkably effective, especially on tasks that control styles and accents or that generalize to new languages. Achieving the same control with a coupled framework requires an enormous amount of computing and data, and it does not generalize well to new languages.
At its core, the main philosophy of the OpenVoice framework is to decouple the generation of language and voice styles from the generation of tone color. One of the major strengths of the OpenVoice framework is that the cloned voice is fluent and of high quality as long as the single-speaker TTS model speaks fluently.
OpenVoice: Experiment and Results
Evaluating voice cloning is a difficult objective for numerous reasons. For starters, existing works often employ different training and test data, which makes comparisons between them intrinsically unfair. Although crowd-sourcing can be used to evaluate metrics like the Mean Opinion Score, the difficulty and diversity of the test data significantly influence the overall outcome. Second, different voice cloning methods use different training data, and the diversity and scale of this data influence the results significantly. Finally, the primary objectives of existing works often differ from each other, and hence so does their functionality.
For the three reasons mentioned above, it is unfair to compare existing voice cloning frameworks numerically. Instead, it makes much more sense to compare these methods qualitatively.
Accurate Tone Color Cloning
To investigate its performance, the developers constructed a test set in which anonymous individuals, game characters, and celebrities form the reference speaker base, covering a wide voice distribution that includes both neutral samples and unique expressive voices. The OpenVoice framework is able to clone the reference tone color and generate speech in multiple languages and accents for any of the reference speakers and the four base speakers.
Flexible Control on Voice Styles
One of the objectives of the OpenVoice framework is to control speech styles flexibly using the tone color converter, which can modify the tone color while preserving all other voice features and properties.
Experiments indicate that the model preserves the voice styles after converting to the reference tone color. In some cases, however, the model slightly neutralizes the emotion, an issue that can be resolved by passing less information to the flow layers so that they are unable to strip the emotion out. The OpenVoice framework is able to preserve the styles of the base voice thanks to its use of a tone color converter, which allows the framework to manipulate the base speaker TTS model to easily control the voice styles.
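Reusing the objects from the earlier usage sketch, style control can be illustrated as follows: the style is selected at the base speaker TTS stage, and the tone color converter then preserves it while swapping the timbre. The style names below are taken from the OpenVoice demo for its English base speaker and are an assumption; available styles may vary by checkpoint.

```python
# Style is chosen at the base speaker TTS stage; the tone color converter
# preserves it while replacing the timbre. Style names follow the OpenVoice
# demo and may vary between checkpoints (assumption).
for style in ['default', 'whispering', 'cheerful', 'sad']:
    src_path = f'tmp_{style}.wav'
    base_speaker_tts.tts('Did you ever hear a folk tale about a giant turtle?',
                         src_path, speaker=style, language='English', speed=0.9)
    tone_color_converter.convert(audio_src_path=src_path, src_se=source_se,
                                 tgt_se=target_se, output_path=f'output_{style}.wav')
```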
Cross-Lingual Voice Cloning
The OpenVoice framework does not include any massive-speaker data for unseen languages, yet it is able to achieve cross-lingual voice cloning in a zero-shot setting. The cross-lingual voice cloning capabilities of the OpenVoice framework are twofold:
- The model is able to clone the tone color of the reference speaker accurately when the language of the reference speaker is unseen in the massive-speaker multi-lingual (MSML) dataset.
- Moreover, when the language of the generated speech is unseen in the MSML dataset, the OpenVoice framework is capable of cloning the voice of the reference speaker and speaking in that language, on the condition that the base speaker text-to-speech model supports the language (see the sketch below).
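A minimal sketch of the second case, again following the repository's demo conventions (the Chinese checkpoint paths and the `language='Chinese'` argument are assumptions): the reference clip can be in English, while a Chinese base speaker model carries the language, so the cloned voice speaks Chinese.

```python
# Cross-lingual cloning sketch: an English reference speaker, a Chinese base
# speaker TTS model. Checkpoint paths and arguments are assumptions based on
# the repository's demo layout.
zh_base_tts = BaseSpeakerTTS('checkpoints/base_speakers/ZH/config.json', device=device)
zh_base_tts.load_ckpt('checkpoints/base_speakers/ZH/checkpoint.pth')
zh_source_se = torch.load('checkpoints/base_speakers/ZH/zh_default_se.pth').to(device)

src_path = 'tmp_zh.wav'
zh_base_tts.tts('今天天气真好，我们一起出去吃饭吧。', src_path,
                speaker='default', language='Chinese', speed=1.0)

# target_se was extracted from the (English) reference clip earlier; the
# converter injects that tone color into the Chinese speech.
tone_color_converter.convert(audio_src_path=src_path, src_se=zh_source_se,
                             tgt_se=target_se, output_path='output_chinese.wav')
```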
Final Thoughts
In this article, we have discussed OpenVoice, a versatile instant voice cloning framework that replicates the voice of any user and generates speech in multiple languages from a brief audio clip of the reference speaker. The primary intuition behind OpenVoice is that as long as a model does not need to perform tone color cloning of the reference speaker, a framework can employ a base speaker TTS model to control the language and the voice styles.
OpenVoice demonstrates that instant voice cloning models can replicate the tone color of the reference speaker while achieving granular control over voice styles including accent, rhythm, intonation, pauses, and even emotions. OpenVoice delivers superior instant voice cloning results while remaining computationally viable, with operating costs up to 10 times lower than currently available APIs with inferior performance.