
Recent advances in language models have showcased impressive zero-shot voice conversion (VC) capabilities. However, prevailing LM-based VC models typically perform offline conversion from source semantics to acoustic features, requiring the entire source speech in advance and limiting their application to real-time scenarios.
In this research, a team of researchers from Northwestern Polytechnical University, China, and ByteDance introduces StreamVoice, a novel streaming language model (LM)-based method for zero-shot voice conversion (VC) that enables real-time conversion given any speaker prompt and source speech. StreamVoice achieves streaming capability by employing a fully causal, context-aware LM with a temporal-independent acoustic predictor.
The model alternately processes semantic and acoustic features at each autoregressive time step, eliminating the need for the complete source speech. To mitigate potential performance degradation caused by the incomplete context available in streaming processing, two strategies are employed (see the sketch after the list):
1) teacher-guided context foresight, where a teacher model summarizes the present and future semantic context during training to guide the model's forecasting of the missing context;
2) a semantic masking strategy, which promotes acoustic prediction from preceding corrupted semantic and acoustic input to strengthen context-learning ability. Notably, StreamVoice stands out as the first LM-based streaming zero-shot VC model without any future look-ahead. Experimental results showcase StreamVoice's streaming conversion capability while maintaining zero-shot performance comparable to non-streaming VC systems.
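To make this design concrete, here is a minimal PyTorch sketch, not the authors' implementation: the class and parameter names (StreamingVCSketch, mask_prob), the vocabulary and model sizes, and the masking probability are all illustrative assumptions. It interleaves per-step semantic and acoustic tokens in a fully causal transformer and applies the semantic masking strategy during training; teacher-guided context foresight is omitted for brevity.

```python
# A minimal sketch (not the authors' code) of the alternating autoregression:
# at each step the causal LM consumes one semantic token, then predicts the
# acoustic token for that step. All sizes and names are illustrative.
import torch
import torch.nn as nn

class StreamingVCSketch(nn.Module):
    def __init__(self, n_semantic=512, n_acoustic=1024, d_model=256, mask_prob=0.15):
        super().__init__()
        self.sem_emb = nn.Embedding(n_semantic + 1, d_model)  # +1 for a [MASK] id
        self.ac_emb = nn.Embedding(n_acoustic, d_model)
        self.mask_id = n_semantic
        self.mask_prob = mask_prob
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.causal_lm = nn.TransformerEncoder(layer, num_layers=4)
        self.acoustic_head = nn.Linear(d_model, n_acoustic)  # acoustic predictor

    def forward(self, semantic, acoustic):
        # semantic, acoustic: (batch, T) token ids, aligned per time step.
        if self.training:
            # Semantic masking: randomly corrupt semantic input so the model
            # learns to predict acoustics from imperfect preceding context.
            corrupt = torch.rand_like(semantic, dtype=torch.float) < self.mask_prob
            semantic = semantic.masked_fill(corrupt, self.mask_id)
        # Interleave per step: [s_1, a_1, s_2, a_2, ...] -> (batch, 2T, d_model)
        x = torch.stack([self.sem_emb(semantic), self.ac_emb(acoustic)], dim=2)
        x = x.flatten(1, 2)
        # Fully causal attention: each position attends only to the past,
        # so no future look-ahead is needed.
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.causal_lm(x, mask=causal)
        # The hidden state at each semantic position (even indices) has seen
        # s_1..s_t and a_1..a_{t-1}; use it to predict the acoustic token a_t.
        return self.acoustic_head(h[:, 0::2])

model = StreamingVCSketch()
sem = torch.randint(0, 512, (2, 10))
ac = torch.randint(0, 1024, (2, 10))
logits = model(sem, ac)  # (2, 10, 1024): per-step acoustic-token logits
```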
The figure above illustrates the concept of streaming zero-shot VC under the widely used recognition-synthesis framework, on which StreamVoice is built. The experiments show that StreamVoice can perform speech conversion in a streaming fashion, achieving high speaker similarity for both seen and unseen speakers while maintaining performance comparable to non-streaming VC systems. As the first LM-based zero-shot VC model without any future look-ahead, StreamVoice's entire pipeline incurs only 124 ms of latency for the conversion process, notably 2.4 times faster than real-time on a single A100 GPU, even without engineering optimizations. The team's future work involves using more training data to enhance StreamVoice's modeling ability. They also plan to optimize the streaming pipeline by incorporating a high-fidelity, low-bitrate codec and a unified streaming model.
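As a rough sanity check on what "2.4 times faster than real-time" means, inference speed is conventionally reported as a real-time factor (RTF): processing time divided by audio duration, with RTF below 1 indicating streaming capability. The snippet below is a back-of-the-envelope illustration; the chunk duration is an assumed example, and only the 124 ms latency and 2.4x figures come from the article.

```python
def real_time_factor(processing_time_s: float, audio_duration_s: float) -> float:
    """RTF = compute time / audio time; RTF < 1 means faster than real time."""
    return processing_time_s / audio_duration_s

# "2.4x faster than real-time" corresponds to RTF ~= 1 / 2.4 ~= 0.42:
print(f"implied RTF: {1 / 2.4:.2f}")

# Illustrative example (the chunk length is an assumption, not from the paper):
# converting a 1.0 s audio chunk in ~0.42 s of compute keeps pace with the
# incoming stream, on top of the reported 124 ms pipeline latency.
print(real_time_factor(0.42, 1.0))  # 0.42 < 1.0 -> streaming-capable
```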
Check out the Paper. All credit for this research goes to the researchers of this project.
Janhavi Lande is an Engineering Physics graduate from IIT Guwahati, class of 2023. She is an aspiring data scientist and has been working in the world of ML/AI research for the past two years. She is most fascinated by this ever-changing world and its constant demand on humans to keep up with it. In her free time she enjoys traveling, reading, and writing poems.