This AI Paper from Google Unveils a Groundbreaking Non-Autoregressive, LM-Fused ASR System for Superior Multilingual Speech Recognition

The evolution of speech recognition technology has been marked by significant strides, but challenges like latency, the time delay in processing spoken language, have continually impeded progress. This latency is especially pronounced in autoregressive models, which process speech sequentially, resulting in delays. These delays are detrimental in real-time applications like live captioning or virtual assistants, where immediacy is vital. Addressing this latency without compromising accuracy remains critical to advancing speech recognition technology.

A pioneering approach in speech recognition is the development of a non-autoregressive model, a departure from traditional methods. This model, proposed by a team of researchers from Google Research, is designed to tackle the latency issues inherent in existing systems. It utilizes large language models and leverages parallel processing, which handles speech segments concurrently rather than sequentially. This parallel processing approach is instrumental in reducing latency, offering a more fluid and responsive user experience.

The core of this revolutionary model is the fusion of the Universal Speech Model (USM) with the PaLM 2 language model. The USM, a robust model with 2 billion parameters, is designed for accurate speech recognition. It uses a vocabulary of 16,384 word pieces and employs a Connectionist Temporal Classification (CTC) decoder for parallel processing. The USM is trained on an extensive dataset encompassing over 12 million hours of unlabeled audio and 28 billion sentences of text data, making it remarkably adept at handling multilingual inputs.

The PaLM 2 language model, known for its prowess in natural language processing, complements the USM. It is trained on diverse data sources, including web documents and books, and employs a large 256,000-wordpiece vocabulary. The model stands out for its ability to score Automatic Speech Recognition (ASR) hypotheses using a prefix language model scoring mode. This method involves prompting the model with a fixed prefix, the top hypotheses from previous segments, and scoring several suffix hypotheses for the current segment.
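To make the prefix scoring mode concrete, here is a minimal Python sketch: the language model is conditioned on the fixed prefix from earlier segments and each candidate suffix is scored token by token. `score_suffix`, `rescore`, and `toy_lm` are illustrative names for this sketch, not the paper's API, and the toy bigram model stands in for PaLM 2.

```python
import math

def score_suffix(lm_log_probs, prefix_tokens, suffix_tokens):
    """Sum LM log-probabilities of each suffix token, given the growing context."""
    score = 0.0
    context = list(prefix_tokens)
    for tok in suffix_tokens:
        # lm_log_probs(context) returns a dict of next-token log-probabilities
        score += lm_log_probs(context).get(tok, float("-inf"))
        context.append(tok)
    return score

def rescore(lm_log_probs, prefix_tokens, suffix_candidates):
    """Pick the suffix hypothesis the LM scores highest under the fixed prefix."""
    return max(suffix_candidates,
               key=lambda s: score_suffix(lm_log_probs, prefix_tokens, s))

# Toy stand-in LM: strongly prefers "recognition" after "speech".
def toy_lm(context):
    if context and context[-1] == "speech":
        return {"recognition": math.log(0.8), "synthesis": math.log(0.2)}
    return {"speech": math.log(0.5), "recognition": math.log(0.25),
            "synthesis": math.log(0.25)}

best = rescore(toy_lm, ["automatic", "speech"], [["synthesis"], ["recognition"]])
```

Here the prefix stays fixed within a segment; only the candidate suffixes compete, which is what lets the scoring run once per segment rather than once per token.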

In practice, the combined system processes long-form audio in 8-second chunks. As soon as the audio is available, the USM encodes it, and these segments are then relayed to the CTC decoder. The decoder forms a confusion-network lattice encoding possible word pieces, which the PaLM 2 model scores. The system updates every 8 seconds, providing a near real-time response.
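The chunked pipeline can be sketched as a simple loop, assuming stand-in callables: `encode`, `ctc_decode`, and `lm_rescore` below are hypothetical placeholders for the USM encoder, CTC decoder, and PaLM 2 scorer described above, not real APIs.

```python
CHUNK_SECONDS = 8

def split_into_chunks(samples, sample_rate):
    """Split a flat list of audio samples into fixed 8-second chunks."""
    step = CHUNK_SECONDS * sample_rate
    return [samples[i:i + step] for i in range(0, len(samples), step)]

def transcribe_stream(samples, sample_rate, encode, ctc_decode, lm_rescore):
    """Process audio chunk by chunk; each chunk's best hypothesis extends the prefix."""
    prefix = []
    for chunk in split_into_chunks(samples, sample_rate):
        features = encode(chunk)               # USM encodes the chunk
        hypotheses = ctc_decode(features)      # CTC emits candidate word-piece sequences
        best = lm_rescore(prefix, hypotheses)  # LM scores suffixes given the prefix
        prefix.extend(best)                    # confirmed text grows segment by segment
    return prefix
```

The key property is that each iteration depends only on the already-confirmed prefix, so the transcript is emitted every 8 seconds instead of waiting for the full recording.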

The performance of this model was rigorously evaluated across several languages and datasets, including YouTube captioning and the FLEURS test set. The results were remarkable. An average improvement of 10.8% in relative word error rate (WER) was observed on the multilingual FLEURS test set. On the YouTube captioning dataset, which presents a more challenging scenario, the model achieved an average improvement of 3.6% across all languages. These improvements are a testament to the model's effectiveness across diverse languages and settings.

The study delved into various aspects affecting the model's performance. It explored the impact of language model size, ranging from 128 million to 340 billion parameters. It found that while larger models reduced sensitivity to the fusion weight, the gains in WER might not offset the increasing inference costs. The optimal LLM scoring weight also shifted with model size, suggesting a balance between model complexity and computational efficiency.
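The fusion-weight trade-off amounts to a weighted combination of acoustic and LM log-probabilities. This is a minimal sketch of that idea; the function names, the weight values, and the toy hypotheses are illustrative, not figures from the study.

```python
def fused_score(ctc_logprob, lm_logprob, lam):
    """Fused hypothesis score: acoustic score plus LM score scaled by weight lam."""
    return ctc_logprob + lam * lm_logprob

def pick_best(hypotheses, lam):
    """hypotheses: list of (text, ctc_logprob, lm_logprob) tuples."""
    return max(hypotheses, key=lambda h: fused_score(h[1], h[2], lam))[0]

# Toy example: with lam = 0 the acoustically better hypothesis wins;
# with lam = 1 the LM-preferred (grammatical) hypothesis wins.
hyps = [("their going home", -1.0, -6.0),
        ("they're going home", -1.5, -2.0)]
```

A larger `lam` trusts the language model more, which is why the study's optimal scoring weight shifts as the LM grows.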

In conclusion, this research presents a major leap in speech recognition technology. Its highlights include:

  • A non-autoregressive model combining the USM and PaLM 2 for reduced latency.
  • Enhanced accuracy and speed, making it suitable for real-time applications.
  • Significant improvements in WER across multiple languages and datasets.

This model’s innovative approach to processing speech in parallel, coupled with its ability to handle multilingual inputs efficiently, makes it a promising solution for various real-world applications. The insights provided into system parameters and their effects on ASR efficacy add valuable knowledge to the field, paving the way for future advancements in speech recognition technology.


Check out the Paper. All credit for this research goes to the researchers of this project.



Hello, my name is Adnan Hassan. I’m a consulting intern at Marktechpost and soon to be a management trainee at American Express. I’m currently pursuing a dual degree at the Indian Institute of Technology, Kharagpur. I’m passionate about technology and want to create new products that make a difference.


