Explore the cutting-edge multilingual features of Meta’s latest automatic speech recognition (ASR) model
Massively Multilingual Speech (MMS)¹ is the latest release from Meta AI, announced just a few days ago. It pushes the boundaries of speech technology by expanding its reach from about 100 languages to over 1,000. This was achieved by building a single multilingual speech recognition model. The model can also identify over 4,000 languages, a 40-fold increase over previous capabilities.
The MMS project aims to make it easier for people to access information and use devices in their preferred language. It expands text-to-speech and speech-to-text technology to underserved languages, continuing to break down language barriers in our globalized world. Existing applications, such as virtual assistants or voice-activated devices, can now support a greater variety of languages. At the same time, new use cases emerge in cross-cultural communication, for example in messaging services or in virtual and augmented reality.
In this article, we walk through the use of MMS for ASR in English and Portuguese and provide a step-by-step guide on setting up the environment to run the model.
This article belongs to “Large Language Models Chronicles: Navigating the NLP Frontier”, a new weekly series of articles that explores how to leverage the power of large models for various NLP tasks. By diving into these cutting-edge technologies, we aim to empower developers, researchers, and enthusiasts to harness the potential of NLP and unlock new possibilities.
Articles published so far:
- Summarizing the latest Spotify releases with ChatGPT
- Master Semantic Search at Scale: Index Millions of Documents with Lightning-Fast Inference Times using FAISS and Sentence Transformers
- Unlock the Power of Audio Data: Advanced Transcription and Diarization with Whisper, WhisperX, and PyAnnotate
- Whisper JAX vs PyTorch: Uncovering the Truth about ASR Performance on GPUs
- Vosk for Efficient Enterprise-Grade Speech Recognition: An Evaluation and Implementation Guide
As always, the code is available on my GitHub.
Meta used religious texts, such as the Bible, to build a model covering this wide range of languages. These texts have several interesting properties: first, they are translated into many languages, and second, there are publicly available audio recordings of people reading them in different languages. Thus, the main dataset the model was trained on was the New Testament, which the research team was able to collect for over 1,100 languages, providing more than 32 hours of data per language. The team went further to make the model recognize 4,000 languages by using unlabeled recordings of various other Christian religious readings. The experimental results show that, even though the data comes from a specific domain, the model generalizes well.
These are not the only contributions of the work. The team also created a new preprocessing and alignment model that can handle long recordings. It was used to process the audio, and misaligned data was removed in a final cross-validation filtering step. Recall from one of our previous articles that one of Whisper’s challenges was its inability to align transcriptions properly. Another essential part of the approach was the use of wav2vec 2.0, a self-supervised learning model, to train the system on a massive amount of speech data (about 500,000 hours) in over 1,400 languages. The labeled dataset we discussed previously is not enough to train a model of the size of MMS, so wav2vec 2.0 was used to reduce the need for labeled data. Finally, the resulting models were fine-tuned for specific speech tasks, such as multilingual speech recognition or language identification.
The MMS models were open-sourced by Meta a few days ago and made available in the Fairseq repository. In the next section, we cover what Fairseq is and how we can test these new models from Meta.
Fairseq is an open-source sequence-to-sequence toolkit developed by Facebook AI Research, also known as FAIR. It provides reference implementations of various sequence modeling algorithms, including convolutional and recurrent neural networks, transformers, and other architectures.
The Fairseq repository is based on PyTorch, another open-source project initially developed by Meta and now under the umbrella of the Linux Foundation. PyTorch is a very powerful machine learning framework that offers high flexibility and speed, particularly when it comes to deep learning.
The Fairseq implementations are designed for researchers and developers to train custom models, and the toolkit supports tasks such as translation, summarization, language modeling, and other text generation tasks. One of the key features of Fairseq is its support for distributed training, meaning it can efficiently utilize multiple GPUs either on a single machine or across multiple machines. This makes it well-suited for large-scale machine learning tasks.
Fairseq provides two pre-trained models for download: MMS-300M and MMS-1B. Fine-tuned models are also available for different languages and datasets. For our purposes, we test the MMS-1B model fine-tuned on the 102 languages of the FLEURS dataset, as well as MMS-1B-all, which is fine-tuned on several different datasets to handle 1,162 languages (!).
Keep in mind that these models are still in the research phase, which makes testing a bit more challenging. There are additional steps that you would not find with production-ready software.
First, you need to set up a .env file in your project root to configure your environment variables. It should look something like this:
CURRENT_DIR=/path/to/current/dir
AUDIO_SAMPLES_DIR=/path/to/audio_samples
FAIRSEQ_DIR=/path/to/fairseq
VIDEO_FILE=/path/to/video/file
AUDIO_FILE=/path/to/audio/file
RESAMPLED_AUDIO_FILE=/path/to/resampled/audio/file
TMPDIR=/path/to/tmp
PYTHONPATH=.
PREFIX=INFER
HYDRA_FULL_ERROR=1
USER=micro
MODEL=/path/to/fairseq/models_new/mms1b_all.pt
MODELS_NEW=/path/to/fairseq/models_new
LANG=eng
Next, you need to configure the YAML file located at fairseq/examples/mms/asr/config/infer_common.yaml. This file contains essential settings and parameters used by the inference script.
In the YAML file, use a full path for the checkpoint field, like this (unless you’re using a containerized application to run the script):
checkpoint: /path/to/checkpoint/${env:USER}/${env:PREFIX}/${common_eval.results_path}
The full path is necessary to avoid potential permission issues, unless you are running the application in a container.
If you plan on using a CPU for computation instead of a GPU, you need to add the following directive at the top level of the YAML file:
common:
  cpu: true
This setting directs the script to use the CPU for computations.
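If you are not sure whether your machine exposes a usable GPU, a quick check with PyTorch (which Fairseq depends on anyway) tells you whether you need this setting. This is just a convenience check, not part of the original setup:
import torch

# If this prints False, add the `common: cpu: true` block shown above to the YAML file
print(torch.cuda.is_available())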
We use the dotenv Python library (python-dotenv) to load these environment variables in our Python script. Since we are overwriting some system variables, we need a small trick to make sure we load the right values: we call the dotenv_values function and store its output in a variable. This ensures that we get the variables defined in our .env file and not random system variables, even if they share the same name.
from dotenv import dotenv_values

# Load the variables defined in the .env file (not the system environment)
config = dotenv_values(".env")

current_dir = config['CURRENT_DIR']
tmp_dir = config['TMPDIR']
fairseq_dir = config['FAIRSEQ_DIR']
video_file = config['VIDEO_FILE']
audio_file = config['AUDIO_FILE']
audio_file_resampled = config['RESAMPLED_AUDIO_FILE']
model_path = config['MODEL']
model_new_dir = config['MODELS_NEW']
lang = config['LANG']
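Some of these paths, notably the temporary directory and the folder where we store the downloaded checkpoints, are expected to exist before anything runs. A minimal sketch, assuming the variables above were loaded correctly, creates them up front:
import os

# Create the working directories if they do not exist yet
for path in [tmp_dir, model_new_dir]:
    os.makedirs(path, exist_ok=True)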
Then, we can clone the Fairseq GitHub repository and install it on our machine.
import os
import subprocess

from git import Repo


def git_clone(url, path):
    """
    Clones a git repository

    Parameters:
    url (str): The URL of the git repository
    path (str): The local path where the git repository will be cloned
    """
    if not os.path.exists(path):
        Repo.clone_from(url, path)


def install_requirements(requirements):
    """
    Installs pip packages

    Parameters:
    requirements (list): List of packages to install
    """
    subprocess.check_call(["pip", "install"] + requirements)


git_clone('https://github.com/facebookresearch/fairseq', 'fairseq')
# Install the cloned fairseq repository in editable mode
install_requirements(['--editable', './fairseq'])
We already discussed the models we use in this article, so let’s download them to our local environment.
def download_file(url, path):
    """
    Downloads a file

    Parameters:
    url (str): URL of the file to be downloaded
    path (str): The path where the file will be saved
    """
    subprocess.check_call(["wget", "-P", path, url])


# MMS-1B checkpoint fine-tuned on the 102 languages of the FLEURS dataset
download_file('https://dl.fbaipublicfiles.com/mms/asr/mms1b_fl102.pt', model_new_dir)
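The call above fetches the MMS-1B-FL102 checkpoint. Since we also run the MMS-1B-all model later (it is the file the MODEL variable in our .env points to), we can download it the same way. The URL below follows the naming used in the Fairseq MMS release; double-check it against the repository in case it has moved:
# MMS-1B-all checkpoint, fine-tuned to handle 1,162 languages
download_file('https://dl.fbaipublicfiles.com/mms/asr/mms1b_all.pt', model_new_dir)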
There is one additional restriction related to the input of the MMS model: the sampling rate of the audio data must be 16,000 Hz. In our case, we define two helpers to generate such files: one that converts video to audio and another that resamples audio files to the correct sampling rate.
from pydub import AudioSegment


def convert_video_to_audio(video_path, audio_path):
    """
    Converts a video file to an audio file

    Parameters:
    video_path (str): Path to the video file
    audio_path (str): Path to the output audio file
    """
    # ffmpeg extracts the audio track and sets the sample rate to 16 kHz
    subprocess.check_call(["ffmpeg", "-i", video_path, "-ar", "16000", audio_path])


def resample_audio(audio_path, new_audio_path, new_sample_rate):
    """
    Resamples an audio file

    Parameters:
    audio_path (str): Path to the current audio file
    new_audio_path (str): Path to the output audio file
    new_sample_rate (int): New sample rate in Hz
    """
    audio = AudioSegment.from_file(audio_path)
    audio = audio.set_frame_rate(new_sample_rate)
    audio.export(new_audio_path, format='wav')
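Putting the two helpers together for our test files, here is a minimal usage sketch based on the paths loaded from the .env file earlier:
# Extract a 16 kHz audio track from the test video
convert_video_to_audio(video_file, audio_file)

# Resample an existing audio file to the 16 kHz rate the model expects
resample_audio(audio_file, audio_file_resampled, 16000)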
We are now ready to run the inference process using the MMS-1B-all model, which supports 1,162 languages.
def run_inference(model, lang, audio):
    """
    Runs the MMS ASR inference

    Parameters:
    model (str): Path to the model file
    lang (str): Language of the audio file
    audio (str): Path to the audio file
    """
    # The inference script is referenced by a relative path, so this call
    # assumes the working directory is the root of the cloned fairseq repo
    subprocess.check_call(
        [
            "python",
            "examples/mms/asr/infer/mms_infer.py",
            "--model",
            model,
            "--lang",
            lang,
            "--audio",
            audio,
        ]
    )


run_inference(model_path, lang, audio_file_resampled)
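To transcribe the Portuguese sample as well, the same call works with the corresponding ISO 639-3 language code. The audio path below is illustrative; point it at your own resampled Portuguese file:
# 'por' is the ISO 639-3 code MMS uses for Portuguese; the file path is illustrative
run_inference(model_path, 'por', '/path/to/portuguese_audio_resampled.wav')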
In this section, we describe our experimental setup and discuss the results. We performed ASR using two different models from Fairseq, MMS-1B-all and MMS-1B-FL102, in both English and Portuguese. You can find the audio files in my GitHub repo; these are files that I generated myself purely for testing purposes.
Let’s start with the MMS-1B-all model. Here is the output for the English and Portuguese audio samples:
Eng: just requiring a small clip to understand if the new facebook research model really performs on
Por: ora bem só agravar aqui um exemplo pa tentar perceber se de facto om novo modelo da facebook research realmente funciona ou não vamos estar
With MMS-1B-FL102, the generated transcription was significantly worse. Let’s look at the same example for English:
Eng: just recarding a small ho clip to understand if the new facebuok research model really performs on speed recognition tasks lets see
While the generated transcriptions are not super impressive by the standard of models we have today, we need to look at these results from the perspective that these models open up ASR to a much wider range of the global population.
The Massively Multilingual Speech model, developed by Meta, represents one more step toward fostering global communication and broadening the reach of language technology with AI. Its ability to identify over 4,000 languages and to perform effectively across 1,162 of them increases accessibility for many languages that have traditionally been underserved.
Our testing of the MMS models showcased the possibilities and limitations of the technology at its current stage. Although the transcriptions produced by the MMS-1B-FL102 model were not as impressive as expected, the MMS-1B-all model provided promising results, demonstrating its capability to transcribe speech in both English and Portuguese. Portuguese has been one of those underserved languages, especially when we consider European Portuguese.
Feel free to try it out in your language of choice and to share the transcription and feedback in the comments section.
Keep in touch: LinkedIn
[1] — Pratap, V., Tjandra, A., Shi, B., Tomasello, P., Babu, A., Kundu, S., Elkahky, A., Ni, Z., Vyas, A., Fazel-Zarandi, M., Baevski, A., Adi, Y., Zhang, X., Hsu, W.-N., Conneau, A., & Auli, M. (2023). Scaling Speech Technology to 1,000+ Languages. arXiv.