Salmonn: Towards Generic Hearing Abilities For Large Language Models
SALMONN: An Introduction to Single Audio-Text Multimodal Large Language Models

Hearing, which involves the perception and understanding of generic auditory information, is crucial for AI agents operating in real-world environments. This auditory information encompasses three primary sound types: music, audio events, and speech. Recently, text-based Large Language Model (LLM) frameworks have shown remarkable abilities, achieving human-level performance on a wide range of Natural Language Processing (NLP) tasks. Moreover, instruction tuning, a training method that uses pairs of user prompts and reference responses, has become popular; it trains large language models to follow open-ended user instructions more effectively. Building on this, current research is increasingly focused on equipping large language models with the ability to perceive multimodal content.

In this article, we will be talking about SALMONN, or Speech Audio Language Music Open Neural Network, a cutting-edge speech-audio-language-music neural network built by incorporating speech and audio encoders with a pre-trained text-based large language model into a single audio-text multimodal model. The SALMONN model enables Large Language Models to understand and process generic audio inputs directly, and it delivers competitive performance on a wide range of audio and speech tasks used in training, including auditory-information-based question answering, speech recognition and translation, speaker verification, emotion recognition, audio and music captioning, and much more. We will take a deeper dive into the SALMONN framework and explore its workings, architecture, and results across a wide range of tasks. So let's get started.

SALMONN stands for Speech Audio Language Music Open Neural Network, and it is a single audio-text multimodal large language model framework capable of perceiving and understanding three basic sound types: speech, audio events, and music. The SALMONN model enables Large Language Models to understand and process generic audio inputs directly, and it delivers competitive performance on a wide range of audio and speech tasks.

To boost its performance on both speech and non-speech audio tasks, the SALMONN framework employs a dual-encoder structure consisting of a BEATs audio encoder and a speech encoder sourced from the Whisper speech model. Moreover, the SALMONN framework uses a window-level Q-Former (query Transformer) as a connection module to convert the variable-length encoder output sequence into a variable number of augmented audio tokens, ultimately achieving high temporal resolution for audio-text alignment. The LoRA (Low-Rank Adaptation) approach is applied as a cross-modal adaptor to the Vicuna LLM to align the model's output space with its augmented input space and further boost performance. In the SALMONN framework, the ability to perform cross-modal tasks unseen during training is referred to as cross-modal emergent abilities; these abilities are partially lost during instruction tuning, which is the primary reason the SALMONN framework implements an additional few-shot activation stage to regain the LLM's general emergent abilities.

Moreover, the framework makes use of a wide range of speech, audio-event, and music benchmarks to evaluate its cognitive hearing abilities, and divides the benchmarks into three levels. The first level covers the eight tasks used in instruction training, including speech recognition, translation, and audio captioning. The other two levels consist of untrained tasks: the second level comprises five speech-based Natural Language Processing tasks, such as slot filling and translation to untrained languages, which rely on high-quality multilingual alignment between text and speech tokens, while the final level contains tasks that require understanding both speech and non-speech auditory information, such as speech-audio co-reasoning and audio-based storytelling.

To sum it up, the SALMONN framework is:

  1. The first multimodal large language model capable of perceiving and understanding general audio inputs, including speech, audio events, and music, to the best of our knowledge. 
  2. An attempt to study the cross-modal emergent abilities unlocked by discounting the LoRA scaling factor, together with an inexpensive additional activation stage used during training to activate the framework's cross-modal emergent abilities. 

SALMONN : Architecture and Methodology

In this section, we will take a look at the architecture, training method, and experimental setup of the SALMONN framework. 

Model Architecture

At the core of its architecture, the SALMONN framework synchronizes and combines the outputs from two auditory encoders, after which it applies a window-level Q-Former as a connection module. The output sequence generated by the Q-Former is merged with the text instruction prompts and then fed to the LoRA-adapted Vicuna LLM to generate the required response. 

Auditory Encoders

The SALMONN framework makes use of two auditory encoders: a non-speech BEATs audio encoder, and a speech encoder sourced from OpenAI's Whisper model. The BEATs audio encoder is trained with a self-supervised iterative learning approach to extract high-level non-speech audio semantics, whereas the speech encoder is trained on a large amount of weakly supervised data for speech recognition and speech translation tasks, so its output features contain both speech and background-noise information. In BEATs training, the input audio is first tokenized, then masked and predicted. The resulting auditory features of the two encoders complement each other and are suitable for both speech and non-speech information. 
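
The snippet below is a minimal sketch of this dual-encoder front end, assuming 16 kHz mono input. The Whisper classes and checkpoint come from the Hugging Face transformers library, while `beats_encoder` is a placeholder standing in for the BEATs model (which is distributed separately); the alignment-by-truncation step is a simplification, not the exact fusion used in SALMONN.

```python
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

# Whisper branch: real Hugging Face classes; the checkpoint name is illustrative.
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-large-v2")
whisper_encoder = WhisperModel.from_pretrained("openai/whisper-large-v2").encoder.eval()

def encode_audio(waveform: torch.Tensor, beats_encoder) -> torch.Tensor:
    """Concatenate speech (Whisper) and non-speech (BEATs) features along the feature axis."""
    # Log-mel features -> Whisper encoder states, roughly 50 frames per second of audio.
    mels = feature_extractor(
        waveform.numpy(), sampling_rate=16_000, return_tensors="pt"
    ).input_features
    with torch.no_grad():
        speech_feats = whisper_encoder(mels).last_hidden_state   # (1, T, D_speech)
        audio_feats = beats_encoder(waveform.unsqueeze(0))       # (1, T', D_audio), placeholder call
    # Align the two streams in time (truncation here for simplicity) and concatenate.
    T = min(speech_feats.size(1), audio_feats.size(1))
    return torch.cat([speech_feats[:, :T], audio_feats[:, :T]], dim=-1)
```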

Window Level Q-Former

The Q-Former structure is a common approach used in visual LLM frameworks to convert the output of an image encoder into textual input tokens, but some modification is required when dealing with audio inputs of varying lengths. To be more specific, for an input image the framework regards the encoder output as a single sequence, and the Q-Former uses a fixed number of trainable queries to transform that encoder output sequence into textual tokens via stacked Q-Former blocks. A stacked Q-Former block resembles a Transformer decoder block, with the exceptions of removing the causal masks in the self-attention layers and using a fixed number of trainable static queries in the initial block. SALMONN instead applies the Q-Former at the window level: the encoder output sequence is split into fixed-size windows, the Q-Former converts each window into a fixed number of audio tokens, and the total number of audio tokens therefore varies with the length of the input. 
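
Below is a rough, illustrative implementation of such a window-level Q-Former. It approximates the Q-Former block with PyTorch's `nn.TransformerDecoder` (no causal mask is applied, matching the description above); the default window size and number of queries are placeholders rather than the values used in the paper.

```python
import torch
import torch.nn as nn

class WindowLevelQFormer(nn.Module):
    """Converts a variable-length encoder output into a variable number of audio tokens."""

    def __init__(self, d_model=1024, n_queries=1, window_size=17, n_layers=2, n_heads=8):
        super().__init__()
        self.window_size = window_size
        # A fixed number of trainable query embeddings, shared across all windows.
        self.queries = nn.Parameter(torch.randn(n_queries, d_model) * 0.02)
        block = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerDecoder(block, num_layers=n_layers)

    def forward(self, encoder_out: torch.Tensor) -> torch.Tensor:
        # encoder_out: (B, T, D)  ->  audio tokens: (B, n_windows * n_queries, D)
        B, T, D = encoder_out.shape
        pad = (-T) % self.window_size
        encoder_out = nn.functional.pad(encoder_out, (0, 0, 0, pad))
        n_windows = encoder_out.size(1) // self.window_size
        # Split the sequence into fixed-size windows and run the Q-Former on each window.
        windows = encoder_out.reshape(B * n_windows, self.window_size, D)
        queries = self.queries.unsqueeze(0).expand(B * n_windows, -1, -1).contiguous()
        tokens = self.blocks(tgt=queries, memory=windows)   # cross-attend within each window
        return tokens.reshape(B, n_windows * self.queries.size(0), D)
```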

LoRA and LLM

The SALMONN framework also deploys a Vicuna LLM, which is a LLaMA large language model fine-tuned to follow instructions more accurately and effectively. LoRA is a common method for parameter-efficient fine-tuning, and it is included in the SALMONN framework to adapt the query and value weight matrices in Vicuna's self-attention layers. 
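
A hedged sketch of how such LoRA adaptors might be attached with the peft library is shown below; the model identifier, rank, and alpha values are illustrative and are not necessarily those used by SALMONN.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load a Vicuna-style LLM (identifier illustrative) and keep its base weights frozen.
llm = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-13b-v1.5")

lora_config = LoraConfig(
    r=8,                                  # low-rank dimension
    lora_alpha=32,                        # scaling numerator (effective scale = alpha / r)
    target_modules=["q_proj", "v_proj"],  # adapt the query and value weight matrices
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
llm = get_peft_model(llm, lora_config)
llm.print_trainable_parameters()          # only the LoRA weights receive gradients
```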

Training Method

The SALMONN framework makes use of a three-stage cross-modal training approach. The training process comprises a pre-training stage and an instruction tuning stage, both of which are found in most visual LLM frameworks, plus an additional activation tuning stage implemented to resolve the over-fitting issues caused by the audio captioning and speech recognition training tasks. 

Pre-Training Stage

To bridge the gap between the pre-trained parameters (the encoders and the LLM) and the randomly initialized parameters (the adaptor and connection modules), the SALMONN framework uses a large amount of speech recognition and audio captioning data to pre-train the Q-Former and LoRA components. These tasks contain vital auditory information about the key contents of both speech and non-speech audio events, and neither requires complex understanding or reasoning, which makes them suitable for learning the alignment between textual and auditory information. 
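
A minimal sketch of a single pre-training step is given below, under the assumption that a `model` object wires the frozen encoders, the Q-Former, and the LoRA-equipped LLM together; the `encode_audio` and `llm_forward` helpers are hypothetical names, not SALMONN's actual interface.

```python
import torch

def pretrain_step(model, optimizer, batch) -> float:
    # batch: {"audio": waveforms, "prompt": instruction text, "target": transcript or caption}
    audio_tokens = model.encode_audio(batch["audio"])    # frozen encoders + trainable Q-Former
    logits, labels = model.llm_forward(audio_tokens, batch["prompt"], batch["target"])
    # Standard next-token cross-entropy over the reference transcript/caption.
    loss = torch.nn.functional.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100
    )
    loss.backward()       # gradients flow only into the Q-Former and LoRA parameters
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```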

Instruction Fine-Tuning Stage

The instruction fine-tuning stage implemented in the SALMONN framework resembles the one used in NLP and visual LLM frameworks: a list of speech, audio-event, and music tasks is used for audio-text instruction fine-tuning. The tasks are prioritized on the basis of their importance across different tests, including phone recognition, overlapping speech recognition, and music captioning. Textual information paired with the audio data forms the basis for generating the instruction prompts. 
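
As a toy illustration of how paired audio-text data can be turned into instruction-tuning examples, the snippet below builds (audio, prompt, response) triples; the prompt wording is made up for illustration and is not quoted from the SALMONN paper.

```python
# Illustrative prompt templates keyed by task; the actual SALMONN templates differ.
TASK_PROMPTS = {
    "asr": "Recognize the speech and give me the transcription.",
    "aac": "Please describe the audio clip in one sentence.",
    "mc":  "Please describe the music in detail.",
}

def build_example(task: str, audio_path: str, reference: str) -> dict:
    return {
        "audio": audio_path,            # passed through the auditory encoders
        "prompt": TASK_PROMPTS[task],   # text instruction paired with the audio
        "response": reference,          # transcript / caption used as the training target
    }

example = build_example("asr", "clip_0001.wav", "hello world")
```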

Task Over-Fitting

Even with only the first two training stages, the SALMONN framework delivers competitive results on the instruction-tuning tasks, but its performance falls short on cross-modal tasks, especially those that require cross-modal co-reasoning abilities. Specifically, the model occasionally violates the instruction prompts and generates irrelevant or incorrect responses; this phenomenon is referred to as task over-fitting in the SALMONN framework, and the activation tuning stage is implemented to resolve it. 

Activation Tuning Stage

An effective approach to resolving the over-fitting issue is to regularize the intrinsic conditional language model using tasks with longer and more diverse responses, such as storytelling or auditory-information-based question answering. The framework generates the paired training data for such tasks from text paired with speech, audio, or music captions. 
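
One plausible way to generate such paired data, sketched below, is to prompt a text-only LLM with the caption of an audio clip and ask for a long, diverse response; `generate_text` is a stand-in for whatever text generator is used, and the prompt strings are illustrative rather than SALMONN's own.

```python
def make_activation_pair(audio_path: str, caption: str, generate_text) -> dict:
    # Ask a text LLM to write a long response grounded in the clip's caption.
    writing_prompt = (
        "Based on the following description of a sound clip, write a short story.\n"
        f"Description: {caption}"
    )
    return {
        "audio": audio_path,
        "prompt": "Listen to the audio and tell me a story based on what you hear.",
        "response": generate_text(writing_prompt),   # long, diverse target response
    }
```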

Task Specifications

To evaluate SALMONN's zero-shot cross-modal emergent abilities, the developers include 15 speech, audio, and music tasks divided across three levels. 

Level 1

The first level contains the tasks used for instruction tuning, and they are therefore the easiest set of tasks for the SALMONN framework to perform. 

Level 2

The second level consists of untrained tasks, and their complexity is higher compared to the Level 1 tasks. The Level 2 tasks are speech-based Natural Language Processing tasks, including speech keyword extraction, which is used to evaluate the framework's accuracy at extracting certain keywords from speech. Other tasks include SQQA (Spoken-Query-based Question Answering), which evaluates the common-sense knowledge the framework extracts from spoken questions, SF (Speech-based Slot Filling), which evaluates the accuracy of slot values, and finally two AST (speech translation) tasks for English-to-German and English-to-Japanese translation. 

Level 3

The complexity of the Level 3 tasks is the highest of the three levels; this level includes SAC (Speech Audio Co-reasoning) and audio-based storytelling. The SAC task requires the SALMONN framework to understand a spoken question embedded in the audio clip fed to the model, find supporting evidence in the background audio events or music, and finally reason about it to produce an appropriate answer. The audio-based storytelling task requires the model to generate a meaningful story based on the auditory information sourced from general audio inputs.

Results

Level 1 Tasks

The following table presents the results on Level 1 tasks, and as can be observed, the SALMONN framework returns competitive results on Level 1 tasks with or without activation tuning. 

Level 2 and 3 Tasks

Although the SALMONN framework returns competitive results on Level 1 tasks even without activation tuning, the same cannot be said for Level 2 and Level 3 tasks: without activation tuning, the SALMONN framework suffers heavily from task over-fitting. The performance dips even further on the SQQA, SAC, and storytelling tasks that emphasize multimodal interaction, where the SALMONN framework struggles to follow the instructions without activation tuning. With activation tuning, however, the results improve considerably, as shown in the following image. 

Discounting LoRA Scaling Factor

This experiment evaluates the influence of test-time discounting of the LoRA scaling factor on reducing task over-fitting. As can be observed in the following figure, which reports ASR & PR, SQQA, storytelling, and SAC performance respectively, decreasing the LoRA scaling factor to around 2.0 elevates the cross-modal reasoning abilities of the SALMONN framework. 
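
With a peft-wrapped model, this kind of test-time discounting can be approximated by shrinking each LoRA layer's scaling value before evaluation, as sketched below; the attribute names follow current peft internals and may differ across versions.

```python
from peft.tuners.lora import LoraLayer

def discount_lora(peft_model, factor: float, adapter: str = "default") -> None:
    # Each LoRA layer computes W x + scaling * B A x ; shrinking `scaling`
    # reduces the LoRA contribution at inference time without any retraining.
    for module in peft_model.modules():
        if isinstance(module, LoraLayer) and adapter in module.scaling:
            module.scaling[adapter] *= factor

# Example: halve the effective LoRA scaling before running the Level 2/3 evaluations.
# discount_lora(llm, 0.5)
```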

Evaluating Task-Overfitting

To emphasize the need for activation tuning, the SALMONN framework analyzes the changes in perplexity across the three training stages. As can be seen in the following image, the perplexities of the ASR and AAC tasks reach small final values after the first training stage, indicating that the model has learned the cross-modal alignments. 

Moreover, the perplexity of the PR task also drops after instruction tuning, owing to its reliance on the LoRA component to learn the output tokens. It is also observed that although instruction tuning helps reduce the perplexity on the storytelling and SAC tasks, the gap remains too large for the tasks to be performed successfully unless an additional activation stage is added or the LoRA component is removed. 
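
For reference, the quantity being tracked here is the perplexity of the reference response given the audio and the prompt; a minimal sketch of that computation is shown below, where `score_response` is a hypothetical helper returning one log-probability per reference token.

```python
import math

def response_perplexity(model, audio, prompt, reference) -> float:
    log_probs = model.score_response(audio, prompt, reference)   # one log p(token) per reference token
    return math.exp(-sum(log_probs) / len(log_probs))
```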

Activation Tuning

The SALMONN framework explores different activation methods, including training the model on text-based question-answer pairs with long answers, on audio-based storytelling with long written stories, and on ASR with long speech transcriptions. Both the Q-Former and LoRA components are fine-tuned under these three methods. In addition, a fourth method ignores the audio inputs and the Q-Former and fine-tunes only the LoRA and Vicuna components as a text-based large language model. The results are demonstrated in the following image: as can be seen, the model cannot be activated by ASR (training ASR with long labels), nor by the Story- or Text-based methods that train the LoRA component using text prompt inputs. 

Final Thoughts

In this text, we’ve got talked about SALMONN or Speech Audio Language Music Open Neural Network, a single audio-text multimodal large language model framework able to perceiving and understanding three basic audio or sound types including speech, audio events, and music. The SALMONN model enables Large Language Models to grasp and process generic audio inputs directly, and deliver competitive performance on a wide selection of audio & speech tasks. 

The SALMONN framework delivers competitive performance on a wide range of trained tasks, including audio captioning, speech recognition, and speech translation, while generalizing to a number of untrained understanding tasks such as speech keyword extraction and speech translation to untrained languages. Owing to these abilities, the SALMONN framework can be regarded as a step towards equipping large language models with generic hearing abilities.
