
LASS, or Language-queried Audio Source Separation, is a new paradigm for CASA (Computational Auditory Scene Analysis) that aims to separate a target sound from a given audio mixture using a natural language query, providing a natural yet scalable interface for digital audio tasks and applications. Although LASS frameworks have advanced significantly in the past few years in terms of achieving the desired performance on specific audio sources such as musical instruments, they are still unable to separate target audio in the open domain.
AudioSep is a foundational model that aims to address the current limitations of LASS frameworks by enabling target audio separation using natural language queries. The developers of the AudioSep framework trained the model extensively on a range of large-scale multimodal datasets, and evaluated its performance on a wide selection of audio tasks including musical instrument separation, audio event separation, and speech enhancement, among many others. The initial performance of AudioSep meets the benchmarks, as it demonstrates impressive zero-shot learning capabilities and delivers strong audio separation performance.
In this article, we will take a deeper dive into how the AudioSep framework works, examining the architecture of the model, the datasets used for training and evaluation, and the essential concepts behind the AudioSep model. So let's begin with a basic introduction to the CASA framework.
CASA, or Computational Auditory Scene Analysis, is a framework used by developers to design machine listening systems that can perceive complex sound environments in a way similar to how humans perceive sound with their auditory systems. Sound separation, with a special focus on target sound separation, is a fundamental area of research within CASA, and it aims to solve the "cocktail party problem": separating individual sound sources from real-world audio recordings. The importance of sound separation can be attributed mainly to its widespread applications, including music source separation, audio source separation, speech enhancement, target sound identification, and much more.
Most of the past work on sound separation revolves mainly around separating one or a few specific audio sources, such as music separation or speech separation. A newer approach, USS or Universal Sound Separation, aims to separate arbitrary sounds in real-world audio recordings. Nonetheless, separating every sound source from an audio mixture is a difficult and restrictive task, primarily due to the large range of different sound sources existing in the world, which is the main reason why the USS method is not feasible for real-world applications running in real time.
A feasible alternative to the USS method is QSS, or Query-based Sound Separation, which aims to separate an individual or target sound source from the audio mixture based on a specific set of queries. As a result, the QSS framework allows developers and users to extract the desired audio sources from a mixture according to their requirements, which makes QSS a more practical solution for real-world digital applications like multimedia content editing or audio editing.
Moreover, developers have recently proposed an extension of the QSS framework, the LASS or Language-queried Audio Source Separation framework, which aims to separate arbitrary sound sources from an audio mixture by making use of natural language descriptions of the target audio source. Because the LASS framework allows users to extract target audio sources with natural language instructions, it could become a powerful tool with widespread applications in digital audio. Compared with traditional audio-queried or vision-queried methods, using natural language instructions for audio separation offers a greater degree of flexibility and makes the acquisition of query information much easier and more convenient. Furthermore, compared with label-query-based audio separation frameworks that rely on a predefined set of instructions or queries, the LASS framework does not limit the variety of input queries and has the flexibility to generalize seamlessly to the open domain.
Originally, the LASS framework relied on supervised learning in which the model is trained on a set of labeled audio-text paired data. However, the principal issue with this approach is the limited availability of annotated, labeled audio-text data. To reduce the reliance of the LASS framework on annotated audio-text labeled data, the models are trained using a multimodal supervision approach. The primary idea behind multimodal supervision is to use a multimodal contrastive pre-training model, such as CLIP (Contrastive Language-Image Pre-training), as the query encoder for the framework. Since the CLIP framework can align text embeddings with other modalities like audio or vision, it allows developers to train LASS models using data-rich modalities and to perform inference on textual data in a zero-shot setting. Current LASS frameworks, however, are trained on small-scale datasets, and applications of the LASS framework across hundreds of potential domains are yet to be explored.
To address the current limitations of LASS frameworks, developers have introduced AudioSep, a foundational model that aims to separate sound from an audio mixture using natural language descriptions. The current focus for AudioSep is to develop a pre-trained sound separation model that leverages existing large-scale multimodal datasets to enable the generalization of LASS models to open-domain applications. To summarize, the AudioSep model is: "a foundational model for universal sound separation in the open domain using natural language queries or descriptions, trained on large-scale audio and multimodal datasets".
AudioSep: Key Components & Architecture
The architecture of the AudioSep framework comprises two key components: a text encoder and a separation model.
The Text Encoder
The AudioSep framework uses the text encoder of either the CLIP (Contrastive Language-Image Pre-training) model or the CLAP (Contrastive Language-Audio Pre-training) model to extract text embeddings from a natural language query. The input text query consists of a sequence of N tokens that is processed by the text encoder to extract the text embeddings for the given input language query. The text encoder uses a stack of transformer blocks to encode the input text tokens, and the output representations are aggregated after passing through the transformer layers, resulting in a fixed-length D-dimensional vector representation, where D corresponds to the embedding dimension of the CLAP or CLIP model. The text encoder is kept frozen during training.
The CLIP model is pre-trained on a large-scale dataset of image-text pairs using contrastive learning, which is why its text encoder learns to map textual descriptions onto a semantic space that is also shared by the visual representations. The advantage AudioSep gains by using CLIP's text encoder is that it can scale up or train the LASS model on unlabeled audio-visual data using visual embeddings instead, thus enabling the training of LASS models without the need for annotated or labeled audio-text data.
The CLAP model works similarly to the CLIP model and uses a contrastive learning objective: it employs a text encoder and an audio encoder to connect audio and language, bringing text and audio descriptions together in a joint audio-text latent space.
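As a rough illustration of this query-encoding step, the sketch below extracts a fixed-length text embedding with OpenAI's clip package; a CLAP text encoder would play the same role. The checkpoint name, example query, and normalization are illustrative assumptions rather than AudioSep's exact configuration.

```python
# Minimal sketch: encoding a natural language query into a fixed-length embedding
# with a frozen CLIP text encoder (illustrative, not AudioSep's exact setup).
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)  # assumed checkpoint choice
model.eval()  # the text encoder stays frozen during separation training

query = "a dog barking next to a running stream"  # hypothetical query
tokens = clip.tokenize([query]).to(device)        # sequence of N tokens

with torch.no_grad():
    text_embedding = model.encode_text(tokens)    # shape: (1, D); D = 512 for ViT-B/32
    text_embedding = text_embedding / text_embedding.norm(dim=-1, keepdim=True)

print(text_embedding.shape)  # torch.Size([1, 512])
```

This frozen, fixed-length embedding is what conditions the separation model described next.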
Separation Model
The AudioSep framework uses a frequency-domain ResUNet model, fed with the audio mixture, as the separation backbone. The framework first applies an STFT (Short-Time Fourier Transform) to the waveform to extract the complex spectrogram X, along with its magnitude spectrogram and phase. The model then follows the same setting and constructs an encoder-decoder network to process the magnitude spectrogram.
The ResUNet encoder-decoder network consists of 6 encoder blocks, 6 decoder blocks, and 4 bottleneck blocks. Each encoder block uses 4 residual convolutional blocks to downsample the spectrogram into a bottleneck feature, whereas each decoder block uses 4 residual deconvolutional blocks to obtain the separation components by upsampling the features. Each encoder block and its corresponding decoder block, operating at the same downsampling or upsampling rate, are joined by a skip connection. Each residual block consists of 2 Leaky-ReLU activation layers, 2 batch normalization layers, and 2 CNN layers, and the framework additionally introduces a residual shortcut connecting the input and output of every individual residual block. The ResUNet model takes the complex spectrogram X as input and, conditioned on the text embeddings, produces a magnitude mask M and a phase residual as output, which control the scaling of the magnitude and the rotation of the angle of the spectrogram. The separated complex spectrogram can then be obtained by multiplying the predicted magnitude mask and phase residual with the STFT of the mixture.
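The masking arithmetic can be sketched in a few lines of PyTorch. The mask and phase-residual tensors below are stand-ins for the ResUNet's outputs, and the window and hop settings simply mirror the values reported later in the training details; the sample rate is an assumption.

```python
# Sketch of the spectrogram-domain separation step (assumed, simplified):
# the separated spectrogram is the mixture spectrogram scaled by a predicted
# magnitude mask and rotated by a predicted phase residual.
import torch

n_fft, hop = 1024, 320
window = torch.hann_window(n_fft)

mixture = torch.randn(1, 5 * 32000)  # placeholder 5-second mixture waveform
X = torch.stft(mixture, n_fft, hop_length=hop, window=window, return_complex=True)

magnitude, phase = X.abs(), X.angle()

# Stand-ins for the ResUNet outputs, conditioned on the text embedding:
mask = torch.rand_like(magnitude)          # magnitude mask M
phase_residual = torch.zeros_like(phase)   # predicted phase correction

separated = (mask * magnitude) * torch.exp(1j * (phase + phase_residual))
separated_wav = torch.istft(separated, n_fft, hop_length=hop, window=window,
                            length=mixture.shape[-1])
```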
Within this architecture, AudioSep uses a FiLM (Feature-wise Linearly modulated) layer to bridge the separation model and the text encoder after the convolutional blocks in the ResUNet.
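A FiLM layer is small enough to write out in full. The sketch below shows the general idea of modulating convolutional feature maps with a text embedding; the layer sizes are chosen for illustration and are not taken from AudioSep's implementation.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation: scale and shift feature maps
    using parameters predicted from a conditioning (text) embedding."""
    def __init__(self, text_dim: int, num_channels: int):
        super().__init__()
        self.to_gamma = nn.Linear(text_dim, num_channels)  # per-channel scale
        self.to_beta = nn.Linear(text_dim, num_channels)   # per-channel shift

    def forward(self, features: torch.Tensor, text_embedding: torch.Tensor) -> torch.Tensor:
        # features: (batch, channels, freq, time); text_embedding: (batch, text_dim)
        gamma = self.to_gamma(text_embedding)[:, :, None, None]
        beta = self.to_beta(text_embedding)[:, :, None, None]
        return gamma * features + beta

# Hypothetical usage after a convolutional block:
film = FiLM(text_dim=512, num_channels=64)
modulated = film(torch.randn(2, 64, 128, 100), torch.randn(2, 512))
```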
Training and Loss
During the training of the AudioSep model, the developers use a loudness augmentation method and train the AudioSep framework end-to-end with an L1 loss between the ground-truth and predicted waveforms.
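In code, the end-to-end objective is simply a waveform-domain L1 loss. The random gain below is a hedged stand-in for the loudness augmentation, not the exact scheme the authors describe, and the sample rate is assumed.

```python
import torch
import torch.nn.functional as F

def loudness_augment(wav: torch.Tensor, low_db: float = -10.0, high_db: float = 10.0) -> torch.Tensor:
    # Assumed form of the loudness augmentation: apply a random per-example gain in dB.
    gain_db = torch.empty(wav.shape[0], 1).uniform_(low_db, high_db)
    return wav * (10.0 ** (gain_db / 20.0))

target = loudness_augment(torch.randn(4, 5 * 32000))  # placeholder ground-truth waveforms
predicted = torch.randn_like(target)                  # stand-in for the model's separated output

loss = F.l1_loss(predicted, target)  # waveform-domain L1 loss for end-to-end training
print(loss.item())
```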
Datasets and Benchmarks
As mentioned in previous sections, AudioSep is a foundational model that aims to remove the current dependency of LASS models on annotated audio-text paired datasets. The AudioSep model is trained on a wide selection of datasets to equip it with multimodal learning capabilities; here is a detailed description of the datasets and benchmarks used by the developers to train the AudioSep framework.
AudioSet
AudioSet is a weakly-labeled large-scale audio dataset comprising over 2 million 10-second audio snippets extracted from YouTube. Each audio snippet in the AudioSet dataset is labeled by the presence or absence of sound classes, without specific timing details of the sound events. The AudioSet dataset covers over 500 distinct audio classes including natural sounds, human sounds, vehicle sounds, and much more.
VGGSound
The VGGSound dataset is a large-scale audio-visual dataset that, just like AudioSet, has been sourced directly from YouTube, and it contains over 200,000 video clips, each 10 seconds long. The VGGSound dataset is categorized into over 300 sound classes including human sounds, natural sounds, bird sounds, and more. The construction of the VGGSound dataset ensures that the object responsible for producing the target sound is also visible in the corresponding visual clip.
AudioCaps
AudioCaps is the largest publicly available audio captioning dataset, comprising over 50,000 10-second audio clips extracted from the AudioSet dataset. The data in AudioCaps is split into three subsets: training, testing, and validation, and the audio clips are human-annotated with natural language descriptions collected via the Amazon Mechanical Turk platform. It is worth noting that each audio clip in the training set has a single caption, whereas the clips in the testing and validation sets each have five ground-truth captions.
ClothoV2
ClothoV2 is an audio captioning dataset consisting of clips sourced from the FreeSound platform, and just like AudioCaps, each audio clip is human-annotated with natural language descriptions using the Amazon Mechanical Turk platform.
WavCaps
Similar to AudioSet, WavCaps is a weakly-labeled large-scale audio dataset comprising over 400,000 captioned audio clips, with a total runtime of approximately 7,568 hours of training data. The audio clips in the WavCaps dataset are sourced from a wide selection of audio sources including BBC Sound Effects, AudioSet, FreeSound, SoundBible, and more.
Training Details
During the training phase, the AudioSep model randomly samples two audio segments from two different audio clips in the training dataset and mixes them together to create a training mixture, where the length of each audio segment is about 5 seconds. The model then extracts the complex spectrogram from the waveform signal using a Hann window of size 1024 with a hop size of 320.
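The mixing step can be sketched as follows; the sample rate and the random-crop logic are assumptions for illustration, not the authors' exact data pipeline.

```python
import torch

SAMPLE_RATE = 32000          # assumed sample rate
SEGMENT_SECONDS = 5
SEGMENT_LEN = SAMPLE_RATE * SEGMENT_SECONDS

def random_segment(clip: torch.Tensor) -> torch.Tensor:
    # Crop a random 5-second segment from a longer mono clip.
    start = torch.randint(0, clip.shape[-1] - SEGMENT_LEN + 1, (1,)).item()
    return clip[..., start:start + SEGMENT_LEN]

clip_a = torch.randn(10 * SAMPLE_RATE)   # placeholder clips from the training set
clip_b = torch.randn(10 * SAMPLE_RATE)

source = random_segment(clip_a)          # target source for this training example
interference = random_segment(clip_b)    # the other, "background" source
mixture = source + interference          # training mixture fed to the model
```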
The model then uses the text encoder of the CLIP/CLAP models to extract the textual embeddings, with text supervision being the default configuration for AudioSep. For the separation model, the AudioSep framework uses a ResUNet consisting of 30 layers, 6 encoder blocks, and 6 decoder blocks, resembling the architecture followed in the universal sound separation framework. Moreover, each encoder block has two convolutional layers with a 3×3 kernel size, with the number of output feature maps of the encoder blocks being 32, 64, 128, 256, 512, and 1024 respectively. The decoder blocks are symmetric to the encoder blocks, and the developers apply the Adam optimizer to train the AudioSep model with a batch size of 96.
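For reference, the training configuration described above can be gathered into one place; the learning rate and the placeholder model below are assumptions, since the article only specifies the optimizer, batch size, and block layout.

```python
import torch

# Hyperparameters reported above, collected for clarity; lr is an assumption.
config = {
    "resunet_layers": 30,
    "encoder_blocks": 6,
    "decoder_blocks": 6,
    "encoder_channels": [32, 64, 128, 256, 512, 1024],
    "kernel_size": (3, 3),
    "batch_size": 96,
    "stft_window": 1024,
    "hop_size": 320,
}

separation_model = torch.nn.Linear(8, 8)  # placeholder for the ResUNet separation model
optimizer = torch.optim.Adam(separation_model.parameters(), lr=1e-3)  # assumed learning rate
```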
Evaluation Results
On Seen Datasets
The following figure compares the performance of the AudioSep framework on datasets seen during the training phase, including the training datasets. It presents the benchmark evaluation results of the AudioSep framework compared against baseline systems including speech enhancement models, LASS, and CLIP. The AudioSep model with the CLIP text encoder is denoted AudioSep-CLIP, whereas the AudioSep model with the CLAP text encoder is denoted AudioSep-CLAP.
As can be seen in the figure, the AudioSep framework performs well when using audio captions or text labels as input queries, and the results indicate the superior performance of the AudioSep framework compared with previous benchmark LASS and audio-queried sound separation models.
On Unseen Datasets
To assess the performance of AudioSep in a zero-shot setting, the developers went on to evaluate it on unseen datasets. The AudioSep framework delivers impressive separation performance in this zero-shot setting, and the results are displayed in the figure below.
Moreover, the image below shows the results of evaluating the AudioSep model on the Voicebank-DEMAND speech enhancement benchmark.
The evaluation of the AudioSep framework indicates strong, desirable performance on unseen datasets in a zero-shot setting, paving the way for performing sound separation tasks on new data distributions.
Visualization of Separation Results
The figure below shows the results obtained when the developers used the AudioSep-CLAP framework to visualize the spectrograms of ground-truth target audio sources, audio mixtures, and separated audio sources for text queries of diverse audio events or sounds. The results allowed the developers to observe that the spectrogram pattern of the separated source is close to that of the ground-truth source, which further supports the objective results obtained during the experiments.
Comparison of Text Queries
The developers evaluated the performance of AudioSep-CLAP and AudioSep-CLIP on AudioCaps Mini, making use of the AudioSet event labels, the AudioCaps captions, and re-annotated natural language descriptions to examine the effects of different queries, and the following figure shows an example from AudioCaps Mini.
Conclusion
AudioSep is a foundational model that’s developed with the aim of being an open-domain universal sound separation framework that uses natural language descriptions for audio separation. As observed in the course of the evaluation, the AudioSep framework is able to performing zero-shot & unsupervised learning seamlessly by making use of audio captions or text labels as queries. The outcomes & evaluation performance of AudioSep indicate a robust performance that outperforms current cutting-edge sound separation frameworks like LASS, and it could be capable enough to resolve the present limitations of popular sound separation frameworks.