In this post, we explore Google’s innovative approach to training their remarkable text-to-music models, MusicLM and Noise2Music. We’ll delve into the concept of “fake” datasets and how they were used in these breakthrough models. If you’re curious about the inner workings of these systems and their impact on advancing music AI, you’ve come to the right place.
Large language models (LLMs) like ChatGPT or Bard are trained on huge amounts of unstructured text data. Although it can be computationally expensive to gather the content of millions of websites, there is an abundance of training data on the public web. In contrast, text-to-image models like DALL-E 2 require a very different kind of dataset, consisting of pairs of images with corresponding descriptions.
In the same way, text-to-music models depend on songs paired with descriptions of their musical content. However, unlike images, labeled music is genuinely hard to find on the web. Sometimes metadata like instrumentation, genre, or mood is available, but full-text, in-depth descriptions are exceptionally hard to acquire. This poses a major problem for researchers and companies attempting to collect data to train generative music models.
In early 2023, Google researchers created a lot of buzz around music AI with their breakthrough models, MusicLM and Noise2Music. However, among musicians, little is known about how the data for these models was collected. Let’s dive into this topic together and study some of the tricks applied in Google’s music AI research.
Weakly Associated Labels
For MusicLM and Noise2Music, Google relied on another one of their models, called MuLan, which was trained to compute the similarity between any piece of music and any text description. To train MuLan, Google used what are called “weakly associated labels”. Instead of carefully curating a dataset of music with high-quality text descriptions, they purposefully took a different approach.
First, they extracted a 30-second snippet from each of 44 million music videos available on YouTube, resulting in 370k hours of audio. The music was then labeled with various texts related to the video: the video title and description, comments, the names of playlists featuring the video, and more. To reduce noise in this dataset, they employed a large language model to identify which associated text information had music-related content and discarded everything that didn’t.
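To make the filtering step concrete, here is a minimal sketch in Python. The keyword heuristic below is only a stand-in for the large language model Google actually used; the vocabulary list and the `is_music_related` function are hypothetical, invented purely for illustration:

```python
# Toy stand-in for the LLM filter: keep only texts that mention
# music-related vocabulary, discard the rest.
MUSIC_TERMS = {"song", "guitar", "piano", "remix", "vocals", "beat", "melody"}

def is_music_related(text: str) -> bool:
    """Return True if the text appears to describe musical content."""
    lowered = text.lower()
    return any(term in lowered for term in MUSIC_TERMS)

# Example candidate texts scraped from a music video's metadata
candidates = [
    "Amazing guitar solo at 2:31!",        # a comment
    "My summer road trip vlog",            # an unrelated title
    "Lo-fi beats to study to - playlist",  # a playlist name
]
print([t for t in candidates if is_music_related(t)])
```

A real pipeline would replace the keyword check with an LLM prompt asking whether each text describes music, but the surrounding filter-and-discard logic is the same.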
In my view, weakly associated labels cannot yet be considered a “fake” dataset, since the text information was still written by real humans and is undoubtedly related to the music to some extent. However, this approach clearly prioritizes quantity over quality, which would have raised concerns among most machine learning researchers in the past. And Google was just getting started…
Fake Labels
Noise2Music is a generative music AI based on diffusion technology, which was also used in image generation models like DALL-E or Midjourney.
To train Noise2Music, Google took their previous approach to the extreme and transitioned from weakly associated labels to fully artificial labels. In what they refer to as “pseudo-labeling”, the authors adopted a remarkable method to collect music description texts. They prompted a large language model (LaMDA) to write multiple descriptions for 150k popular songs, resulting in 4 million descriptions. Here is an example of such a description:
“Don’t Stop Me Now” by Queen: The energetic rock song builds on a piano, bass guitar, and drums. The singers are excited, ready to go, and uplifting.
Subsequently, the researchers removed the song and artist names to produce descriptions that could, in principle, apply to other songs as well. However, even with these descriptions in hand, the researchers still needed to match them with suitable songs to obtain a large labeled dataset. This is where MuLan, their model trained on weakly associated labels, proved useful.
The researchers collected a large dataset of unlabeled music, amounting to 340k hours. For each of these tracks, they used MuLan to identify the artificially generated song description that best matched it. Essentially, each piece of music is not mapped to a text describing the song itself, but to a description that encapsulates music similar to it.
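The matching step can be sketched as follows. The embedding vectors below are made up for illustration; a real MuLan-style model would produce learned joint audio/text embeddings. The selection logic, however — cosine similarity followed by an argmax over candidate descriptions — looks roughly like this:

```python
import numpy as np

def match_descriptions(track_embs, text_embs):
    """For each track embedding, pick the index of the most similar
    description embedding (cosine similarity), mimicking how a joint
    audio/text embedding model can pseudo-label unlabeled music."""
    # L2-normalize so the dot product equals cosine similarity
    t = track_embs / np.linalg.norm(track_embs, axis=1, keepdims=True)
    d = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = t @ d.T                 # shape: (n_tracks, n_descriptions)
    return sims.argmax(axis=1)     # best description index per track

# Toy example: 3 tracks and 2 candidate descriptions in a shared 4-d space
tracks = np.array([[1.0, 0.0, 0.0, 0.1],
                   [0.0, 1.0, 0.1, 0.0],
                   [0.9, 0.1, 0.0, 0.0]])
texts = np.array([[1.0, 0.1, 0.0, 0.0],   # e.g. "energetic rock song"
                  [0.0, 1.0, 0.0, 0.1]])  # e.g. "sad piano piece"
print(match_descriptions(tracks, texts))  # → [0 1 0]
```

Each track receives the description it is closest to in the shared embedding space, even though that description was originally written for a different song.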
The Issue
In traditional machine learning, the labels assigned to each observation (in this case, a piece of music) should ideally represent an objective truth. However, music descriptions inherently lack objectivity, presenting the first problem. Moreover, by using audio-to-text matching technology, the labels no longer reflect a “truthful” representation of what is happening in the song. They don’t provide an accurate description of the music. Given these apparent flaws, one may wonder why this approach still yields useful results.
Bias vs. Noise
When a dataset’s labels are not accurately assigned, there can be two main causes: bias and noise. Bias refers to a consistent tendency for the labels to be untruthful in a particular way. For instance, if the dataset frequently labels instrumental pieces as songs but never identifies songs as instrumental pieces, it demonstrates a bias toward predicting the presence of vocals.
Noise, on the other hand, indicates a general variability in the labels, regardless of direction. For example, if every track is labeled as a “sad piano piece”, the dataset is heavily biased, as it consistently provides an inaccurate label for many songs. However, since it applies the same label to every track, there is no variability and therefore no noise present in the dataset.
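The distinction can be simulated in a few lines of Python. The two labelers below are hypothetical: the biased one mislabels only in one direction (instrumental to vocal), while the noisy one flips labels in either direction at random, leaving the overall class balance roughly intact:

```python
import random

def biased_labels(truth):
    # Bias: instrumental pieces ("inst") are sometimes mislabeled as
    # vocal ("voc"), but never the other way around.
    return ["voc" if t == "inst" and random.random() < 0.5 else t
            for t in truth]

def noisy_labels(truth, p=0.2):
    # Noise: each label flips with probability p, in either direction.
    flip = {"inst": "voc", "voc": "inst"}
    return [flip[t] if random.random() < p else t for t in truth]

random.seed(0)
truth = ["inst"] * 500 + ["voc"] * 500
b, n = biased_labels(truth), noisy_labels(truth)

# The biased labels systematically overcount "voc" (~750 of 1000),
# while the noisy labels stay balanced on average (~500 of 1000).
print(sum(x == "voc" for x in b), sum(x == "voc" for x in n))
```

Counting the “voc” labels in each output makes the difference visible: bias shifts the label distribution in one direction, whereas noise merely scatters it.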
By mapping tracks to descriptive texts written for other tracks, we introduce noise. This is because, for most tracks, it is unlikely that a perfect description exists in the dataset. Consequently, most labels are a little bit off, i.e. untruthful, which results in noise. However, are the labels biased?
Since the available descriptions were generated for popular songs, it is reasonable to assume that the pool of descriptions is biased toward (Western) popular music. However, with 4 million descriptions based on 150k unique songs, one would expect a diverse range of descriptions to choose from. Moreover, most labeled music datasets exhibit the same bias, so this is not a drawback unique to this approach compared to others. What truly sets this approach apart is the added noise it introduces.
Why Noise Can Be O.K. in Machine Learning
Training a machine learning model on a biased dataset is usually undesirable, as it leads to the model learning and replicating a biased understanding of the task at hand. However, training a machine learning model on unbiased but noisy data can still yield impressive results. Allow me to illustrate this with an example.
Consider the figure below, which depicts two datasets consisting of orange and blue points. In the noise-free dataset, the blue and orange points are perfectly separable. However, in the noisy dataset, some orange points have shifted into the blue cluster, and vice versa. Despite this added noise, if we examine the trained models, we observe that both models identify roughly the same patterns. This is because, even in the presence of noise, the AI learns to identify the optimal pattern that minimizes errors as much as possible.
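A minimal version of this experiment can be reproduced in a few lines of NumPy, using a simple nearest-centroid classifier as a stand-in for the models in the figure:

```python
import numpy as np

rng = np.random.default_rng(42)

# Two separable 2-d clusters: class 0 around (-2, -2), class 1 around (2, 2)
X = np.vstack([rng.normal(-2, 1, (200, 2)), rng.normal(2, 1, (200, 2))])
y_clean = np.repeat([0, 1], 200)

# Flip 15% of the labels at random to create the noisy dataset
y_noisy = np.where(rng.random(400) < 0.15, 1 - y_clean, y_clean)

def fit_centroids(X, y):
    # A minimal classifier: remember the mean point of each class
    return np.stack([X[y == c].mean(axis=0) for c in (0, 1)])

def predict(centroids, X):
    # Assign each point to the nearest class centroid
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

clean_c = fit_centroids(X, y_clean)
noisy_c = fit_centroids(X, y_noisy)

# Both models recover nearly the same decision boundary: even the model
# fitted on noisy labels classifies the true labels almost perfectly.
print((predict(clean_c, X) == y_clean).mean(),
      (predict(noisy_c, X) == y_clean).mean())
```

Because the label flips are symmetric, the flipped points pull both class centroids toward each other by the same amount, leaving the decision boundary between them essentially unchanged — the noise largely cancels out.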
This example demonstrates that an AI can indeed learn from noisy datasets, such as the one generated by Google. However, the main challenge lies in the fact that the noisier a dataset is, the more training data is required to train the model effectively. This is because a noisy dataset inherently contains less valuable information than an equivalent noise-free dataset of the same size.
In conclusion, Google employed innovative techniques to address the challenge of limited labeled music data when training their generative music AI models. They used weakly associated labels for MuLan, leveraging text information from various sources related to music videos, and employed a language model to filter out irrelevant data. When developing Noise2Music, they introduced fake labels by generating multiple descriptions for popular songs and mapping them to suitable tracks using their pre-trained model.
While these approaches deviate from traditional labeling methods, they proved effective nonetheless. Despite the introduced noise, the models were still able to learn and identify the optimal patterns. Although the use of fake datasets may be considered unconventional, it highlights the immense potential of modern language models for creating large and valuable datasets for machine learning.