How Meta’s AI Generates Music Based on a Reference Melody
MusicGen by Meta
Showcase
How Text-to-Music Models Are Trained
A Simple Tweak to the Training Recipe
What’s “The Melody”?
Limitations
Future Perspectives

MusicGen, analyzed

Image by author.

On June 13th, 2023, Meta (formerly Facebook) made waves in the music and AI communities with the release of their generative music model, MusicGen. The model not only surpasses Google’s MusicLM, which was launched earlier this year, in terms of capabilities, but is also trained on licensed music data and open-sourced for non-commercial use.

This means you can not only read the research paper or listen to demos but also copy their code from GitHub or experiment with the model in a web app on HuggingFace.

In addition to generating audio from a text prompt, MusicGen can also generate music based on a given reference melody, a feature known as melody conditioning. In this blog post, I’ll show how Meta implemented this handy and interesting functionality in their model. But before we delve into that, let’s first understand how melody conditioning works in practice.

Base Track

The following is a short electronic music snippet that I produced for this article. It features electronic drums, two dominant 808 basses, and two syncopated synths. When listening to it, try to identify the “main melody” of the track.

Using MusicGen, I can now generate music in other genres that sticks to the same main melody. All I need for that is my base track and a text prompt describing how the new piece should sound.
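For readers who want to try this themselves, here is a minimal sketch of how such a variation could be generated with Meta’s open-source audiocraft library. It is not the exact code I used: the file names are placeholders, and the function names and parameters may differ depending on your audiocraft version, so check the official documentation before running it.

```python
# Minimal sketch of melody-conditioned generation with audiocraft
# (file names and the duration setting are placeholders).
import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# Load the melody-capable checkpoint and choose how many seconds to generate.
model = MusicGen.get_pretrained("facebook/musicgen-melody")
model.set_generation_params(duration=30)

# Load the reference track whose melody should be reused.
melody, sr = torchaudio.load("base_track.wav")

# Generate a new arrangement of that melody from a text prompt.
wav = model.generate_with_chroma(
    descriptions=["a grand orchestral arrangement with epic brass fanfares"],
    melody_wavs=melody[None],  # add a batch dimension
    melody_sample_rate=sr,
)

# Write the result to disk with loudness normalization.
audio_write("orchestral_variant", wav[0].cpu(), model.sample_rate, strategy="loudness")
```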

Orchestral Variant

A grand orchestral arrangement with thunderous percussion, epic brass fanfares, and soaring strings, creating a cinematic atmosphere fit for a heroic battle.

Reggae Variant

classic reggae track with an electric guitar solo

Jazz Variant

smooth jazz, with a saxophone solo, piano chords, and snare full drums

How Good are the Results?

Although MusicGen doesn’t adhere closely to my text prompts and creates music that is slightly different from what I asked for, the generated pieces still accurately reflect the requested genre and, more importantly, each piece showcases its own interpretation of the main melody from the base track.

While the results are not perfect, I find the capabilities of this model quite impressive. The fact that MusicGen has been one of the most popular models on HuggingFace ever since its release further underlines its significance. With that said, let’s delve deeper into the technical aspects of how melody conditioning works.

Three text-music pairs as they are used for training models like MusicLM or MusicGen. Image by author.

Nearly all current generative music models follow the same procedure during training. They are supplied with a large database of music tracks accompanied by corresponding text descriptions. The model learns the connection between words and sounds, and how to convert a given text prompt into a coherent and enjoyable piece of music. During training, the model optimizes its own compositions by comparing them to the actual music tracks in the dataset. This enables the model to identify its strengths and the areas that require improvement.
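To make this more concrete, here is a deliberately simplified training step in PyTorch-style Python. All component names (text_encoder, audio_codec, and so on) are hypothetical stand-ins; real systems like MusicGen operate on compressed audio tokens from a neural codec and are considerably more involved.

```python
import torch.nn.functional as F

def training_step(model, text_encoder, audio_codec, texts, audio):
    """One simplified text-to-music training step (hypothetical components)."""
    # Turn the text descriptions into embeddings the model can condition on.
    text_emb = text_encoder(texts)

    # Turn the reference tracks into discrete audio tokens: the "ground truth".
    target_tokens = audio_codec.encode(audio)           # shape: [batch, time]

    # Predict each audio token from the text and the preceding tokens.
    logits = model(text_emb, target_tokens[:, :-1])      # shape: [batch, time-1, vocab]

    # Compare the prediction with the real track and update the model.
    loss = F.cross_entropy(logits.transpose(1, 2), target_tokens[:, 1:])
    loss.backward()
    return loss
```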

The problem lies in the fact that once a machine learning model is trained for a specific task, such as text-to-music generation, it is restricted to that particular task. While it is possible to make MusicGen perform certain tasks it was not explicitly trained for, like continuing a given piece of music, it cannot be expected to handle every music generation request. For example, it cannot simply take a melody and transform it into a different genre. That would be like throwing potatoes into a toaster and expecting fries to come out. Instead, a separate model must be trained to implement this functionality.

Let’s explore how Meta adapted the training procedure to enable MusicGen to generate variations of a given melody based on a text prompt. However, there are several challenges associated with this approach. One of the primary obstacles is the ambiguity in identifying “the melody” of a song and representing it in a computationally meaningful way. Nonetheless, for the purpose of understanding the new training procedure at a broader level, let’s assume a consensus on what constitutes “the melody” and that it can easily be extracted and fed into the model. In this scenario, the adjusted training method can be outlined as follows:

Three text-music-melody pairs as they are used for teaching MusicGen melody-conditioned generation.

For each track in the database, the first step is to extract its melody. Then, the model is fed with both the track’s text description and its corresponding melody, and prompted to recreate the original track. Essentially, this approach simplifies the original training objective, where the model was solely tasked with recreating the track based on text.

To understand why we do this, let’s ask ourselves what the AI model learns through this training procedure. In essence, it learns how a melody can be turned into a full piece of music based on a text description. This means that after training, we can provide the model with a melody and ask it to compose a piece of music in any genre, mood, or instrumentation. To the model, this is the same “semi-blind” generation task it has successfully carried out countless times during training.
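In code, the tweak is tiny compared to the sketch above: the training step receives one extra input, a melody representation extracted from the very track the model is asked to reconstruct (again, all component names are hypothetical).

```python
import torch.nn.functional as F

def melody_conditioned_step(model, text_encoder, audio_codec,
                            melody_extractor, texts, audio):
    """Same simplified step as before, with the melody as an extra condition."""
    text_emb = text_encoder(texts)
    melody = melody_extractor(audio)            # e.g. a chromagram per track
    target_tokens = audio_codec.encode(audio)

    # The model now reconstructs the track from text AND melody.
    logits = model(text_emb, melody, target_tokens[:, :-1])
    loss = F.cross_entropy(logits.transpose(1, 2), target_tokens[:, 1:])
    loss.backward()
    return loss
```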

Having grasped the technique Meta employed to teach the model melody-conditioned music generation, we still need to tackle the challenge of precisely defining what constitutes “the melody.”

The truth is, there is no objective way to determine or extract “the melody” of a polyphonic musical piece, except when all instruments are playing in unison. While there is often a prominent instrument such as a voice, guitar, or violin, that doesn’t necessarily mean the other instruments are not part of “the melody.” Take Queen’s “Bohemian Rhapsody” as an example. When you think of the song, you may first recall Freddie Mercury’s main vocal melodies. However, does that mean the piano in the intro, the background singers in the middle section, and the electric guitar before “So you think you can stone me […]” are not part of the melody?

One approach to extracting “the melody” of a song is to treat the most prominent melody as the most dominant one, typically identified as the loudest melody in the mix. The chromagram is a widely used representation that visually displays the most dominant musical notes throughout a track. Below, you’ll find the chromagram of the reference track, first with the full instrumentation and then excluding drums and bass. On the left side, the most relevant notes for the melody (B, F#, G) are highlighted in blue.

Both chromagrams accurately depict the main melody notes, with the version of the track without drums and bass providing a clearer visualization of the melody. Meta’s study made the same observation, which led them to use their source separation tool (DEMUCS) to remove any distracting rhythmic elements from the track. This process results in a sufficiently representative rendition of “the melody,” which can then be fed to the model.
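If you want to reproduce this yourself, the sketch below computes chromagrams with librosa, once for the full mix and once for a drum- and bass-free version. It assumes the stems were produced beforehand with the demucs command-line tool (e.g. `demucs base_track.wav`); the file names and output folder layout are illustrative and depend on your demucs version.

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt

def chroma(y, sr):
    """Chromagram: energy per pitch class over time."""
    return librosa.feature.chroma_cqt(y=y, sr=sr)

# Full mix.
full, sr = librosa.load("base_track.wav", sr=None)

# Approximate "the melody" content by summing the non-rhythmic stems.
vocals, _ = librosa.load("separated/htdemucs/base_track/vocals.wav", sr=sr)
other, _ = librosa.load("separated/htdemucs/base_track/other.wav", sr=sr)
n = min(len(vocals), len(other))
melodic = vocals[:n] + other[:n]

# Plot both chromagrams for comparison.
fig, axes = plt.subplots(2, 1, sharex=True, figsize=(10, 6))
librosa.display.specshow(chroma(full, sr), y_axis="chroma", x_axis="time", sr=sr, ax=axes[0])
axes[0].set_title("Full mix")
librosa.display.specshow(chroma(melodic, sr), y_axis="chroma", x_axis="time", sr=sr, ax=axes[1])
axes[1].set_title("Drums and bass removed")
plt.tight_layout()
plt.show()
```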

In summary, we can now connect the pieces to understand the underlying process when asking MusicGen to perform melody-conditioned generation. Here’s a visual representation of the workflow:

How MusicGen produces a melody-conditioned music output. Image by author.
Photo by Xavier von Erlach on Unsplash

While MusicGen shows promising advancements in melody conditioning, it is important to acknowledge that the technology is still a work in progress. Chromagrams, even with drums and bass removed, offer an imperfect representation of a track’s melody. One limitation is that chromagrams categorize all notes into the 12 western pitch classes, meaning they capture the transition between two pitch classes but not the direction (up or down) of the melody.

For example, the melodic interval between moving from C4 to G4 (a perfect fifth up) differs significantly from moving from C4 to G3 (a perfect fourth down). In a chromagram, however, both intervals look the same. The problem worsens with octave jumps, as the chromagram would indicate that the melody stayed on the same note. Consider how a chromagram would misinterpret the emotional octave jump performed by Céline Dion in “My Heart Will Go On” during the line “wher-e-ver you are” as a stable melodic movement. To demonstrate this, just take a look at the chromagram for the chorus of A-ha’s “Take on Me”, below. Does this reflect your idea of the song’s melody?

A chromagram of the chorus in “Take on Me” (A-ha), bass and drums removed. Image by author.
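The ambiguity is easy to verify with a few lines of code: mapping notes to the 12 pitch classes, which is all a chromagram sees, erases both the direction and the size of an interval.

```python
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
MIDI = {"G3": 55, "C4": 60, "G4": 67, "C5": 72}

def pitch_class(midi_note):
    """The chroma bin a note falls into: octave information is discarded."""
    return PITCH_CLASSES[midi_note % 12]

print(pitch_class(MIDI["C4"]), "->", pitch_class(MIDI["G4"]))  # C -> G  (perfect fifth up)
print(pitch_class(MIDI["C4"]), "->", pitch_class(MIDI["G3"]))  # C -> G  (perfect fourth down)
print(pitch_class(MIDI["C4"]), "->", pitch_class(MIDI["C5"]))  # C -> C  (octave jump looks static)
```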

Another challenge is the inherent bias of the chromagram. It performs well in capturing the melody of some songs while completely missing the mark on others. This bias is systematic rather than random. Songs with dominant melodies, minimal interval jumps, and unison playing are better represented by the chromagram than songs with complex melodies spread across multiple instruments and featuring large interval jumps.

Moreover, the limitations of the generative AI model itself are worth noting. The output audio still exhibits noticeable differences from human-made music, and maintaining a consistent style over a six-second interval remains a struggle. Furthermore, MusicGen falls short of faithfully capturing the more intricate aspects of the text prompt, as evidenced by the examples provided earlier. It will take further technological advancements for melody-conditioned generation to reach a level where it can be used not just for amusement and inspiration but also for generating end-user-friendly music.

Photo by Marc Sendra Martorell on Unsplash

How can we improve the AI?

From my perspective, one of the primary concerns that future research should address regarding melody-conditioned music generation is the extraction and representation of “the melody” from a track. While the chromagram is a well-established and straightforward signal processing method, there are many newer and experimental approaches that utilize deep learning for this purpose. It would be exciting to see companies like Meta drawing inspiration from these advancements, many of which are covered in a comprehensive 72-page review by Reddy et al. (2022).

Regarding the quality of the model itself, both the audio quality and the comprehension of text inputs could be enhanced by scaling up the size of the model and the training data, as well as by developing more efficient algorithms for this specific task. In my view, the release of MusicLM in January 2023 resembles a “GPT-2 moment.” We are starting to witness the capabilities of these models, but significant improvements are still needed across various aspects. If this analogy holds true, we can anticipate the release of a music generation model akin to GPT-3 sooner than we might expect.

How does this impact musicians?

As is often the case with generative music AI, concerns arise regarding the potential negative impact on the work and livelihoods of music creators. I expect that in the future, it will become increasingly difficult to earn a living by creating variations of existing melodies. This is particularly evident in scenarios such as jingle production, where companies can effortlessly generate numerous variations of a characteristic jingle melody at minimal cost for new ad campaigns or personalized advertisements. Undoubtedly, this poses a threat to musicians who rely on such activities as a significant source of income. I reiterate my plea for creatives involved in producing music valued for its objective musical qualities rather than its subjective, human qualities (such as stock music or jingles) to explore alternative income sources to prepare for the future.

On the positive side, melody-conditioned music generation presents an incredible tool for enhancing human creativity. If someone develops a catchy and memorable melody, they can quickly generate examples of how it would sound in various genres. This process can help identify the right genre and style to bring the music to life. Furthermore, it offers an opportunity to revisit past projects within one’s music catalogue and explore their potential when translated into different genres or styles. Finally, this technology lowers the entry barrier for creatively inclined individuals without formal musical training. Anyone can now come up with a melody, hum it into a smartphone microphone, and share remarkable arrangements of their ideas with friends and family, or even attempt to reach a wider audience.

The question of whether AI music generation is beneficial to our societies remains open for debate. However, I firmly believe that melody-conditioned music generation is one of the use cases of this technology that genuinely enhances the work of both professional and aspiring creatives. It adds value by offering new avenues for exploration. I’m eagerly looking forward to witnessing further advancements in this field in the near future.
