Text-to-Music Generative AI : Stability Audio, Google’s MusicLM and More

Music, an art form that resonates with the human soul, has been a constant companion to us all. Creating music using artificial intelligence began several decades ago. Initially, the attempts were simple and intuitive, with basic algorithms creating monotonous tunes. However, as technology advanced, so did the complexity and capabilities of AI music generators, paving the way for deep learning and Natural Language Processing (NLP) to play pivotal roles in this technology.

Today, platforms like Spotify are leveraging AI to fine-tune their users’ listening experiences. These deep-learning algorithms dissect individual preferences based on various musical elements such as tempo and mood to craft personalized song suggestions. They even analyze broader listening patterns and scour the web for song-related discussions to build detailed song profiles.

The Origin of AI in Music: A Journey from Algorithmic Composition to Generative Modeling

In the early stages of AI in the music world, spanning from the 1950s to the 1970s, the focus was primarily on algorithmic composition. This was a method where computers used a defined set of rules to create music. The first notable creation during this period was the Illiac Suite for String Quartet in 1957. It used the Monte Carlo algorithm, a process involving random numbers to dictate the pitch and rhythm within the confines of traditional musical theory and statistical probabilities.
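To make the idea concrete, here is a minimal sketch of Monte Carlo style composition: propose a random pitch, then keep it only if it passes a musical rule. The scale, the leap limit, and the function name are illustrative assumptions, not a reconstruction of the actual Illiac Suite program.

```python
import random

# Pitches of a C major scale as MIDI note numbers (an illustrative choice).
C_MAJOR = [60, 62, 64, 65, 67, 69, 71, 72]
MAX_LEAP = 5  # hypothetical rule: reject leaps larger than five semitones


def monte_carlo_melody(length, seed=0):
    """Build a melody by drawing random pitches and keeping rule-abiding ones."""
    rng = random.Random(seed)
    melody = [rng.choice(C_MAJOR)]
    while len(melody) < length:
        candidate = rng.choice(C_MAJOR)              # random proposal
        if abs(candidate - melody[-1]) <= MAX_LEAP:  # rule check
            melody.append(candidate)                 # accept, otherwise redraw
    return melody


melody = monte_carlo_melody(8)
print(melody)
```

Changing the seed gives a different melody, while the rule check keeps every result inside the same stylistic constraints.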

Image generated by the author using Midjourney

During this time, another pioneer, Iannis Xenakis, used stochastic processes, a concept involving random probability distributions, to craft music. He used computers and the FORTRAN language to connect multiple probability functions, creating a pattern where different graphical representations corresponded to diverse sound spaces.

The Complexity of Translating Text into Music

Music is stored in a rich, multi-dimensional data format that encompasses elements such as melody, harmony, rhythm, and tempo, making the task of translating text into music highly complex. A standard song is represented by millions of numbers in a computer, a figure significantly higher than for other data formats such as images or text.
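A quick back-of-the-envelope calculation shows where such figures come from. The durations, sample rates, and channel counts below are assumptions chosen for illustration; the exact count for any given song depends on all three.

```python
def num_samples(duration_s, sample_rate_hz, channels):
    """Count the raw numbers needed to store uncompressed audio."""
    return int(duration_s * sample_rate_hz * channels)


# A three-minute stereo track at the CD sample rate of 44.1 kHz.
cd_song = num_samples(180, 44_100, 2)
# A 30-second mono clip at 24 kHz.
short_clip = num_samples(30, 24_000, 1)

print(cd_song)     # 15876000
print(short_clip)  # 720000
```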

The field of audio generation is witnessing innovative approaches to overcome the challenges of creating realistic sound. One method involves generating a spectrogram, and then converting it back into audio.
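As a rough sketch of what a spectrogram is, the code below slices a signal into overlapping frames and measures the strength of each frequency in every frame using a plain discrete Fourier transform. The frame size, hop, and test tone are illustrative assumptions. Note that it keeps only magnitudes, so converting such a spectrogram back into audio requires estimating the discarded phase, which is one reason this route is difficult.

```python
import cmath
import math


def stft_magnitudes(signal, frame_size, hop):
    """Return one list of frequency-bin magnitudes per overlapping frame."""
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = signal[start:start + frame_size]
        # A Hann window reduces leakage between neighbouring frequency bins.
        windowed = [x * 0.5 * (1 - math.cos(2 * math.pi * n / frame_size))
                    for n, x in enumerate(frame)]
        mags = []
        for k in range(frame_size // 2 + 1):
            acc = sum(x * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n, x in enumerate(windowed))
            mags.append(abs(acc))
        frames.append(mags)
    return frames


SAMPLE_RATE = 8000
FRAME_SIZE = 64
# A pure 1000 Hz sine wave as a stand-in for real audio.
tone = [math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE) for n in range(512)]

spec = stft_magnitudes(tone, frame_size=FRAME_SIZE, hop=32)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / FRAME_SIZE
print(len(spec), "frames, strongest frequency:", peak_hz, "Hz")
```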

Another strategy leverages the symbolic representation of music, like sheet music, which can be interpreted and played by musicians. This method has been digitized successfully, with tools like Magenta’s Chamber Ensemble Generator creating music in the MIDI format, a protocol that facilitates communication between computers and musical instruments.
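A small sketch helps show how compact symbolic music is compared with raw audio. The `NoteEvent` class below is a hypothetical simplification (real MIDI uses note-on and note-off messages), while the pitch-to-frequency formula is the standard equal-temperament convention in which MIDI note 69 is A4 at 440 Hz.

```python
from dataclasses import dataclass


@dataclass
class NoteEvent:
    """Hypothetical, simplified stand-in for a pair of MIDI note messages."""
    pitch: int         # MIDI note number, 0 to 127
    start_s: float     # onset time in seconds
    duration_s: float  # length in seconds
    velocity: int      # loudness, 0 to 127


def midi_to_hz(pitch):
    """Convert a MIDI note number to frequency (A4 = note 69 = 440 Hz)."""
    return 440.0 * 2 ** ((pitch - 69) / 12)


# A C major arpeggio described by three events instead of thousands of samples.
phrase = [
    NoteEvent(60, 0.0, 0.5, 90),
    NoteEvent(64, 0.5, 0.5, 90),
    NoteEvent(67, 1.0, 1.0, 100),
]
freqs = [round(midi_to_hz(note.pitch), 2) for note in phrase]
print(freqs)  # [261.63, 329.63, 392.0]
```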

While these approaches have advanced the field, they come with their own set of limitations, underscoring the complex nature of audio generation.

Transformer-based autoregressive models and U-Net-based diffusion models are at the forefront of the technology, producing state-of-the-art (SOTA) results in generating audio, text, music, and much more. OpenAI’s GPT series and almost all other LLMs are currently powered by transformers using encoder, decoder, or encoder-decoder architectures. On the art and image side, Midjourney, Stability AI, and DALL-E 2 all leverage diffusion frameworks. These two core technologies have been key to achieving SOTA results in the audio sector as well. In this article, we’ll delve into Google’s MusicLM and Stable Audio, which stand as a testament to the remarkable capabilities of these technologies.

Google’s MusicLM

Google’s MusicLM was released in May this year. MusicLM can generate high-fidelity music pieces that match the sentiment described in the text. Using hierarchical sequence-to-sequence modeling, MusicLM can transform text descriptions into music at 24 kHz that remains consistent over extended durations.

The model operates on a multi-dimensional level, not only adhering to the textual inputs but also demonstrating the ability to be conditioned on melodies. This means it can take a hummed or whistled melody and transform it according to the style described in a text caption.

Technical Insights

MusicLM builds on the principles of AudioLM, a framework introduced in 2022 for audio generation. AudioLM treats audio synthesis as a language modeling task within a discrete representation space, using a hierarchy of coarse-to-fine discrete audio units, also known as tokens. This approach ensures high fidelity and long-term coherence over substantial durations.

To facilitate the generation process, MusicLM extends the capabilities of AudioLM to include text conditioning, a technique that aligns the generated audio with the nuances of the input text. This is achieved through a shared embedding space created using MuLan, a joint music-text model trained to project music and its corresponding text descriptions close to each other in an embedding space. This strategy effectively eliminates the need for captions during training, allowing the model to be trained on massive audio-only corpora.
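The toy example below illustrates the shared embedding idea. The three-dimensional vectors are invented for illustration (a real joint model would compute them from audio and text), and cosine similarity is used here as an assumed way of measuring closeness. Because matching music and text land near each other, a generator can be conditioned on the audio embedding during training and on a text embedding at inference time.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Hypothetical embeddings; a real joint model would produce these.
audio_embedding = [0.9, 0.1, 0.2]  # stand-in for an embedded jazz clip
text_embeddings = {
    "jazz with saxophone": [0.8, 0.2, 0.1],
    "heavy metal guitar": [0.1, 0.9, 0.3],
}

# The caption whose embedding lies closest to the audio embedding.
best_caption = max(
    text_embeddings,
    key=lambda caption: cosine_similarity(audio_embedding, text_embeddings[caption]),
)
print(best_caption)  # jazz with saxophone
```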

MusicLM also uses SoundStream as its audio tokenizer, which can reconstruct 24 kHz music at 6 kbps with impressive fidelity, leveraging residual vector quantization (RVQ) for efficient and high-quality audio compression.
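Residual vector quantization can be sketched in a few lines: each stage picks the nearest entry in its codebook, and the next stage quantizes whatever error is left over. The tiny hand-written codebooks below are assumptions for illustration; real codebooks are learned. The bitrate helper uses illustrative values (50 frames per second, 12 quantizers, 1,024 entries each) chosen because they multiply out to 6 kbps; the text above does not state these settings.

```python
import math


def nearest(codebook, vector):
    """Index of the codebook entry closest to vector (squared distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vector)),
    )


def rvq_encode(vector, codebooks):
    """Quantize in stages; each stage encodes the previous stage's residual."""
    residual = list(vector)
    indices = []
    for codebook in codebooks:
        i = nearest(codebook, residual)
        indices.append(i)
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return indices


def rvq_decode(indices, codebooks):
    """Sum the chosen entry from every stage to rebuild the vector."""
    out = [0.0] * len(codebooks[0][0])
    for i, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[i])]
    return out


def bitrate_bps(frames_per_s, n_quantizers, codebook_size):
    """Bits per second needed to transmit all token indices."""
    return frames_per_s * n_quantizers * math.log2(codebook_size)


# Toy codebooks: a coarse stage followed by a fine stage.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0], [1.0, -1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, -0.25]],
]
frame = [1.2, 0.9]
codes = rvq_encode(frame, codebooks)
approx = rvq_decode(codes, codebooks)
print(codes, approx)              # [1, 1] [1.25, 1.0]
print(bitrate_bps(50, 12, 1024))  # 6000.0
```

The first index carries the coarse shape of the frame and later indices refine it, which mirrors the coarse-to-fine token hierarchy described for AudioLM.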

An illustration of the independent pretraining process for the foundational models of MusicLM: SoundStream, w2v-BERT, and MuLan | Image source: here

Furthermore, MusicLM expands its capabilities by allowing melody conditioning. This approach ensures that even a simple hummed tune can lay the foundation for an impressive auditory experience, fine-tuned to the exact textual style descriptions.

The developers of MusicLM have also open-sourced MusicCaps, a dataset featuring 5.5k music-text pairs, each accompanied by rich text descriptions crafted by human experts. You can check it out here: MusicCaps on Hugging Face.

Ready to create AI soundtracks with Google’s MusicLM? Here’s how to get started:

  1. Visit the official MusicLM website and click “Get Started.”
  2. Join the waitlist by choosing “Register your interest.”
  3. Log in using your Google account.
  4. Once granted access, click “Try Now” to start.

Below are a few example prompts I experimented with:

“Meditative song, calming and soothing, with flutes and guitars. The music is slow, with a focus on creating a sense of peace and tranquility.”

“jazz with saxophone”

In a qualitative evaluation against previous SOTA models such as Riffusion and Mubert, MusicLM was preferred over the others, with participants rating it favorably on how well 10-second audio clips matched their text captions.

MusicLM Performance, Image source: here

Stability Audio

Stability AI last week introduced “Stable Audio”, a latent diffusion model architecture conditioned on text metadata alongside audio file duration and start time. This approach, like Google’s MusicLM, offers control over the content and length of the generated audio, allowing for the creation of audio clips with specified lengths up to the training window size.

Stable Audio

Technical Insights

Stable Audio comprises several components, including a Variational Autoencoder (VAE) and a U-Net-based conditioned diffusion model, working together with a text encoder.

An illustration showcasing the integration of a variational autoencoder (VAE), a text encoder, and a U-Net-based conditioned diffusion model

Stable Audio Architecture, Image source: here

The VAE enables faster generation and training by compressing stereo audio into a compact, noise-resistant, and invertible lossy latent encoding, bypassing the need to work with raw audio samples.

The text encoder, derived from a CLAP model, plays a pivotal role in understanding the intricate relationships between words and sounds, offering an informative representation of the tokenized input text. This is achieved by using text features from the penultimate layer of the CLAP text encoder, which are then integrated into the diffusion U-Net through cross-attention layers.
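Cross-attention can be sketched as follows: each audio latent position forms a query, scores every text feature, and takes a weighted average of those features. This is a deliberately simplified version that omits the learned projection matrices and multiple heads a real U-Net layer would have, and the vectors are invented for illustration.

```python
import math


def softmax(xs):
    """Turn raw scores into positive weights that sum to one."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention from queries onto a separate sequence."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Hypothetical data: two audio latent positions attend over three text features.
latent_queries = [[1.0, 0.0], [0.0, 1.0]]
text_features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

attended = cross_attention(latent_queries, text_features, text_features)
print(attended)
```

Each output row is a blend of the text features, weighted toward the ones most similar to that latent position.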

A crucial aspect is the incorporation of timing embeddings, which are calculated based on two properties: the start second of the audio chunk and the total duration of the original audio file. These values, translated into per-second discrete learned embeddings, are combined with the prompt tokens and fed into the U-Net’s cross-attention layers, empowering users to dictate the overall length of the output audio.
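A minimal sketch of this timing conditioning might look like the code below. The table size, embedding width, random initial values, and all names are assumptions for illustration; in a trained model the table entries would be learned, and appending the two timing vectors to the prompt sequence is one plausible reading of "combined".

```python
import random

EMBED_DIM = 4     # hypothetical embedding width
MAX_SECONDS = 95  # hypothetical training window, in whole seconds

rng = random.Random(0)


def make_table(rows):
    """One embedding vector per whole second (random here, learned in practice)."""
    return [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(rows)]


start_table = make_table(MAX_SECONDS + 1)
total_table = make_table(MAX_SECONDS + 1)


def timing_embeddings(seconds_start, seconds_total):
    """Look up discrete per-second embeddings for the two timing values."""
    s = min(max(int(seconds_start), 0), MAX_SECONDS)
    t = min(max(int(seconds_total), 0), MAX_SECONDS)
    return [start_table[s], total_table[t]]


def build_conditioning(prompt_features, seconds_start, seconds_total):
    """Append the timing vectors to the prompt features as extra tokens."""
    return list(prompt_features) + timing_embeddings(seconds_start, seconds_total)


# Three hypothetical prompt token features, then a request for a 30-second clip.
prompt_features = [[0.1] * EMBED_DIM, [0.2] * EMBED_DIM, [0.3] * EMBED_DIM]
conditioning = build_conditioning(prompt_features, seconds_start=0, seconds_total=30)
print(len(conditioning))  # 5
```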

The Stable Audio model was trained using an extensive dataset of over 800,000 audio files, through a collaboration with stock music provider AudioSparx.

Stable audio commercials

Stable Audio offers a free version, allowing 20 generations of up to 20-second tracks per month, and a $12/month Pro plan, permitting 500 generations of up to 90-second tracks.
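To put those limits in perspective, the short calculation below works out the maximum audio each plan could yield in a month, assuming every generation is used at its full length. The function name is illustrative.

```python
def max_seconds_per_month(generations, max_track_s):
    """Upper bound on generated audio if every track uses the full length."""
    return generations * max_track_s


free_seconds = max_seconds_per_month(20, 20)
pro_seconds = max_seconds_per_month(500, 90)
pro_minutes = pro_seconds / 60
pro_cost_per_minute = 12 / pro_minutes

print(free_seconds)         # 400
print(pro_minutes)          # 750.0
print(pro_cost_per_minute)  # 0.016
```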

Below is an audio clip that I created using Stable Audio.

Image generated by the author using Midjourney

“Cinematic, Soundtrack Gentle Rainfall, Ambient, Soothing, Distant Dogs Barking, Calming Leaf Rustle, Subtle Wind, 40 BPM”

The applications of such finely crafted audio pieces are limitless. Filmmakers can leverage this technology to create rich and immersive soundscapes. In the commercial sector, advertisers can use these tailored audio tracks. Moreover, this tool opens up avenues for individual creators and artists to experiment and innovate, offering a canvas of unlimited potential to craft sound pieces that narrate stories, evoke emotions, and create atmospheres with a depth that was previously hard to achieve without a substantial budget or technical expertise.

Prompting Suggestions

Craft the perfect audio using text prompts. Here’s a quick guide to get you started:

  1. Be Detailed: Specify genres, moods, and instruments. For example: Cinematic, Wild West, Percussion, Tense, Atmospheric
  2. Mood Setting: Mix musical and emotional terms to convey the desired mood.
  3. Instrument Choice: Enhance instrument names with adjectives, like “Reverberated Guitar” or “Powerful Choir”.
  4. BPM: Align the tempo with the genre for a harmonious output, such as “170 BPM” for a Drum and Bass track.
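The tips above can be folded into a small helper that assembles a comma-separated prompt. The function and its parameters are hypothetical conveniences for organizing your own prompts, not part of any Stable Audio API.

```python
def build_prompt(genres, moods, instruments, bpm=None):
    """Join genre, mood, and instrument terms, with an optional tempo at the end."""
    parts = list(genres) + list(moods) + list(instruments)
    if bpm is not None:
        parts.append(f"{bpm} BPM")
    return ", ".join(parts)


prompt = build_prompt(
    genres=["Drum and Bass"],
    moods=["Tense", "Atmospheric"],
    instruments=["Reverberated Guitar", "Powerful Choir"],
    bpm=170,
)
print(prompt)
# Drum and Bass, Tense, Atmospheric, Reverberated Guitar, Powerful Choir, 170 BPM
```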

Closing Notes

Image generated by the author using Midjourney

In this article, we have delved into AI-generated music and audio, from algorithmic compositions to the sophisticated generative AI frameworks of today like Google’s MusicLM and Stability Audio. These technologies, leveraging deep learning and SOTA compression models, not only enhance music generation but also fine-tune listeners’ experiences.

Yet, it is a domain in constant evolution, with hurdles like maintaining long-term coherence and the ongoing debate on the authenticity of AI-crafted music challenging the pioneers in this field. Just a week ago, the buzz was all about an AI-crafted song channeling the styles of Drake and The Weeknd, which had initially caught fire online earlier this year. However, it faced removal from the Grammy nomination list, showcasing the ongoing debate surrounding the legitimacy of AI-generated music in the industry (source). As AI continues to bridge gaps between music and listeners, it is undoubtedly promoting an ecosystem where technology coexists with art, fostering innovation while respecting tradition.

