Until now, most generative music models have produced mono sound. This means that MusicGen didn't place any sounds or instruments on the left or right side, resulting in a less lively and exciting mix. The reason stereo sound has been mostly neglected so far is that generating it is not a trivial task.
As musicians, when we produce stereo signals, we have access to the individual instrument tracks in our mix and can place them wherever we want. MusicGen doesn't generate each instrument separately but instead produces one combined audio signal. Without access to these instrument sources, creating stereo sound is hard. Unfortunately, splitting an audio signal into its individual sources is a tricky problem (I've published a blog post about that), and the technology is still not 100% ready.
Therefore, Meta decided to build stereo generation directly into the MusicGen model. Using a new dataset of stereo music, they trained MusicGen to produce stereo outputs. The researchers claim that generating stereo adds no additional computing cost compared to mono.
Although I find the stereo procedure not very clearly described in the paper, my understanding is that it works like this (Figure 3): MusicGen has learned to generate two compressed audio signals (a left and a right channel) instead of one mono signal. These compressed signals must then be decoded individually before they are combined into the final stereo output. The reason this process doesn't take twice as long is that MusicGen can now produce the two compressed signals in roughly the same time it previously needed for one.
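If you want to hear this for yourself, here is a minimal sketch using Meta's open-source audiocraft library. The checkpoint name facebook/musicgen-stereo-small and the exact calls reflect the released code at the time of writing and may change between versions; the prompt and output path are just placeholders.

```python
# Minimal sketch: generating a stereo clip with audiocraft.
# Assumes the audiocraft package and the released stereo checkpoint
# "facebook/musicgen-stereo-small"; names may differ across versions.
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained("facebook/musicgen-stereo-small")
model.set_generation_params(duration=10)  # seconds of audio to generate

# One text prompt -> one waveform tensor of shape [batch, channels, samples];
# for the stereo checkpoints, channels == 2 (left and right).
wav = model.generate(["80s synth-pop with wide stereo pads"])

stereo = wav[0]            # shape: [2, samples]
print(stereo.shape)        # sanity check: two channels

# Write the stereo file to disk (audio_write appends the .wav extension).
audio_write("stereo_demo", stereo.cpu(), model.sample_rate, strategy="loudness")
```

The decoded tensor already contains both channels, so no extra mixing step is needed on our side; the left/right placement comes entirely from the model.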
Being able to produce convincing stereo sound really sets MusicGen apart from other state-of-the-art models like MusicLM or Stable Audio. From my perspective, this "little" addition makes a huge difference in the liveliness of the generated music. Listen for yourselves (the difference might be hard to hear on smartphone speakers):
Mono
Stereo
MusicGen was impressive from the day it was released. Since then, however, Meta's FAIR team has been continually improving the model, delivering higher-quality results that sound more authentic. When it comes to text-to-music models that generate audio signals (not MIDI etc.), MusicGen is ahead of its competitors from my perspective (as of November 2023).
Further, since MusicGen and all its related models (EnCodec, AudioGen) are open-source, they are a great source of inspiration and a go-to framework for aspiring AI audio engineers. Looking at the improvements MusicGen has made in just six months, I can only imagine that 2024 will be an exciting year.
Another important point is that with their transparent approach, Meta is also doing foundational work for developers who want to integrate this technology into software for musicians. Generating samples, brainstorming musical ideas, or changing the genre of an existing work: these are some of the exciting applications we're already starting to see. With a sufficient level of transparency, we can make sure we're building a future where AI makes creating music more exciting instead of being merely a threat to human musicianship.