
In text-to-music synthesis, the quality of generated content has been advancing, but the controllability of musical features remains underexplored. A team of researchers from the Singapore University of Technology and Design and Queen Mary University of London introduced Mustango to address this challenge. Mustango extends the Tango text-to-audio model, aiming to control generated music not only with general text captions but with richer captions containing specific instructions related to chords, beats, tempo, and key.
The researchers present Mustango as a music-domain-knowledge-inspired text-to-music system based on diffusion models. They highlight the unique challenges in generating music directly from a diffusion model, emphasizing the need to balance alignment with the conditioning text against musicality. Mustango enables musicians, producers, and sound designers to create music clips under specific conditions such as chord progression, tempo, and key.
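To make the idea of a control-rich caption concrete, here is a minimal Python sketch. The control-sentence wording is paraphrased from the paper's description and may not match the exact templates used in training; the `Mustango` class and `generate` call are modeled on the project's public examples, but treat the class name and call signature as assumptions.

```python
# Build a caption that combines a free-text description with explicit
# musical control sentences (chords, beat count, tempo, key).
description = "A mellow lo-fi track with warm electric piano and soft drums."
controls = (
    "The chord progression in this song is Am, F, C, G. "
    "The beat counts to 4. "
    "The bpm of this song is 80. "
    "The key of this song is A minor."
)
prompt = f"{description} {controls}"

# Hypothetical generation call, modeled on the project's public examples;
# the exact class name and method signature are assumptions.
from mustango import Mustango

model = Mustango("declare-lab/mustango")
audio = model.generate(prompt)  # a generated audio clip matching the controls
```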
As part of Mustango, the researchers propose MuNet, a Music-Domain-Knowledge-Informed UNet sub-module. MuNet integrates music-specific features predicted from the text prompt, including chords, beats, key, and tempo, into the diffusion denoising process. To overcome the limited availability of open datasets pairing music with text captions, the researchers introduce a novel data augmentation method. This method alters the harmonic, rhythmic, and dynamic features of music audio and uses Music Information Retrieval methods to extract music features, which are then appended to the existing text descriptions, resulting in the MusicBench dataset.
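The summary above does not detail MuNet's internals, but the core mechanism, encoding the predicted music features and letting the denoiser attend to them alongside the text embedding, can be sketched in PyTorch. This is an illustrative simplification under assumed feature encodings and dimensions, not the authors' architecture:

```python
import torch
import torch.nn as nn

class MusicConditionedDenoiser(nn.Module):
    """Toy stand-in for a MuNet-style denoiser: music-feature embeddings
    (chords, beats, key, tempo) are concatenated with the text embedding
    and used as cross-attention context during denoising."""

    def __init__(self, latent_dim=64, ctx_dim=128, n_heads=4):
        super().__init__()
        # Separate encoders for each music feature (input sizes are assumed).
        self.chord_enc = nn.Linear(12, ctx_dim)   # e.g. pitch-class chord vector
        self.beat_enc = nn.Linear(4, ctx_dim)     # e.g. beat-count features
        self.key_enc = nn.Embedding(24, ctx_dim)  # 12 keys x major/minor
        self.tempo_enc = nn.Linear(1, ctx_dim)    # scalar bpm
        self.attn = nn.MultiheadAttention(latent_dim, n_heads,
                                          kdim=ctx_dim, vdim=ctx_dim,
                                          batch_first=True)
        self.out = nn.Linear(latent_dim, latent_dim)

    def forward(self, noisy_latent, text_ctx, chords, beats, key, tempo):
        # Stack the encoded music features into extra context tokens.
        music_ctx = torch.stack([
            self.chord_enc(chords),
            self.beat_enc(beats),
            self.key_enc(key),
            self.tempo_enc(tempo),
        ], dim=1)
        # The denoiser attends over text and music conditions jointly.
        ctx = torch.cat([text_ctx, music_ctx], dim=1)
        h, _ = self.attn(noisy_latent, ctx, ctx)
        return self.out(h)  # predicted noise residual
```

In the full system, the chord, beat, key, and tempo inputs come from dedicated predictors applied to the text prompt, so the user only supplies a caption.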
The MusicBench dataset comprises over 52,000 instances, enriching the original text descriptions with beat and downbeat locations, the underlying chord progression, key, and tempo. The researchers conduct extensive experiments demonstrating that Mustango achieves state-of-the-art music quality. They emphasize the controllability of Mustango through music-specific text prompts, showcasing superior performance in capturing desired chords, beats, keys, and tempo across multiple datasets. They also assess the robustness of the control predictors in scenarios where control sentences are absent from the prompt and observe that Mustango outperforms Tango in such cases, indicating that the control predictors do not compromise performance.
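The MusicBench construction recipe, augment the audio, re-extract features, and append control sentences, can be approximated with standard tools: pitch shifting perturbs harmony, time stretching perturbs rhythm and tempo, and gain changes perturb dynamics. Below is a rough sketch using librosa; the library choice, parameter values, and caption templates are assumptions, and the paper's exact procedure may differ:

```python
import librosa

def augment_and_caption(path, base_caption, n_steps=2, rate=1.1, gain_db=-3.0):
    """Create one augmented variant of a clip plus an enriched caption,
    mirroring the three augmentation axes described for MusicBench."""
    y, sr = librosa.load(path, sr=None)

    # Harmonic: shift all pitches by n_steps semitones.
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
    # Rhythmic: speed the clip up (or slow it down) by the given rate.
    y = librosa.effects.time_stretch(y, rate=rate)
    # Dynamic: apply a gain in decibels.
    y = y * (10.0 ** (gain_db / 20.0))

    # Music Information Retrieval step: re-estimate tempo and beats from
    # the augmented audio (a simple stand-in for the paper's full
    # chord/key/beat/tempo extraction).
    tempo, beats = librosa.beat.beat_track(y=y, sr=sr)

    # Append control sentences to the original caption (templates assumed).
    enriched = (f"{base_caption} The bpm of this song is {float(tempo):.0f}. "
                f"It has {len(beats)} detected beats.")
    return y, sr, enriched
```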
The experiments include comparisons with baselines such as Tango, as well as variants of Mustango, demonstrating the effectiveness of the proposed data augmentation approach in enhancing performance. Mustango trained from scratch is highlighted as the best performer, surpassing Tango and the other variants in terms of audio quality, rhythm presence, and harmony. Mustango has 1.4B parameters, far more than Tango.
In conclusion, the researchers present Mustango as a significant advancement in text-to-music synthesis. They address the controllability gap in existing systems and demonstrate the effectiveness of their proposed method through extensive experiments. Mustango not only achieves state-of-the-art music quality but also provides enhanced controllability, making it a valuable contribution to the field. The researchers release the MusicBench dataset, offering a resource for future research in text-to-music synthesis.
Check out the Paper and GitHub. All credit for this research goes to the researchers of this project. Also, don't forget to join our 33k+ ML SubReddit, 41k+ Facebook Community, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.
If you like our work, you will love our newsletter.
Pragati Jhunjhunwala is a consulting intern at MarktechPost. She is currently pursuing her B.Tech from the Indian Institute of Technology (IIT), Kharagpur. She is a tech enthusiast and has a keen interest in the scope of software and data science applications. She is always reading about developments in the various fields of AI and ML.