Music is an art composed of harmony, melody, and rhythm that permeates every aspect of human life. With the blossoming of deep generative models, music generation has drawn much attention in recent years. As a distinguished class of generative models, language models (LMs) have shown extraordinary capability in modeling complex relationships across long-term contexts. In light of this, AudioLM and many follow-up works successfully applied LMs to audio synthesis. Concurrent with the LM-based approaches, diffusion probabilistic models (DPMs), another competitive class of generative models, have also demonstrated exceptional abilities in synthesizing speech, sounds, and music.
Nevertheless, generating music from free-form text remains difficult because valid music descriptions are highly diverse, relating to genres, instruments, tempo, scenarios, and even subjective feelings.
Traditional text-to-music generation models often handle only specific musical properties, while some models prioritize descriptions that are occasionally annotated by experts in the field. Moreover, most are trained on large-scale music datasets and have demonstrated state-of-the-art generative performance, with high fidelity and adherence to various facets of the text prompts.
Yet, the success of these methods, such as MusicLM and Noise2Music, comes with high computational costs, which can severely limit their practicality. In comparison, other approaches built upon DPMs enable efficient sampling of high-quality music. Nevertheless, their demonstrated cases were comparatively short and showed limited in-sample dynamics. For a feasible music creation tool, high generation efficiency is crucial because it facilitates interactive creation that takes human feedback into account, as in a previous study.
While LMs and DPMs each showed promising results, the relevant question is not whether one should be preferred over the other but whether it is possible to leverage the advantages of both approaches simultaneously.
Motivated by this question, the authors developed an approach termed MeLoDy. An overview of the method is presented in the figure below.
After analyzing the success of MusicLM, the authors leverage its highest-level LM, termed the semantic LM, to model the semantic structure of music, determining the overall arrangement of melody, rhythm, dynamics, timbre, and tempo. Conditioned on this semantic LM, they exploit the non-autoregressive nature of DPMs to model the acoustics efficiently and effectively, with the help of a successful sampling-acceleration technique.
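To make this two-stage design concrete, below is a minimal PyTorch-style sketch, not the authors' implementation: the class names `SemanticLM` and `AcousticDiffusion`, the GRU backbone, and all dimensions are hypothetical stand-ins. It contrasts the token-by-token autoregressive sampling of the semantic stage with the few-step, parallel denoising of the acoustic stage.

```python
# Hypothetical sketch of an LM-guided diffusion pipeline; not MeLoDy's code.
import torch
import torch.nn as nn

class SemanticLM(nn.Module):
    """Stage 1: text conditioning -> discrete semantic tokens, autoregressively."""
    def __init__(self, vocab=1024, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)  # stand-in for a Transformer
        self.head = nn.Linear(dim, vocab)

    @torch.no_grad()
    def generate(self, text_state, n_tokens):
        h = text_state.unsqueeze(0)                  # pooled prompt embedding as state
        tok = torch.zeros(1, 1, dtype=torch.long)    # BOS token id 0 (assumption)
        out = []
        for _ in range(n_tokens):                    # O(n) sequential steps
            y, h = self.rnn(self.embed(tok[:, -1:]), h)
            tok = self.head(y).argmax(-1)            # greedy decoding for simplicity
            out.append(tok)
        return torch.cat(out, dim=1)                 # (1, n_tokens) semantic tokens

class AcousticDiffusion(nn.Module):
    """Stage 2: semantic tokens -> acoustic latents, denoised non-autoregressively."""
    def __init__(self, dim=64, vocab=1024):
        super().__init__()
        self.cond = nn.Embedding(vocab, dim)
        self.denoise = nn.Sequential(nn.Linear(2 * dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    @torch.no_grad()
    def sample(self, sem_tokens, steps=6):           # few steps = accelerated sampling
        c = self.cond(sem_tokens)                    # (1, T, dim) conditioning
        x = torch.randn_like(c)                      # start from pure noise
        for _ in range(steps):                       # each step refines ALL positions
            x = x - self.denoise(torch.cat([x, c], dim=-1))
        return x                                     # (1, T, dim) acoustic latents

text_state = torch.randn(1, 256)                     # placeholder text-prompt embedding
sem = SemanticLM().generate(text_state, n_tokens=100)
latents = AcousticDiffusion().sample(sem)
print(latents.shape)                                 # torch.Size([1, 100, 64])
```

The efficiency argument is visible in the loops: the LM's cost grows with the number of tokens, whereas the diffusion sampler runs a small, fixed number of denoising passes over the whole sequence at once.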
Moreover, the authors propose the so-called dual-path diffusion (DPD) model instead of adopting the classic diffusion process. Working directly on the raw data would drastically increase the computational cost, so the proposed solution is to reduce the raw data to a low-dimensional latent representation. Reducing the dimensionality of the data lowers the cost of each operation and hence decreases the model's running time. Afterward, the raw audio can be reconstructed from the latent representation through a pre-trained autoencoder.
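As a rough illustration of why this helps, here is a toy latent round trip. The `AudioAE` class and its 320× compression factor are assumptions for illustration, not MeLoDy's actual autoencoder: the point is that diffusion would denoise the short latent sequence, and only the frozen decoder touches full-resolution audio.

```python
# Toy autoencoder round trip showing the dimensionality reduction; illustrative only.
import torch
import torch.nn as nn

class AudioAE(nn.Module):
    """Pre-trained (here: random) autoencoder: waveform <-> low-dimensional latents."""
    def __init__(self, hop=320, dim=64):
        super().__init__()
        self.enc = nn.Conv1d(1, dim, kernel_size=hop, stride=hop)       # ~320x shorter
        self.dec = nn.ConvTranspose1d(dim, 1, kernel_size=hop, stride=hop)

    def encode(self, wav):   # (B, 1, n_samples) -> (B, dim, n_samples // hop)
        return self.enc(wav)

    def decode(self, z):     # latents back to (B, 1, n_samples)
        return self.dec(z)

wav = torch.randn(1, 1, 32000)     # e.g. 2 s of 16 kHz audio (example numbers)
ae = AudioAE()
z = ae.encode(wav)                 # diffusion would operate here: (1, 64, 100)
print(z.shape, "vs", wav.shape)    # far fewer positions to denoise per step
recon = ae.decode(z)               # reconstruction after latent sampling
print(recon.shape)                 # torch.Size([1, 1, 32000])
```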
Some output samples produced by the model are available at the following link: https://efficient-melody.github.io/. The code has not yet been released, which means that, for the moment, it is not possible to try the model out, either online or locally.
This was a summary of MeLoDy, an efficient LM-guided diffusion model that generates music audio of state-of-the-art quality. If you are interested, you can learn more about this technique via the links below.
Check Out The Paper. Don't forget to join our 25k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more. If you have any questions regarding the above article or if we missed anything, feel free to email us at Asif@marktechpost.com
Daniele Lorenzi received his M.Sc. in ICT for Internet and Multimedia Engineering in 2021 from the University of Padua, Italy. He is a Ph.D. candidate at the Institute of Information Technology (ITEC) at the Alpen-Adria-Universität (AAU) Klagenfurt. He is currently working in the Christian Doppler Laboratory ATHENA, and his research interests include adaptive video streaming, immersive media, machine learning, and QoS/QoE evaluation.