
Music generation with deep learning involves training models to create musical compositions that imitate the patterns and structures of existing music. Commonly used architectures include recurrent neural networks (RNNs), long short-term memory (LSTM) networks, and transformers. This research explores an innovative approach to generating musical audio with non-autoregressive, transformer-based models that respond to musical context. The new paradigm emphasizes listening and responding, unlike existing models that rely on abstract conditioning. The study incorporates recent advances in the field and discusses the improvements made to the architecture.
Researchers from SAMI, ByteDance Inc., introduce a non-autoregressive, transformer-based model that listens and responds to musical context, leveraging a publicly available Encodec checkpoint for the MusicGen model. Evaluation combines standard metrics with a music information retrieval descriptor approach, namely Fréchet Audio Distance (FAD) and Music Information Retrieval Descriptor Distance (MIRDD). The resulting model demonstrates competitive audio quality and robust musical alignment with its context, validated through objective metrics and subjective Mean Opinion Score (MOS) tests.
The research highlights recent strides in end-to-end musical audio generation through deep learning, borrowing techniques from image and language processing. It emphasizes the challenge of aligning stems in music composition and critiques existing models that rely on abstract conditioning. It proposes a training paradigm, built on a non-autoregressive, transformer-based architecture, for models that respond to musical context. It introduces two conditioning sources and frames the problem as conditional generation. Objective metrics, music information retrieval descriptors, and listening tests are all used to evaluate the model.
The method uses a non-autoregressive, transformer-based model for music generation, with a residual vector quantizer in a separate audio encoding model. It combines multiple audio channels into a single sequence element through concatenated embeddings. Training employs a masking procedure, and classifier-free guidance is applied during token sampling to strengthen alignment with the audio context. Objective metrics, including Fréchet Audio Distance and Music Information Retrieval Descriptor Distance, assess model performance. Evaluation involves generating example outputs and comparing them with real stems using these metrics.
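To make the guidance step more concrete, here is a minimal sketch of classifier-free guidance extended to more than one conditioning source. It assumes the generic form in which each source gets its own weight and pushes the logits away from the unconditional prediction; the function name `guided_logits`, the weights, and the toy values are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np


def guided_logits(uncond, cond_per_source, weights):
    """Combine unconditional and per-source conditional logits.

    Assumed generic form (not taken from the paper):
        guided = uncond + sum_k w_k * (cond_k - uncond)
    A weight of 0 ignores a source; a weight above 1 exaggerates its effect.
    """
    guided = np.array(uncond, dtype=float)
    for cond, weight in zip(cond_per_source, weights):
        guided = guided + weight * (np.asarray(cond, dtype=float) - uncond)
    return guided


# Toy logits over a 3-token vocabulary (hypothetical values).
uncond = np.array([0.0, 1.0, 2.0])
cond_a = np.array([1.0, 1.0, 0.0])  # logits conditioned on source A only
cond_b = np.array([0.0, 3.0, 2.0])  # logits conditioned on source B only

print(guided_logits(uncond, [cond_a, cond_b], [2.0, 1.0]))
```

In a real sampler the guided logits would be passed through a softmax before tokens are drawn, and the conditional and unconditional logits would come from separate forward passes of the same model with the relevant conditioning dropped.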
The study evaluates the models using standard metrics and a music information retrieval descriptor approach, namely FAD and MIRDD. Comparison with real stems indicates that the models achieve audio quality comparable to state-of-the-art text-conditioned models and display strong musical coherence with their context. A Mean Opinion Score (MOS) test involving participants with music training further validates the model's ability to produce plausible musical outcomes. MIRDD, which assesses the distributional alignment of generated and real stems, provides a measure of musical coherence and alignment.
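For readers unfamiliar with FAD, the underlying quantity is the Fréchet distance between two Gaussians fitted to embeddings of real and generated audio: the squared distance between the means plus a term comparing the covariances. The sketch below computes that distance with NumPy. It assumes the embeddings have already been extracted by some pretrained audio model (random arrays stand in for them here), and the function name `frechet_distance` is illustrative.

```python
import numpy as np


def frechet_distance(real, generated):
    """Frechet distance between Gaussians fitted to two embedding sets.

    Each input has shape (num_clips, embedding_dim) with embedding_dim > 1.
    d^2 = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 * (S_r S_g)^(1/2))
    """
    mu_r, mu_g = real.mean(axis=0), generated.mean(axis=0)
    cov_r = np.cov(real, rowvar=False)
    cov_g = np.cov(generated, rowvar=False)
    # Tr((S_r S_g)^(1/2)) equals the sum of square roots of the eigenvalues
    # of S_r S_g; clip tiny negative values caused by floating-point error.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt)


# Stand-in embeddings (hypothetical): 200 clips, 4 dimensions.
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 4))
generated = real + np.array([1.0, 2.0, 0.0, 0.0])  # same spread, shifted mean

print(frechet_distance(real, generated))
```

Because the two sets here differ only by a constant shift, the covariance terms cancel and the distance reduces to the squared length of the shift. Lower values indicate that generated audio is distributed more like the reference set.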
In conclusion, the research can be summarized in the following points:
- The research proposes a new training approach for generative models that can respond to musical context.
- The approach introduces a non-autoregressive language model with a transformer backbone and two novel improvements: multi-source classifier-free guidance and causal bias during iterative decoding.
- The models achieve state-of-the-art audio quality by training on open-source and proprietary datasets.
- Standard metrics and a music information retrieval descriptor approach validate this audio quality.
- A Mean Opinion Score (MOS) test confirms the model's capability to generate realistic musical outcomes.
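The iterative decoding mentioned in the summary can be pictured with a small sketch. It assumes a confidence-based scheme in which all positions start masked, the model predicts every position at each step, and only the most confident predictions are kept. The causal bias is modeled here as a bonus that favors earlier positions when ranking confidence. The names `iterative_decode`, `predict_fn`, `MASK`, the linear unmasking schedule, and the greedy token choice are all assumptions for illustration rather than details from the paper.

```python
import numpy as np

MASK = -1  # hypothetical id marking a position that is still masked


def iterative_decode(predict_fn, seq_len, num_steps, causal_bias=0.0):
    """Fill a fully masked sequence over several confidence-ranked steps.

    predict_fn(tokens) must return probabilities of shape (seq_len, vocab).
    causal_bias > 0 adds a bonus that is largest for the earliest positions,
    so they tend to be fixed first.
    """
    tokens = np.full(seq_len, MASK, dtype=np.int64)
    position_bonus = causal_bias * (1.0 - np.arange(seq_len) / seq_len)
    for step in range(1, num_steps + 1):
        masked = tokens == MASK
        if not masked.any():
            break
        probs = predict_fn(tokens)
        candidates = probs.argmax(axis=-1)  # greedy choice, for simplicity
        confidence = probs.max(axis=-1) + position_bonus
        confidence[~masked] = -np.inf  # never re-rank fixed positions
        # Linear schedule: how many positions should be fixed after this step.
        target_fixed = int(np.ceil(seq_len * step / num_steps))
        num_to_fix = target_fixed - int((~masked).sum())
        if num_to_fix <= 0:
            continue
        chosen = np.argsort(-confidence)[:num_to_fix]
        tokens[chosen] = candidates[chosen]
    return tokens


def toy_model(tokens):
    """Stand-in predictor that always prefers token 2 with probability 0.6."""
    probs = np.full((len(tokens), 3), 0.2)
    probs[:, 2] = 0.6
    return probs


print(iterative_decode(toy_model, seq_len=8, num_steps=4, causal_bias=1.0))
```

With a real model, the confidence values would differ per position, and the bias would trade off between trusting the model's most confident guesses and filling the sequence roughly from left to right.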
Check out the Paper and Project. All credit for this research goes to the researchers of this project.
Hello, my name is Adnan Hassan. I'm a consulting intern at Marktechpost and soon to be a management trainee at American Express. I'm currently pursuing a dual degree at the Indian Institute of Technology, Kharagpur. I'm passionate about technology and want to create new products that make a difference.