Tiny Audio Diffusion: Waveform Diffusion That Doesn’t Require Cloud Computing
Background
Tiny Audio Diffusion
Model Architecture
Conclusion

The method of teaching a model to perform this denoising process can seem a bit counter-intuitive at first. The model actually learns to denoise a signal by doing the exact opposite: adding noise to a clean signal over and over again until only noise remains. The idea is that if the model can learn to predict the noise added to a signal at each step, then it can also predict the noise to remove at each step of the reverse process. The critical element that makes this possible is that the noise being added/removed must come from a defined probability distribution (typically Gaussian) so that the noising/denoising steps are predictable and repeatable.
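To make this concrete, here is a rough sketch of a single forward (noising) step in PyTorch. The linear schedule, shapes, and step count below are purely illustrative and are not the exact formulation used later in this post.

```python
import torch

def noising_step(x0: torch.Tensor, t: int, alphas_cumprod: torch.Tensor):
    """Add Gaussian noise to a clean signal x0 at diffusion step t.

    Returns the noised signal x_t and the noise eps that was added,
    which is what the model learns to predict during training.
    """
    eps = torch.randn_like(x0)                       # Gaussian noise
    a_bar = alphas_cumprod[t]                        # cumulative signal-retention factor
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps
    return x_t, eps

# Toy linear noise schedule with 1000 steps (illustrative only)
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

x0 = torch.randn(1, 2, 32768)        # stand-in for a clean stereo waveform chunk
x_t, eps = noising_step(x0, t=500, alphas_cumprod=alphas_cumprod)
# A denoising model would be trained to predict `eps` from (x_t, t),
# e.g. loss = F.mse_loss(model(x_t, t), eps)
```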

There is much more detail that goes into this process, but this should provide a solid conceptual understanding of what is going on under the hood. If you are interested in learning more about diffusion models (mathematical formulations, scheduling, latent space, etc.), I recommend reading this blog post by AssemblyAI and these papers (DDPM, Improving DDPM, DDIM, Stable Diffusion).

Understanding Audio for Machine Learning

My interest in diffusion stems from the potential it has shown for generative audio. Traditionally, to train ML algorithms, audio was converted into a spectrogram, which is essentially a heatmap of sound energy over time. This was because a spectrogram representation is similar to an image, which computers are exceptionally good at working with, and it was a significant reduction in data size compared with a raw waveform.
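For context, a magnitude spectrogram like the one shown below is computed with a short-time Fourier transform (STFT). Here is a minimal sketch; the FFT size and hop length are arbitrary choices for illustration:

```python
import torch

waveform = torch.randn(1, 44100)          # stand-in for 1 second of mono audio at 44.1kHz
window = torch.hann_window(1024)

# Complex STFT: shape (channels, freq_bins, time_frames)
stft = torch.stft(waveform, n_fft=1024, hop_length=256,
                  window=window, return_complex=True)

# Magnitude spectrogram (the "heatmap"), usually viewed on a dB scale
magnitude = stft.abs()
spectrogram_db = 20 * torch.log10(magnitude + 1e-8)
print(spectrogram_db.shape)               # torch.Size([1, 513, 173])
```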

Example Spectrogram of a Vocalist

However, this transformation comes with tradeoffs, including a reduction in resolution and a loss of phase information. The phase of an audio signal represents the position of multiple waveforms relative to one another. This can be demonstrated by the difference between a sine and a cosine function: they represent exactly the same signal in terms of amplitude; the only difference is a 90° (π/2 radian) phase shift between the two. For a more in-depth explanation of phase, check out this video by Akash Murthy.

90° phase shift between sin and cos
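A quick numerical check of that statement:

```python
import numpy as np

t = np.linspace(0, 1, 44100, endpoint=False)   # 1 second at 44.1kHz
f = 440.0                                      # A4 test tone

cosine = np.cos(2 * np.pi * f * t)

# cos(x) is just sin(x) advanced by 90 degrees (pi/2 radians)
shifted_sine = np.sin(2 * np.pi * f * t + np.pi / 2)
print(np.allclose(cosine, shifted_sine))       # True
```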

Phase is a perpetually difficult concept to grasp, even for those who work in audio, but it plays a critical role in creating the timbral qualities of sound. Suffice it to say that it should not be discarded so easily. Phase information can technically be represented in spectrogram form (the complex portion of the transform), just like magnitude. However, the result is noisy and appears visually random, making it difficult for a model to learn any useful information from it. Because of this drawback, there has been recent interest in skipping the spectrogram transformation altogether and instead leaving audio as a raw waveform to train models. While this brings its own set of challenges, both the amplitude and phase information are contained within the single signal of a waveform, providing a model with a more holistic picture of sound to learn from.
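To see why phase is so awkward to learn from, you can split the complex STFT into its magnitude and phase components; the phase looks essentially random, yet the waveform can only be reconstructed exactly when both parts are kept. A small illustrative sketch:

```python
import torch

waveform = torch.randn(1, 44100)                     # stand-in audio signal
window = torch.hann_window(1024)
stft = torch.stft(waveform, n_fft=1024, hop_length=256,
                  window=window, return_complex=True)

magnitude = stft.abs()        # smooth, structured, "image-like"
phase = stft.angle()          # wraps between -pi and pi; visually looks random

# The original waveform is only recoverable from BOTH parts together
reconstructed = torch.istft(torch.polar(magnitude, phase),
                            n_fft=1024, hop_length=256, window=window,
                            length=waveform.shape[-1])
print(torch.allclose(reconstructed, waveform, atol=1e-4))   # True
```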

Example Waveform of a Vocalist

This is a key piece of my interest in waveform diffusion, and it has shown promise in yielding high-quality results for generative audio. Waveforms, however, are very dense signals that require a large amount of data to represent the range of frequencies humans can hear. For example, the music industry's standard sampling rate is 44.1kHz, which means that 44,100 samples are required to represent just 1 second of mono audio. Now double that for stereo playback. Because of this, most waveform diffusion models (that don't leverage latent diffusion or other compression methods) require high GPU capacity (usually at least 16GB of VRAM) to store all of the data while training.
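A quick back-of-the-envelope calculation puts this in perspective (assuming float32 samples):

```python
SAMPLE_RATE = 44_100          # samples per second per channel
CHANNELS = 2                  # stereo
BYTES_PER_SAMPLE = 4          # float32, as typically used during training

samples_per_second = SAMPLE_RATE * CHANNELS
megabytes_per_second = samples_per_second * BYTES_PER_SAMPLE / 1e6

print(samples_per_second)       # 88200 samples for one second of stereo audio
print(megabytes_per_second)     # ~0.35 MB per second of raw data, before any
                                # activations or gradients the model must also hold
```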

Motivation

Many people do not have access to high-powered, high-capacity GPUs, or do not want to pay to rent cloud GPUs for personal projects. Finding myself in this position, but still wanting to explore waveform diffusion models, I decided to develop a waveform diffusion system that could run on my meager local hardware.

Hardware Setup

I was equipped with an HP Spectre laptop from 2017 with an 8th Gen i7 processor and a GeForce MX150 graphics card with 2GB of VRAM, which is not what you would call a powerhouse for training ML models. My goal was to be able to create a model that could train and produce high-quality (44.1kHz) stereo outputs on this machine.

I leveraged Archinet's audio-diffusion-pytorch library to build this model. Thanks to Flavio Schneider for his help working with this library, which he largely built.

Attention U-Net

The base model architecture consists of a U-Net with attention blocks, which is standard for modern diffusion models. A U-Net is a neural network that was originally developed for image (2D) segmentation but has been adapted here to audio (1D) for waveform diffusion. The U-Net architecture gets its name from its U-shaped design.

U-Net (Source: U-Net: Convolutional Networks for Biomedical Image Segmentation, Ronneberger et al.)

Much like an autoencoder, which consists of an encoder and a decoder, a U-Net also incorporates skip connections at each level of the network. These skip connections are direct connections between corresponding layers of the encoder and decoder, facilitating the transfer of fine-grained details from the encoder to the decoder. The encoder is responsible for capturing the essential features of the input signal, while the decoder is responsible for generating the new audio sample. The encoder gradually reduces the resolution of the input audio, extracting features at different levels of abstraction. The decoder then takes these features and upsamples them, gradually increasing the resolution to generate the final audio sample.
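To illustrate the idea, here is a deliberately tiny 1D U-Net with two levels and skip connections. This is a toy sketch for intuition only, not the architecture actually used in this project (it omits the attention blocks and timestep conditioning entirely):

```python
import torch
import torch.nn as nn

class TinyUNet1d(nn.Module):
    """Toy 1D U-Net: two encoder levels, a bottleneck, two decoder levels."""

    def __init__(self, channels: int = 2, width: int = 32):
        super().__init__()
        self.enc1 = nn.Conv1d(channels, width, kernel_size=3, stride=2, padding=1)
        self.enc2 = nn.Conv1d(width, width * 2, kernel_size=3, stride=2, padding=1)
        self.mid = nn.Conv1d(width * 2, width * 2, kernel_size=3, padding=1)
        self.dec2 = nn.ConvTranspose1d(width * 4, width, kernel_size=4, stride=2, padding=1)
        self.dec1 = nn.ConvTranspose1d(width * 2, channels, kernel_size=4, stride=2, padding=1)
        self.act = nn.SiLU()

    def forward(self, x):
        e1 = self.act(self.enc1(x))                           # downsample by 2
        e2 = self.act(self.enc2(e1))                          # downsample by 2 again
        m = self.act(self.mid(e2))                            # bottleneck
        d2 = self.act(self.dec2(torch.cat([m, e2], dim=1)))   # skip connection from enc2
        d1 = self.dec1(torch.cat([d2, e1], dim=1))            # skip connection from enc1
        return d1

x = torch.randn(1, 2, 32768)      # stereo waveform chunk
print(TinyUNet1d()(x).shape)      # torch.Size([1, 2, 32768])
```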

Attention U-Net (Source: Attention U-Net: Learning Where to Look for the Pancreas, Oktay et al.)

This U-Net also incorporates self-attention blocks at the lower levels, which help maintain the temporal consistency of the output. It is critical that the audio be downsampled sufficiently to keep sampling efficient during the diffusion process and to avoid overloading the attention blocks. The model leverages V-Diffusion, a diffusion technique inspired by DDIM sampling.
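For reference, the "V" refers to predicting a velocity-like target that mixes the clean signal and the noise. The sketch below is my simplified paraphrase of that objective using the common cosine parameterization, not the library's exact implementation:

```python
import torch

def v_diffusion_targets(x0: torch.Tensor, t: torch.Tensor):
    """Simplified v-diffusion pair: noised input x_t and velocity target v.

    Uses alpha = cos(t*pi/2), sigma = sin(t*pi/2) with t in [0, 1];
    the model is trained to predict v from (x_t, t).
    """
    eps = torch.randn_like(x0)
    angle = t * torch.pi / 2
    alpha, sigma = torch.cos(angle), torch.sin(angle)
    x_t = alpha * x0 + sigma * eps          # forward (noising) process
    v = alpha * eps - sigma * x0            # velocity target
    return x_t, v

x0 = torch.randn(4, 2, 32768)               # batch of clean stereo chunks
t = torch.rand(4).view(-1, 1, 1)            # one diffusion time per example
x_t, v = v_diffusion_targets(x0, t)
# Training step (pseudocode): loss = F.mse_loss(model(x_t, t), v)
```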

To avoid running out of GPU VRAM, the length of the data that the base model was trained on needed to be short. For this reason, I decided to train on one-shot drum samples because of their inherently short context lengths. After many iterations, the base model length was determined to be 32,768 samples @ 44.1kHz in stereo, which results in roughly 0.75 seconds. This may seem particularly short, but it is plenty of time for most drum samples.
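Putting the pieces together, a model along these lines can be set up with audio-diffusion-pytorch roughly as follows. This sketch follows the library's public README; the argument names can differ between versions, and the channel, factor, and attention values below are illustrative placeholders rather than the actual tiny-audio-diffusion configuration (which lives in the repository's config file):

```python
import torch
from audio_diffusion_pytorch import DiffusionModel, UNetV0, VDiffusion, VSampler

model = DiffusionModel(
    net_t=UNetV0,                         # attention U-Net backbone
    in_channels=2,                        # stereo waveform in/out
    channels=[32, 64, 128, 256, 512],     # widths per level (illustrative)
    factors=[2, 2, 2, 2, 2],              # downsampling factor at each level (illustrative)
    items=[2, 2, 2, 2, 2],                # blocks per level (illustrative)
    attentions=[0, 0, 0, 1, 1],           # attention only at the lowest levels (illustrative)
    attention_heads=8,
    attention_features=64,
    diffusion_t=VDiffusion,               # v-diffusion objective
    sampler_t=VSampler,                   # DDIM-inspired sampler
)

# One training step on a batch of 0.75-second stereo drum one-shots
audio = torch.randn(2, 2, 32768)          # [batch, channels, samples]
loss = model(audio)
loss.backward()

# Generation: start from pure noise and denoise it into a new sample
noise = torch.randn(1, 2, 32768)
sample = model.sample(noise, num_steps=50)
```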

Transforms

To downsample the audio enough for the attention blocks, several pre-processing transforms were attempted. The hope was that if the audio data could be downsampled without losing significant information prior to training the model, then the number of nodes (neurons) and layers could be maximized without increasing the GPU memory load.

The first transform attempted was a version of "patching." Originally proposed for images, this process was adapted to audio for our purposes. The input audio sample is grouped by sequential time steps into chunks that are then transposed into channels. This process can then be reversed on the output of the U-Net to un-chunk the audio back to its full length. The un-chunking process created aliasing issues, however, resulting in undesirable high-frequency artifacts in the generated audio.
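A minimal version of this patching operation (and its inverse) can be written as a pair of reshapes; this is my own illustrative sketch rather than the exact transform used in the experiments:

```python
import torch

def patch(x: torch.Tensor, patch_size: int) -> torch.Tensor:
    """Fold groups of `patch_size` consecutive samples into the channel dimension.

    (batch, channels, length) -> (batch, channels * patch_size, length // patch_size)
    """
    b, c, t = x.shape
    assert t % patch_size == 0
    return (x.reshape(b, c, t // patch_size, patch_size)
             .permute(0, 1, 3, 2)
             .reshape(b, c * patch_size, t // patch_size))

def unpatch(x: torch.Tensor, patch_size: int) -> torch.Tensor:
    """Inverse of `patch`: unfold the channels back into the time dimension."""
    b, cp, t = x.shape
    c = cp // patch_size
    return (x.reshape(b, c, patch_size, t)
             .permute(0, 1, 3, 2)
             .reshape(b, c, t * patch_size))

x = torch.randn(1, 2, 32768)
patched = patch(x, patch_size=16)            # torch.Size([1, 32, 2048])
print(torch.equal(unpatch(patched, 16), x))  # True
```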

The second transform attempted, proposed by Schneider, is known as a "Learned Transform" and consists of single convolutional blocks with large kernel sizes and strides at the beginning and end of the U-Net. Multiple kernel sizes and strides were attempted (16, 32, 64), coupled with accompanying model variations, to appropriately downsample the audio. Again, however, this resulted in aliasing issues in the generated audio, though not as prevalent as with the patching transform.
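In code, such a learned transform amounts to a strided Conv1d at the input of the U-Net and a matching transposed convolution at the output. The kernel size and channel counts below are placeholders for illustration:

```python
import torch
import torch.nn as nn

class LearnedTransform(nn.Module):
    """Single conv block that downsamples the waveform before the U-Net
    and a transposed conv that upsamples it back afterwards."""

    def __init__(self, channels: int = 2, hidden: int = 64, stride: int = 32):
        super().__init__()
        self.encode = nn.Conv1d(channels, hidden, kernel_size=stride, stride=stride)
        self.decode = nn.ConvTranspose1d(hidden, channels, kernel_size=stride, stride=stride)

    def forward(self, x):
        z = self.encode(x)        # (1, 64, 1024) for a (1, 2, 32768) input
        # ... the U-Net would operate on z here ...
        return self.decode(z)     # back to (1, 2, 32768)

x = torch.randn(1, 2, 32768)
print(LearnedTransform()(x).shape)    # torch.Size([1, 2, 32768])
```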

Because of this, I decided that the model architecture would have to be adjusted to accommodate raw audio without any pre-processing transforms in order to produce outputs of sufficient quality.

This required extending the number of layers within the U-Net to avoid downsampling too quickly and losing essential features along the way. After multiple iterations, the best architecture resulted in downsampling by only 2 at each layer. While this required a reduction in the number of nodes per layer, it ultimately produced the best results. Detailed information about the exact number of U-Net levels, layers, nodes, attention features, etc. can be found in the configuration file in the tiny-audio-diffusion repository on GitHub.
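As a quick sanity check on what downsampling by only 2 per level means for a 32,768-sample input (the number of levels shown here is arbitrary):

```python
length = 32768
for level in range(1, 8):
    length //= 2                 # each U-Net level halves the temporal resolution
    print(f"level {level}: {length} time steps")
# level 1: 16384, level 2: 8192, ... level 7: 256
```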

Pre-Trained Models

I trained four separate unconditional models to produce kicks, snares, hi-hats, and percussion (all drum sounds). The datasets used for training were small, free one-shot samples that I had collected for my music production workflows (all open-source). Larger, more varied datasets would improve the quality and variety of each model's generated outputs. The models were trained for varying numbers of steps and epochs depending on the size of each dataset.

Pre-trained models are available for download on Hugging Face. See the training progress and output samples logged at Weights & Biases.

Results

Overall, the quality of the output is quite high despite the reduced size of the models. However, there is still some slight high-frequency "hiss" remaining, which is likely due to the limited size of the model. This can be seen in the small amount of noise remaining in the waveforms below. Most generated samples are crisp, maintaining transients and broadband timbral characteristics. Sometimes the models add extra noise toward the end of the sample, which is likely a cost of the limited number of layers and nodes in the model.

Listen to some output samples from the models here. Example outputs from each model are shown below.
