
This is the third article on spoken language recognition based on the Mozilla Common Voice dataset. In Part I, we discussed data selection and data preprocessing, and in Part II we analysed the performance of several neural network classifiers.
The final model achieved 92% accuracy and 97% pairwise accuracy. Since this model suffers from somewhat high variance, its accuracy could potentially be improved by adding more data. A common way to get extra data is to synthesize it by performing various transformations on the available dataset.
In this article, we will consider 5 popular transformations for audio data augmentation: adding noise, changing speed, changing pitch, time masking, and cut & splice.
The tutorial notebook can be found here.
For illustration purposes, we will use the sample common_voice_en_100040 from the Mozilla Common Voice (MCV) dataset. It is the sentence The burning fire had been extinguished.
import librosa as lr
import IPython
import numpy as np

signal, sr = lr.load('./transformed/common_voice_en_100040.wav', res_type='kaiser_fast') #load signal
IPython.display.Audio(signal, rate=sr)
Adding noise

Adding noise is the simplest audio augmentation. The amount of noise is characterised by the signal-to-noise ratio (SNR): the ratio between the maximal signal amplitude and the standard deviation of the noise. We will generate several noise levels, defined via SNR, and see how they modify the signal.
SNRs = (5, 10, 100, 1000) #signal-to-noise ratio: max amplitude over noise std
noisy_signal = {}
for snr in SNRs:
    noise_std = max(abs(signal))/snr #get noise std
    noise = noise_std*np.random.randn(len(signal)) #generate noise with given std
    noisy_signal[snr] = signal + noise
IPython.display.display(IPython.display.Audio(noisy_signal[5], rate=sr))
IPython.display.display(IPython.display.Audio(noisy_signal[1000], rate=sr))
So, SNR=1000 sounds almost like the unperturbed audio, while at SNR=5 one can only distinguish the strongest parts of the signal. In practice, the SNR level is a hyperparameter that depends on the dataset and the chosen classifier.
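In a training pipeline, the noise level is typically drawn at random for each sample. Below is a minimal sketch of such a helper; the name add_random_noise and the log-uniform sampling of the SNR are illustrative choices, not part of the original pipeline.

def add_random_noise(signal, snr_range=(5, 1000)):
    #draw an SNR uniformly on a log scale between the given bounds (illustrative choice)
    snr = np.exp(np.random.uniform(np.log(snr_range[0]), np.log(snr_range[1])))
    noise_std = max(abs(signal))/snr #same definition of noise std as above
    return signal + noise_std*np.random.randn(len(signal))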
Changing speed

The easiest way to change the speed is simply to pretend that the signal has a different sample rate. However, this will also change the pitch (how low/high in frequency the audio sounds). Increasing the sampling rate makes the voice sound higher. To illustrate this, we will "increase" the sampling rate of our example by a factor of 1.5:
IPython.display.Audio(signal, rate=sr*1.5)
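Note that a rate factor r shifts the pitch by 12*log2(r) semitones, so the factor of 1.5 raises the voice by roughly 7 semitones.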
Changing the speed without affecting the pitch is more difficult. One needs to use the Phase Vocoder (PV) algorithm. Briefly, the input signal is first split into overlapping frames. Then, the spectrum inside each frame is computed by applying the Fast Fourier Transform (FFT). The playing speed is then modified by resynthesizing the frames at a different rate. Since the frequency content of each frame is not affected, the pitch stays the same. The PV interpolates between the frames and uses the phase information to achieve smoothness.
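To make the algorithm concrete, here is a minimal sketch of a PV-style time stretch; the function name pv_stretch_sketch and the frame/hop sizes are arbitrary choices, and this is a simplified stand-in rather than the implementation linked below.

def pv_stretch_sketch(x, rate, n_fft=2048, hop=512):
    window = np.hanning(n_fft)
    #analysis: FFT of overlapping windowed frames
    n_frames = 1 + (len(x) - n_fft)//hop
    stft = np.array([np.fft.rfft(window*x[i*hop:i*hop + n_fft]) for i in range(n_frames)])
    omega = 2*np.pi*np.arange(n_fft//2 + 1)*hop/n_fft #expected phase advance per hop
    positions = np.arange(0, n_frames - 1, rate) #fractional frame positions at the new rate
    phase = np.angle(stft[0])
    out = np.zeros(len(positions)*hop + n_fft)
    for k, pos in enumerate(positions):
        i, frac = int(pos), pos - int(pos)
        mag = (1 - frac)*np.abs(stft[i]) + frac*np.abs(stft[i + 1]) #interpolate magnitudes
        dphi = np.angle(stft[i + 1]) - np.angle(stft[i]) - omega #measured phase increment
        dphi -= 2*np.pi*np.round(dphi/(2*np.pi)) #wrap to [-pi, pi]
        out[k*hop:k*hop + n_fft] += window*np.fft.irfft(mag*np.exp(1j*phase)) #overlap-add
        phase += omega + dphi #accumulate phase for smooth resynthesis
    return out #(overlap-add amplitude normalization omitted for brevity)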
For our experiments, we will use the stretch_wo_loop time-stretching function from this PV implementation.
stretching_factor = 1.3
signal_stretched = stretch_wo_loop(signal, stretching_factor)
IPython.display.Audio(signal_stretched, rate=sr)
So, the duration of the signal decreased because we increased the speed. However, one can hear that the pitch has not changed. Note that when the stretching factor is substantial, the phase interpolation between frames won't work well. As a result, echo artefacts may appear in the transformed audio.
Changing pitch

To change the pitch without affecting the speed, we can use the same PV time stretch but pretend that the signal has a different sampling rate, such that the total duration of the signal stays the same:
IPython.display.Audio(signal_stretched, rate=sr/stretching_factor)
Why do we even bother with this PV when librosa already has time_stretch and pitch_shift functions? Well, these functions transform the signal back to the time domain. If you need to compute embeddings afterwards, you will lose time on redundant Fourier transforms. However, it is easy to modify the stretch_wo_loop function so that it yields the Fourier output without taking the inverse transform. One could probably also dig into the librosa code to achieve similar results.
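For instance, librosa also exposes a frequency-domain phase vocoder, so in principle one can stay in the spectral domain; a sketch (not what the tutorial notebook does):

D = lr.stft(signal) #complex spectrogram
D_fast = lr.phase_vocoder(D, rate=1.3) #time-stretched spectrogram, no inverse transform needed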
The next two transformations were originally proposed in the frequency domain (Park et al., 2019). The idea was to save time on FFT by using precomputed spectra for audio augmentation. For simplicity, we will demonstrate how these transformations work in the time domain. The listed operations can easily be transferred to the frequency domain by replacing the time axis with frame indices.
Time masking
The idea of time masking is to mask out a random region of the signal. The neural network then has fewer chances to learn signal-specific temporal variations that do not generalize.
max_mask_length = 0.3 #maximum mask duration, proportion of signal length
L = len(signal)
mask_length = int(L*np.random.rand()*max_mask_length) #randomly select mask length
mask_start = int((L-mask_length)*np.random.rand()) #randomly select mask position
masked_signal = signal.copy()
masked_signal[mask_start:mask_start+mask_length] = 0
IPython.display.Audio(masked_signal, rate=sr)
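As noted above, the same transformation can be done in the frequency domain by masking frame indices instead of samples. A sketch, assuming a librosa STFT of the signal:

spec = lr.stft(signal) #complex spectrogram, shape (freq_bins, n_frames)
n_frames = spec.shape[1]
mask_frames = int(n_frames*np.random.rand()*max_mask_length) #randomly select mask length in frames
mask_start = int((n_frames - mask_frames)*np.random.rand()) #randomly select mask position
spec[:, mask_start:mask_start + mask_frames] = 0 #zero out the selected frames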
Cut & splice
The idea is to replace a randomly chosen region of the signal with a random fragment from another signal having the same label. The implementation is almost the same as for time masking, except that a piece of another signal is placed instead of the mask.
other_signal, sr = lr.load('./common_voice_en_100038.wav', res_type='kaiser_fast') #load second signal
max_fragment_length = 0.3 #maximum fragment duration, proportion of signal length
L = min(len(signal), len(other_signal))
mask_length = int(L*np.random.rand()*max_fragment_length) #randomly select mask length
mask_start = int((L-mask_length)*np.random.rand()) #randomly select mask position
synth_signal = signal.copy()
synth_signal[mask_start:mask_start+mask_length] = other_signal[mask_start:mask_start+mask_length]
IPython.display.Audio(synth_signal, rate=sr)
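In a real pipeline, the donor recording should be drawn at random from the samples sharing the label of the augmented signal. A hypothetical wrapper; the name cut_and_splice and the representation of samples as (signal, label) pairs are assumptions for illustration.

def cut_and_splice(signal, label, samples, max_fragment_length=0.3):
    donors = [s for s, l in samples if l == label] #only splice signals with the same label
    other = donors[np.random.randint(len(donors))] #pick a random donor recording
    L = min(len(signal), len(other))
    fragment_length = int(L*np.random.rand()*max_fragment_length)
    start = int((L - fragment_length)*np.random.rand())
    out = signal.copy()
    out[start:start + fragment_length] = other[start:start + fragment_length]
    return out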