Spoken language recognition on Mozilla Common Voice — Part II: Models.
Photo by Jonathan Velasquez on Unsplash

This is the second article on spoken language recognition based on the Mozilla Common Voice dataset. In the first part we discussed data selection and chose the optimal embedding. Let us now train several models and select the best one.

Model comparison

We will now train and evaluate the following models on the full data (40K samples; see the first part for more details on data selection and preprocessing):

· Convolutional neural network (CNN) model. We simply treat the language classification problem as the classification of 2-dimensional images. CNN-based classifiers showed promising results in a language recognition TopCoder competition.

CNN architecture (image by the author, created with PlotNeuralNet)

· CRNN model from Bartz et al. 2017. A CRNN combines the descriptive power of CNNs with the ability of RNNs to capture temporal features.

CRNN architecture (image from Bartz et al., 2017)

· CRNN model from Alashban et al. 2022. This is just another variation of the CRNN architecture.

· AttNN: model from De Andrade et al. 2018. This model was initially proposed for speech recognition and subsequently applied to spoken language recognition in the Intelligent Museum project. In addition to convolutional and LSTM units, this model has an attention block that is trained to weigh parts of the input sequence (namely the frames on which the Fourier transform is computed) according to their relevance for classification (a minimal sketch of such an attention block is given after this list).

· CRNN* model: same architecture as AttNN, but without the attention block.

· Time-delay neural network (TDNN) model. The model we test here was used to generate X-vector embeddings for spoken language recognition in Snyder et al. 2018. In our study, we bypass X-vector generation and directly train the network to classify languages.
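
To make the attention mechanism of AttNN more concrete, here is a minimal PyTorch sketch of an attention block that weighs the frames of a feature sequence (e.g. LSTM outputs). Layer sizes, the choice of the last frame as the query, and all names are illustrative; this is not the exact architecture of De Andrade et al. 2018.

```python
import torch
import torch.nn as nn

class FrameAttention(nn.Module):
    """Attention over the time axis of a (batch, time, features) sequence."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.query_proj = nn.Linear(feat_dim, feat_dim)

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        # seq: (batch, time, feat_dim), e.g. the output of an LSTM layer
        query = self.query_proj(seq[:, -1, :])        # query from the last frame: (batch, feat_dim)
        scores = torch.bmm(seq, query.unsqueeze(2))   # dot-product score per frame: (batch, time, 1)
        weights = torch.softmax(scores, dim=1)        # attention weights over frames
        context = (weights * seq).sum(dim=1)          # weighted sum of frames: (batch, feat_dim)
        return context

# Example: 16 clips, 150 spectrogram frames, 64 features per frame
features = torch.randn(16, 150, 64)
context = FrameAttention(64)(features)                # (16, 64), fed to the classification head
```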

All models were trained with the same train/val/test split and the same mel spectrogram embeddings based on the first 13 mel filterbank coefficients. The models can be found here.
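
As a reminder of the embedding choice from Part I, a mel spectrogram restricted to the first 13 mel bands can be computed, for example, with librosa. The window and hop sizes below are illustrative values, not necessarily those used in this study:

```python
import librosa
import numpy as np

def mel_embedding(path: str, sr: int = 16000, n_mels: int = 13) -> np.ndarray:
    """Load a clip and return a (n_mels, n_frames) log-mel spectrogram."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                         hop_length=160, n_mels=n_mels)
    return librosa.power_to_db(mel)  # log scale is a common choice for NN inputs

# embedding = mel_embedding("clip.mp3")   # shape: (13, n_frames)
```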

The resulting learning curves on the validation set are shown in the figure below (each “epoch” refers to 1/8 of the dataset).

Performance of the different models on the Mozilla Common Voice dataset (image by the author).

The following table shows the mean and standard deviation of the accuracy over 10 runs.

Accuracy for each model (image by the author)

It can be clearly seen that AttNN, TDNN, and our CRNN* model perform similarly, with AttNN scoring first at 92.4% accuracy. On the other hand, CRNN (Bartz et al. 2017), CNN, and CRNN (Alashban et al. 2022) showed very modest performance, with CRNN (Alashban et al. 2022) closing the list at only 58.5% accuracy.

We then trained the winning AttNN model on the train and val sets and evaluated it on the test set. The test accuracy of 92.4% (92.4% for male and 92.3% for female speakers) turned out to be close to the validation accuracy, which indicates that the model did not overfit to the validation set.

To understand the performance differences between the evaluated models, we first note that TDNN and AttNN were specifically designed for speech recognition tasks and had already been tested on previous benchmarks. This might be the reason why these models come out on top.

The performance gap between AttNN and our CRNN* model (the same architecture but without the attention block) demonstrates the relevance of the attention mechanism for spoken language recognition. The CRNN model from Bartz et al. 2017 performs worse despite its similar architecture. This might simply be because the default model hyperparameters are not optimal for the MCV dataset.

The CNN model does not possess any dedicated memory mechanism and comes next. Strictly speaking, the CNN has some notion of memory, since computing a convolution involves a fixed number of consecutive frames. Higher layers thus encapsulate information from even longer time intervals due to the hierarchical nature of CNNs. In fact, the TDNN model, which scored second, can be viewed as a 1-D CNN. So, with more time invested in CNN architecture search, the CNN model might have performed close to the TDNN.
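
To illustrate the 1-D CNN view: the first X-vector layer in Snyder et al. 2018 looks at the frame context {t-2, ..., t+2}, which corresponds to a 1-D convolution with kernel size 5 over the frame axis (wider, sparse contexts correspond to dilated convolutions). The sketch below is illustrative only; the input dimension of 13 matches the mel bands used here, not the features of the original paper.

```python
import torch
import torch.nn as nn

# One TDNN layer as a plain 1-D convolution: each output frame depends on 5 consecutive input frames
tdnn_layer1 = nn.Sequential(
    nn.Conv1d(in_channels=13, out_channels=512, kernel_size=5, dilation=1),
    nn.ReLU(),
    nn.BatchNorm1d(512),
)

frames = torch.randn(8, 13, 300)   # (batch, mel bands, frames)
out = tdnn_layer1(frames)          # (8, 512, 296)
```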

The CRNN model from Alashban et al. 2022 surprisingly shows the worst accuracy. Interestingly, this model was initially designed to recognize languages in MCV and reached an accuracy of about 97%, as reported in the original study. Since the original code is not publicly available, it is difficult to determine the source of this large discrepancy.

Pairwise accuracy

In many cases a user regularly speaks no more than two languages. In this case, a more appropriate metric of model performance is pairwise accuracy, which is simply the accuracy computed on a given pair of languages, ignoring the scores for all other languages.
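
In other words, for a given language pair we keep only the test clips of those two languages and, for each clip, compare only the model scores of these two classes. A minimal sketch (all variable names are illustrative):

```python
import numpy as np

def pairwise_accuracy(scores: np.ndarray, labels: np.ndarray,
                      lang_a: int, lang_b: int) -> float:
    """Accuracy on the (lang_a, lang_b) pair, ignoring all other classes.

    scores: (n_samples, n_languages) model outputs; labels: (n_samples,) true class ids.
    """
    mask = np.isin(labels, [lang_a, lang_b])          # keep only clips of the two languages
    pair_scores = scores[mask][:, [lang_a, lang_b]]   # compare only the two class scores
    pred = np.where(pair_scores[:, 0] >= pair_scores[:, 1], lang_a, lang_b)
    return float(np.mean(pred == labels[mask]))

# Average over all 10 pairs of the 5 languages:
# pairs = [(i, j) for i in range(5) for j in range(i + 1, 5)]
# mean_pairwise = np.mean([pairwise_accuracy(scores, labels, i, j) for i, j in pairs])
```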

The pairwise accuracy for the AttNN model on the test set is shown in the table below, next to the confusion matrix, with the recall for individual languages on the diagonal. The average pairwise accuracy is 97%. Pairwise accuracy will always be higher than overall accuracy since only two languages need to be distinguished.

Confusion matrix (left) and pairwise accuracy (right) of the AttNN model (image by the author).

So, the model distinguishes best between German (de) and Spanish (es), as well as between French (fr) and English (en) (98%). This is not surprising, since the sound systems of these languages are quite different.
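
As a side note on where the recall values on the diagonal come from: normalizing each row of the confusion matrix by the number of true samples of that language puts the per-language recall on the diagonal. A sketch with scikit-learn, using random placeholder labels and predictions rather than the actual model outputs:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

langs = ["de", "en", "es", "fr", "ru"]
# Placeholder data for illustration only; y_true / y_pred would be the test labels and AttNN predictions
rng = np.random.default_rng(0)
y_true = rng.integers(0, len(langs), size=1000)
y_pred = np.where(rng.random(1000) < 0.9, y_true, rng.integers(0, len(langs), size=1000))

cm = confusion_matrix(y_true, y_pred, normalize="true")  # rows sum to 1, so the diagonal is per-language recall
print(np.round(cm, 2))
```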

Although we used softmax loss to train the model, it was previously reported that higher accuracy might be achieved in pairwise classification with tuplemax loss (Wan et al. 2019).

To test the effect of tuplemax loss, we retrained our model after implementing tuplemax loss in PyTorch (see here for the implementation). The figure below compares the effect of softmax loss and tuplemax loss on accuracy and on pairwise accuracy when evaluated on the validation set.
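
For reference, tuplemax loss as described in Wan et al. 2019 averages binary softmax terms between the target language and every other candidate language in the tuple. Below is a minimal PyTorch sketch of this idea (not necessarily identical to the implementation linked above), here taking the tuple to be all classes:

```python
import torch

def tuplemax_loss(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Average of binary softmax losses between the target class and each competitor.

    logits: (batch, n_classes); target: (batch,) class indices.
    """
    batch, n_classes = logits.shape
    target_logit = logits.gather(1, target.unsqueeze(1))                   # (batch, 1)
    # log( e^{s_y} / (e^{s_y} + e^{s_k}) ) for every class k
    pair_log_prob = target_logit - torch.logaddexp(target_logit, logits)   # (batch, n_classes)
    # drop the target-vs-target term and average over the n-1 competitors
    mask = torch.ones_like(pair_log_prob).scatter_(1, target.unsqueeze(1), 0.0)
    loss = -(pair_log_prob * mask).sum(dim=1) / (n_classes - 1)
    return loss.mean()

# logits = torch.randn(32, 5); target = torch.randint(0, 5, (32,))
# loss = tuplemax_loss(logits, target)
```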

Accuracy and pairwise accuracy of the AttNN model trained with softmax and tuplemax loss (image by the author).

As can be seen, tuplemax loss performs worse whether overall accuracy (paired t-test, p-value = 0.002) or pairwise accuracy (paired t-test, p-value = 0.2) is compared.
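
The p-values above come from paired t-tests over matched runs. Assuming the per-run accuracies under each loss are collected in two lists, such a test can be computed with scipy (the numbers below are illustrative placeholders, not the actual results):

```python
from scipy.stats import ttest_rel

# Accuracy of the same 10 runs under each loss (placeholder values for illustration)
softmax_acc  = [0.924, 0.921, 0.925, 0.919, 0.923, 0.926, 0.920, 0.922, 0.924, 0.921]
tuplemax_acc = [0.912, 0.910, 0.915, 0.908, 0.911, 0.914, 0.909, 0.913, 0.912, 0.910]

t_stat, p_value = ttest_rel(softmax_acc, tuplemax_acc)  # paired t-test over runs
print(f"paired t-test: t={t_stat:.2f}, p={p_value:.4f}")
```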

In fact, even the original study does not clearly explain why tuplemax loss should perform better. Here is the example the authors give:

Explanation of tuplemax loss (image from Wan et al., 2019)

The absolute value of the loss does not actually mean much. With enough training iterations, this example might be classified correctly with either loss.

In any case, tuplemax loss is not a universal solution, and the choice of loss function should be carefully evaluated for each given problem.

Conclusion

We reached 92% accuracy and 97% pairwise accuracy in spoken language recognition of short audio clips from the Mozilla Common Voice (MCV) dataset, considering German, English, Spanish, French, and Russian.

In a preliminary study comparing mel spectrogram, MFCC, RASTA-PLP, and GFCC embeddings, we found that mel spectrograms with the first 13 filterbank coefficients resulted in the best recognition accuracy.

We next compared the generalization performance of six neural network models: CNN, CRNN (Bartz et al. 2017), CRNN (Alashban et al. 2022), AttNN (De Andrade et al. 2018), CRNN*, and TDNN (Snyder et al. 2018). Among all the models, AttNN showed the best performance, which highlights the importance of LSTM and attention blocks for spoken language recognition.

Finally, we computed the pairwise accuracy and studied the effect of tuplemax loss. It turns out that tuplemax loss degrades both accuracy and pairwise accuracy compared to softmax.

In conclusion, our results constitute a new benchmark for spoken language recognition on the Mozilla Common Voice dataset. Better results could be achieved in future studies by combining different embeddings and by extensively investigating promising neural network architectures, e.g. transformers.

In Part III we’ll discuss which audio transformations might help to enhance model performance.

References

  • Alashban, Adal A., et al. “Spoken language identification system using convolutional recurrent neural network.” Applied Sciences 12.18 (2022): 9181.
  • Bartz, Christian, et al. “Language identification using deep convolutional recurrent neural networks.” Neural Information Processing: 24th International Conference, ICONIP 2017, Guangzhou, China, November 14–18, 2017, Proceedings, Part VI. Springer International Publishing, 2017.
  • De Andrade, Douglas Coimbra, et al. “A neural attention model for speech command recognition.” arXiv preprint arXiv:1808.08929 (2018).
  • Snyder, David, et al. “Spoken language recognition using x-vectors.” Odyssey. Vol. 2018. 2018.
  • Wan, Li, et al. “Tuplemax loss for language identification.” ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019.
