
The k-NN classifier was trained with the number of neighbors set to 1 (i.e. 1-NN), uniform weights, automatic algorithm selection and the Minkowski metric.
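A minimal sketch of such a classifier configuration is given below, assuming scikit-learn's KNeighborsClassifier; the placeholder data arrays and the p=2 (Euclidean) setting of the Minkowski metric are assumptions, not details taken from the thesis.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Placeholder data standing in for the OpenL3 embedding features and scene labels.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 512))
y_train = rng.integers(0, 10, size=100)
X_test = rng.normal(size=(20, 512))

# 1-NN with uniform weights, automatic algorithm selection and the Minkowski
# metric; p=2 (Euclidean distance) is scikit-learn's default and an assumption here.
knn = KNeighborsClassifier(n_neighbors=1, weights="uniform",
                           algorithm="auto", metric="minkowski", p=2)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
```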

Since the results varied slightly between program executions, the program was run five times and the results were then averaged. The average overall accuracy was 76 %. The confusion matrix of the averaged per-class accuracies is shown in figure 3.2.

Figure 3.2. Confusion matrix for the average k-NN predictions.
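As a rough sketch of the evaluation procedure just described, the averaging over runs could look as follows; the run_once helper and the row-normalization of the confusion matrix are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

def evaluate_over_runs(run_once, y_true, n_runs=5, n_classes=10):
    """Average the overall accuracy and the row-normalized confusion matrix
    over several runs of the classifier."""
    accs, cms = [], []
    for _ in range(n_runs):
        y_pred = run_once()  # one full train/predict cycle (hypothetical helper)
        accs.append(accuracy_score(y_true, y_pred))
        cm = confusion_matrix(y_true, y_pred, labels=list(range(n_classes)))
        cms.append(cm / cm.sum(axis=1, keepdims=True))  # per-class accuracies
    return np.mean(accs), np.mean(cms, axis=0)
```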

The k-NN learned to distinguish most of the categories quite well. Some categories were easier to learn than others - airport, park, shopping_mall and street_traffic had classification accuracies of at least 90 %, while metro, street_pedestrian and tram had accuracies below 70 %. In particular, metro was often confused with metro_station, which seems logical since their environments overlap. It is also interesting to point out that the confusion did not occur in the opposite direction as often, that is, metro_station was not confused with metro as frequently. Metro and tram were also confused for more than 10 % of the samples - again this seems rational, as the sound of a metro with passengers could resemble that of a tram. Average performance of 70-90 % was achieved for the categories bus, metro_station and public_square.

This architecture was used because the simple design of the network was observed to work for a similar problem in the study. The network consists of three dense layers of 512, 128 and 10 perceptrons, respectively, with ReLU activation for the first two layers and softmax activation for the final layer, as shown in figure 3.3. This amounts to 67,064 parameters, which is very low in the domain of deep learning. A uniform kernel initializer was used, and the biases were initialized to zero. The model uses binary cross-entropy loss, the Adam optimizer and accuracy as the metric, based on which early stopping is conducted.
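A sketch of this architecture is shown below, assuming a Keras/TensorFlow implementation; the 512-dimensional input is an assumption (it depends on the chosen OpenL3 embedding size), so the exact parameter count may differ from the 67,064 quoted above.

```python
import tensorflow as tf

EMBEDDING_DIM = 512   # assumed OpenL3 embedding size
NUM_CLASSES = 10
init = tf.keras.initializers.RandomUniform()  # the "uniform" kernel initializer

model = tf.keras.Sequential([
    tf.keras.Input(shape=(EMBEDDING_DIM,)),
    tf.keras.layers.Dense(512, activation="relu", kernel_initializer=init,
                          bias_initializer="zeros"),
    tf.keras.layers.Dense(128, activation="relu", kernel_initializer=init,
                          bias_initializer="zeros"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax",
                          kernel_initializer=init, bias_initializer="zeros"),
])

# Binary cross-entropy loss, Adam optimizer and accuracy as the monitored metric.
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.summary()
```

Note that binary cross-entropy with a ten-way softmax output is an unusual pairing; categorical cross-entropy would be the more common choice, but the loss above follows the description in the text.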


The training set was split into training and validation sets by randomly splitting the data in a 70/30 proportion, with the constraint that no samples sharing a group identification label with samples in the training set were included in the validation set. This is important since the label defines the location where the audio sample was recorded - the validation is thus performed on audio data from disjoint locations. Based on the validation set, it was possible to recognize whether the FNN was starting to overfit. The FNN was trained with a batch size of 512 for an arbitrary number of epochs until early stopping commenced. The early stopping parameters used were a minimum delta of 0, a patience of 20 epochs and a baseline requirement of 60 %. In practice, this resulted in training lasting between 250 and 700 epochs. Obviously, the chosen hyperparameters could be optimized further.

The chosen values were ad hoc, based on a few iterations with different values. A proper optimization would use more advanced methods such as grid search.
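A sketch of the group-constrained split and the early-stopping configuration described above is given below, assuming scikit-learn's GroupShuffleSplit and the Keras EarlyStopping callback; the placeholder arrays and the monitored quantity (validation accuracy) are assumptions.

```python
import numpy as np
import tensorflow as tf
from sklearn.model_selection import GroupShuffleSplit

# Placeholder arrays: X holds frame embeddings, y one-hot scene labels and
# groups the recording-location identifiers of the original training set.
X = np.random.rand(1000, 512).astype("float32")
y = tf.keras.utils.to_categorical(np.random.randint(0, 10, size=1000), 10)
groups = np.random.randint(0, 100, size=1000)

# 70/30 split with the constraint that no location group appears in both subsets.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, val_idx = next(splitter.split(X, y, groups=groups))

# Early stopping: minimum delta 0, patience of 20 epochs and a 60 % baseline;
# monitoring the validation accuracy is an assumption.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_accuracy", min_delta=0.0, patience=20, baseline=0.6)

# model.fit(X[train_idx], y[train_idx], batch_size=512, epochs=10_000,
#           validation_data=(X[val_idx], y[val_idx]), callbacks=[early_stop])
```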

Since the embeddings were of raw data, meaning that each embedding represents one frame of the 10-second audio clip, a prediction was made for each frame. The final class for the audio clip was then decided by a majority vote among the frame-level predictions, i.e. the most often predicted class was chosen.
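A minimal sketch of this frame-wise prediction and majority vote, assuming the trained Keras model from above; the function and argument names are illustrative.

```python
import numpy as np

def predict_clip(model, clip_frames, n_classes=10):
    """Predict a class for each frame embedding of one 10-second clip and
    return the majority-voted clip-level class."""
    frame_probs = model.predict(clip_frames, verbose=0)    # (n_frames, n_classes)
    frame_classes = np.argmax(frame_probs, axis=1)         # class per frame
    return int(np.bincount(frame_classes, minlength=n_classes).argmax())
```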

The accuracy of the FNN model while training is shown in figure 3.4.

Figure 3.4. FNN model accuracy over epochs trained; a randomly selected plot from the five runs.

The training curve shown is from an individual execution, but it represents all executions well. The number of epochs trained grows from left to right, and the model accuracy grows from bottom to top. The blue line shows the accuracy on the training data and the orange line the accuracy on the validation data. The accuracy on the training data improves logarithmically as the number of epochs increases, whereas the accuracy on the validation data appears to saturate in the earlier epochs.

As with the k-NN, the results varied slightly between program executions, and the program was run five times. The results were then averaged, yielding an average overall accuracy of 81 %. The model accuracy in figure 3.4 therefore seems unusually high and does not represent the actual classification accuracy of the system on the test data, given that the actual results at the end of training are much lower. Possibly, the train/validation split resulted in a combination that allows this high validation performance. The confusion matrix of the averaged per-class accuracies is shown in figure 3.5.

The FNN learned to discriminate the categories very well. The categories airport, bus, park, shopping_mall and street_traffic scored above 90 % accuracy, with bus having the highest accuracy among all classes at 97 %. Accuracies below 70 % were scored by metro, public_square and street_pedestrian. The FNN also often confused metro with metro_station and tram, which, as stated in section 3.4, seems logical. Average performance of 70-90 % was achieved for the categories metro_station and tram. It seems that the FNN tended to learn some categories well while the classification accuracies of a few categories remained low.

Figure 3.5. Confusion matrix for the average FNN predictions.

4 DISCUSSION

The classification accuracies of the baseline system (provided in the DCASE challenge), k-NN and FNN are presented in table 4.1.

Table 4.1. Summary of the results for comparison between the baseline, k-NN and FNN.

                    baseline (ACC [%])   k-NN (ACC [%])   FNN (ACC [%])
Average accuracy            63                 76               81

Both classifiers that used the OpenL3 embeddings had a significantly better overall average accuracy than the baseline system. The FNN classifier also outperformed the k-NN. The class-wise performances of the k-NN and the FNN were quite similar, although the FNN scored higher or similar accuracies for all categories except public_square. Interestingly, the baseline system achieved a much higher accuracy in the metro category than the k-NN and the FNN.

The baseline system used log mel-band energies computed over 40 ms windows with a 50 % hop size as features, trained with a network consisting of two convolutional neural network (CNN) layers and one fully connected layer [12]. These are not discussed in detail in this thesis, but it is sufficient to say that log mel-band energies are a feature representation that approximates the human auditory system's response more closely than linearly spaced frequency bands and could therefore allow for a better representation of sound. In contrast, a linear representation of the audio was used with the L3-embeddings. The disparity between the features of the baseline system and those of the k-NN and FNN may explain why they also perform very differently - for example, the baseline system seems to work better for the metro class, but it struggles to discriminate airport, where the k-NN and FNN seem to excel.
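For illustration, features of this kind could be computed roughly as follows with librosa; only the 40 ms window and 50 % hop follow the description above, while the sampling rate and the number of mel bands are assumptions.

```python
import librosa

def log_mel_energies(path, sr=48000, n_mels=40):
    """Log mel-band energies with a 40 ms analysis window and 50 % hop."""
    audio, sr = librosa.load(path, sr=sr)
    n_fft = int(0.040 * sr)       # 40 ms window
    hop_length = n_fft // 2       # 50 % hop size
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel)   # shape: (n_mels, n_frames)
```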

The FNN comes close to the state-of-the-art performance on the task at hand and is also very simple, with only 67 k parameters. In comparison, the top system is an ensemble of 7 CNNs with 48 M parameters. Based on these results, it seems plausible that the OpenL3 embeddings can be used in transfer learning to achieve high performance in audio-related tasks, which is the same observation that the authors in [6] made.

It is beneficial to compare the performance of the classifiers against a human equivalent to better understand how good the system actually is at recognizing different auditory scenes. Unfortunately, at the time of writing this thesis, there were no statistics on human performance on the task's data set. However, human performance was tested with somewhat similar data in the TUT Acoustic Scenes 2016 dataset, which contained 30-second samples from 15 different categories [19]. The subjects were first briefly familiarized with some examples of the different categories, and then they undertook the tests.

The obtained confusion matrix of the participants’ accuracies is presented in figure 4.1.

Figure 4.1. Human performance on the audio environment classification task in the Acoustic Scenes 2016 dataset.

As can be seen, the TUT Acoustic Scenes 2016 dataset contains a few categories that are the same as in TAU Urban Acoustic Scenes 2019. However, a straightforward comparison between the performances in these two challenges is not sensible because they operate on different data. The subjects had difficulties in discriminating many of the acoustic scenes, and on average did not reach 90 % accuracy in any of the categories.

On average, they had a total accuracy of 54 %. It is to be noted that the study also featured an expert listener, who had trained explicitly with the data before testing - they had an average accuracy of 77 %. In comparison, the baseline and state-of-the-art implementations for the challenge scored average accuracies of 77 % and 90 %, respectively [18]. It can be safely stated that machine learning approaches usually outclass human performance on audio environment classification in these closed experiments.

5 CONCLUSIONS

This thesis proposed a transfer learning method using L3 embeddings for the TAU Urban Acoustic Scenes 2019 audio environment classification problem. A system using k-NN and FNN classifiers was implemented for this task. Over five runs, the k-NN and the FNN models achieved average accuracies of 76 % and 81 %, respectively. These performances were achieved without extensive optimization of the hyperparameters or, in the latter case, the layer configuration. They outperform the baseline system, which had an average accuracy of 63 %, and come close to the state-of-the-art implementation, which had an accuracy of 85 %, with a much simpler model - in the case of the FNN, with only 0.14 % of the weights.

Based on the obtained results, it seems that the L3-embeddings generalize well to various downstream tasks in the audio domain. With these embeddings, models that utilize deep learning can achieve better results than simpler ones - here the FNN model reached a classification accuracy 5 percentage points higher than the k-NN model. Transfer learning with the L3-embeddings could yield even better results with implementations optimized for the downstream task, perhaps competing with the state-of-the-art systems or even outperforming them. This task is left open for future research.

REFERENCES

[1] E. Alpaydin. Introduction to Machine Learning. MIT Press, 2020.

[2] N. S. Altman. An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician 46.3 (1992), 175–185.

[3] R. Arandjelovic and A. Zisserman. Look, listen and learn. Proceedings of the IEEE International Conference on Computer Vision. 2017, 609–617.

[4] J. Boelaert and É. Ollion. The Great Regression. Revue française de sociologie 59.3 (2018), 475–506.

[5] M. Claesen and B. De Moor. Hyperparameter search in machine learning. arXiv preprint arXiv:1502.02127 (2015).

[6] J. Cramer, H.-H. Wu, J. Salamon and J. P. Bello. Look, listen, and learn more: Design choices for deep audio embeddings. ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2019, 3852–3856.

[7] J. W. Dennis. Sound event recognition and classification in unstructured environments. PhD thesis. Nanyang Technological University, 2011.

[8] R. O. Duda, P. E. Hart and D. G. Stork. Pattern Classification. John Wiley & Sons, 2012.

[9] D. P. Ellis. A history and overview of machine listening. (2010).

[10] I. Goodfellow, Y. Bengio and A. Courville. Deep Learning. MIT Press, 2016.

[11] M. Grant. Nonparametric Statistics: Overview. 2019.

[12] T. Heittola. DCASE2019 Challenge Task 1 baseline system. 2019. URL: https://github.com/toni-heittola/dcase2019_task1_baseline.

[13] K. Hinkelmann. Neural Networks, p. 7. 2018.

[14] G. Hinton. Advanced Machine Learning, Lecture 10: Recurrent neural networks. 2013.

[15] W. Koehrsen. Neural network embeddings explained. Towards Data Science, via Medium, October 2 (2018).

[16] M. Kuhn and K. Johnson. Applied Predictive Modeling. Vol. 26. Springer, 2013.

[17] N. D. Lane, P. Georgiev and L. Qendro. DeepEar: robust smartphone audio sensing in unconstrained acoustic environments using deep learning. Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing. 2015, 283–294.

[18] A. Mesaros, T. Heittola, E. Benetos, P. Foster, M. Lagrange, T. Virtanen and M. D. Plumbley. Detection and Classification of Acoustic Scenes and Events: Outcome of the DCASE 2016 Challenge. IEEE/ACM Transactions on Audio, Speech, and Language Processing 26.2 (Feb. 2018), 379–393. ISSN: 2329-9290. DOI: 10.1109/TASLP.2017.2778423.

[20] A. Mesaros, T. Heittola and T. Virtanen. A multi-device dataset for urban acoustic scene classification. Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018). Nov. 2018, 9–13. URL: https://arxiv.org/abs/1807.09840.

[21] A. Pillos, K. Alghamidi, N. Alzamel, V. Pavlov and S. Machanavajhala. A real-time environmental sound recognition system for the Android OS. Proceedings of Detection and Classification of Acoustic Scenes and Events (2016).

[22] H. Purwins, B. Li, T. Virtanen, J. Schlüter, S.-Y. Chang and T. Sainath. Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13.2 (2019), 206–219.

[23] S. Ravindran and D. V. Anderson. Audio classification and scene recognition for hearing aids. 2005 IEEE International Symposium on Circuits and Systems. IEEE. 2005, 860–863.

[24] S. Russell, P. Norvig et al. Artificial Intelligence: A Modern Approach. Pearson Education Limited, 2013.

[25] D. Sarkar. A Comprehensive Hands-on Guide to Transfer Learning with Real-World Applications in Deep Learning. 2018.

[26] H. Tang, A. M. Scaife and J. Leahy. Transfer learning for radio galaxy classification. Monthly Notices of the Royal Astronomical Society 488.3 (2019), 3358–3375.

[27] L. Torrey and J. Shavlik. Transfer learning. Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques. IGI Global, 2010, 242–264.

[28] T. Virtanen, M. D. Plumbley and D. Ellis. Computational Analysis of Sound Scenes and Events. Springer, 2018.

[29] W. Wang. Machine Audition: Principles, Algorithms, and Systems. IGI Global, 2011.

[30] J. West, D. Ventura and S. Warnick. Spring research presentation: A theoretical foundation for inductive transfer. Brigham Young University, College of Physical and Mathematical Sciences 1.08 (2007).

[31] J. Yosinski, J. Clune, Y. Bengio and H. Lipson. How transferable are features in deep neural networks? Advances in Neural Information Processing Systems. 2014, 3320–3328.

[32] J. Zupan. Introduction to artificial neural network (ANN) methods: what they are and how to use them. Acta Chimica Slovenica 41 (1994), 327–327.
