Acoustic Scene Classification With L3 Embeddings: Transfer learning experiment


Joni Seppälä

ACOUSTIC SCENE CLASSIFICATION WITH L3 EMBEDDINGS

Transfer learning experiment

Faculty of Information Technology and Communication Sciences (ITC)
Bachelor's Thesis
May 2020


ABSTRACT

Joni Seppälä: Acoustic Scene Classification With L3 Embeddings
Bachelor's Thesis
Tampere University
Degree Programme in Computing and Electrical Engineering, BSc (Tech)
May 2020

Countless amounts of audio data are recorded daily in different environments. Being able to recognize the context of the audio automatically would be beneficial in many context-aware systems, such as hearing aids and smartphones. Although audio data is abundant, labelled audio data can be scarce in some domains. The objective of this thesis is to achieve the highest possible accuracy in the acoustic scene classification task using machine learning (ML), focusing primarily on a transfer learning approach with L3-embeddings, an approach that is robust to limited training data.

The thesis explores how well the L3-embeddings presented in the study Look, Listen and Learn can be applied to the TAU Urban Acoustic Scenes 2019 acoustic scene classification challenge, and how the choice of the downstream classifier might affect the performance. The thesis presents a review of essential theories related to the considered approach, outlines the system that was implemented and compares the obtained results to the baseline, state-of-the-art and human performance.

The implemented system for the task includes training a model with either a k-nearest neighbors (k-NN) or a feed-forward neural network (FNN) classifier. Audio files are given as an input to the OpenL3 library, which generates compressed features, called embeddings, based on them. These embeddings are further given as an input to the system, which uses them for training and testing of the model.

The obtained results reveal that the chosen method works well. Although the hyperparameters of the model were not optimized, the FNN classifier achieved an average accuracy of 81 %, which is close to the state-of-the-art accuracy of 85 %, with a much simpler model and only a small fraction of the parameters. The results also indicate that the chosen classifier significantly affects the obtained accuracies. The average accuracy of the k-NN classifier was 76 %, which is notably less than that achieved by the FNN.

Keywords: Open L3, transfer learning, machine learning

The originality of this thesis has been checked using the Turnitin OriginalityCheck service.


ABSTRACT (IN FINNISH)

Joni Seppälä: Acoustic Scene Classifier With L3 Embeddings
Bachelor's Thesis
Tampere University
Degree Programme in Computing and Electrical Engineering, BSc (Tech)
May 2020

Very large amounts of audio are recorded daily in different environments. From a recorded audio track it is often possible to infer the kind of environment in which it was recorded. This information can be used in various intelligent applications, such as hearing aids and smartphones. Although audio data is abundantly available, labelled audio is scarce in some domains. The goal of this work is to achieve the highest possible accuracy in classifying the environment based on audio using machine learning, focusing in particular on transfer learning with L3-embeddings; this approach is also well suited to relatively small amounts of training data.

The work studies how well the L3-embeddings shared in the study Look, Listen and Learn More are applicable to the TAU Urban Acoustic Scenes 2019 challenge. In the challenge, the aim is to classify audio tracks as accurately as possible according to their true recording environment. In addition, the work studies how the chosen classifier affects the achieved performance. The work presents theory essential to the chosen method, outlines the system created for the task, and compares the obtained results with the baseline system, the best results achieved by anyone, and human performance.

The program developed for this work contains a trainable model and algorithms for training it with the k-nearest neighbors (k-NN) algorithm or a feed-forward neural network (FNN). Audio tracks are fed as input to the OpenL3 library, which generates compressed feature matrices from them. The feature matrices are then fed to the program, which uses them for training and testing the model.

The obtained results speak in favor of the chosen method. Although the hyperparameters of the model were not optimized, the system achieved an average accuracy of 81 % with the feed-forward neural networks, which is close to the best accuracy anyone has achieved in the task, 85 %, with a much simpler model and only a fraction of the parameters used. The results also showed that the chosen training algorithm significantly affects the final results. The accuracy achieved with the k-nearest neighbors algorithm was on average 76 %, which is clearly less than the accuracy achieved with the neural networks.

Keywords: Open L3, transfer learning, machine learning

The originality of this publication has been checked using the Turnitin OriginalityCheck service.


PREFACE

I would like to express my most profound appreciation to my thesis supervisor Annamaria Mesaros for her guidance and patience, which cannot be overstated, throughout the work on this thesis. I would also like to acknowledge the assistance of Toni Heittola, who helped me solve the programming-related problems that I could not overcome by myself. I also wish to thank my friends and family, who supported me through the times of writing this thesis and gave me practical suggestions whenever I presented my work.

Tampere, 14th May 2020

Joni Seppälä


CONTENTS

1 Introduction
2 Background
  2.1 Machine listening
  2.2 Supervised learning
  2.3 Transfer learning
  2.4 Embeddings
  2.5 Classifiers
    2.5.1 k-nearest neighbors
    2.5.2 Feed-forward neural network
3 Experiments
  3.1 System overview
  3.2 Dataset
  3.3 Feature extraction
  3.4 k-Nearest Neighbors
  3.5 Feed-forward Neural Network
4 Discussion
5 Conclusions
References


LIST OF SYMBOLS AND ABBREVIATIONS

CNN     convolutional neural network
DCASE   Detection and Classification of Acoustic Scenes and Events
DNN     deep neural network
FNN     feed-forward neural network
k-NN    k-nearest neighbors
L3      Look, Listen and Learn
ML      machine learning
MLP     multi-layer perceptron
RNN     recurrent neural network
SVM     support vector machine


1 INTRODUCTION

We are surrounded by sound in everyday life. Certain sounds are more prevalent in some environments than in others. For example, the sound of a bird singing could make us think that we are situated in a park, not in a subway. Based on this knowledge, a system can be implemented to automatically analyze the audio and classify in which environment the listener is situated. In turn, this contextual information can be used in various context-aware applications, such as hearing aids and smartphones [23], [17].

Successfully classifying audio into different contexts is a difficult task, and usually a lot of training data and a complex model are needed to achieve satisfactory performance. However, complex models often imply high computational demand, and labeled audio data might be scarce in some domains [22]. This thesis focuses on a transfer learning method in machine learning (ML) to implement a system that can achieve a high classification accuracy with a simple model even under data scarcity.

In the study [3] the authors developed and trained a system to solve the audio-visual correspondence learning task. The system consisted of a vision sub-network, an audio sub-network and fusion layers. The authors then extracted features from the audio sub-network, which they call the L3-Net embeddings. These embeddings were used in a transfer learning ML method to achieve a state-of-the-art classification accuracy on environmental sound classification (ESC-50) and detection and classification of acoustic scenes and events (DCASE) tasks.

A question is raised: does the L3 generalize to other data sets? The authors in [6] implemented an open-source library, called OpenL3, that generates embeddings for input audio files. They focused their research on how different design choices in the L3-Net affected the performance of downstream audio classifiers and reached state-of-the-art performance on the UrbanSound8K dataset using L3-Net embeddings with a mel-based input representation and a simple 2-layer multi-layer perceptron (MLP). Based on this, there is evidence that the L3 embeddings could generalize to other tasks as well.

However, the authors did not explore the design choices made with the downstream classifier. In this thesis, two shallow classifiers, namely the k-nearest neighbors and feed-forward neural network classifiers, are compared to understand how the choice of the downstream classifier might affect the system performance.

The TAU Urban Acoustic Scenes 2019 task was developed for the DCASE challenge to compare audio scene classification systems' performance on real-life audio in different urban environments [20]. In this thesis, I try to validate that OpenL3 works with this task as well, while comparing the "classical" and "modern" (deep) models' performance.

The structure of the rest of this thesis is as follows: first, the theoretical foundations of topics relevant to the thesis are considered in Chapter 2; then the implemented system for the TAU Urban Acoustic Scenes 2019 task is described in Chapter 3; after that the obtained results are analyzed in Chapter 4; and finally, the conclusions of this thesis are presented in Chapter 5.


2 BACKGROUND

2.1 Machine listening

Many actions and events cause sound. A listening agent can interpret the properties of sound as information, be it a message, music or cues about the environment. When hearing a sound, people perceive it slightly differently from one another, but similarly, and automatically try to extract knowledge about it. This process is quick and autonomous.

It is then natural that many applications have been created that try to achieve human performance, and improve upon it, on tasks that include interaction with sound. The field that studies how machines can interact with sound is called machine listening or machine audition. It refers to the collection of methods and algorithms that extract information from audio and do something useful with it [29].

Whereas a human listener excels at dynamically adapting to the demands that complicated, unknown, varying and unlimited environments pose, many factors make it tough for a machine listening application to infer knowledge from a received sound signal [29]. There can be multiple sources that emit sound with various interferences; the listener hears the combined stream of these sounds. Besides, naturally emitted sounds occur differently from event to event [7]. To attain a representation that is more suitable for analysis applications, some preprocessing is usually applied to sound signals before analysis. Sound signals are sparse and can be compressed to denser feature vectors [7]. According to the study [9], following the factors mentioned above, machine listening applications can be divided into three phases: representation, alignment & comparison, and recognition. Representation is about how the audio data should be represented, e.g. what kind of time-frequency representation should be used. Alignment & comparison is about removing overlap and interference. Recognition is the part where application-specific operations, such as analysis and classification, are executed.

This thesis considers mostly environmental sounds. They differ from music and speech in structure; while certain sounds are temporally continuous, like the sound of a car passing by, these events can be modelled as separate acoustic events that are independent of each other, whereas in music all notes are heavily temporally linked to each other (melody, rhythm). Different events consist of natural (rain, animal noise) and artificial (loudspeaker, car) sounds [21]. Environmental sound recognition systems can take advantage of the fact that the sounds can be segregated into different acoustic events. The big picture of the audio scene can then be built upon the classification of the smaller time events, since certain acoustic events occur more frequently in some environments than in others. This way, different environments can be discriminated [7].

2.2 Supervised learning

Machine learning (ML) is usually divided into supervised, unsupervised and reinforcement learning. This thesis deals with methods of supervised learning. An ML system that ought to learn needs some training data. Ideally, a classifier observing the training data results in its learning. Learning in a classification problem, such as the one in this thesis, usually refers to a process that increases the classifier's prediction accuracy on unseen data similar to the training data [8], [1]. Supervised learning is a machine learning method where the desired output for all the training data is given [24]. This way, the classifier can iterate through the training samples and observe whether its prediction for each sample was correct, and if it was wrong, observe how it was wrong as well. Mathematically, the classifier fits a function selected from the hypothesis space that predicts the output for unseen data as well as possible [24].

Supervised learning can be divided into classification and regression. Classification is used when the ground truth for the labels is set in a finite space of solutions. Meanwhile, regression is used when the ground truth is a numerical value from a specific range [24]. The problem studied in this thesis uses classification since the true labels consist of 10 disjoint classes.

The supervised learning process for a classification problem is presented in figure 2.1. Prior to the process, the data should be divided into train and test samples, and the train samples should have some metadata which includes the information of the ground truth for the data. The classifier can be in one of its two stages: learning or testing.


Figure 2.1. Supervised learning process for a classification problem.

When the classifier is learning, the upper block is used. First, the training data features are extracted, meaning that some operations are carried out to modify the dimensionality of the original raw data to fit the problem description better and thereby help the classifier understand the data better. The data is referred to as features afterwards, as they are usually a representation better suited to the task. In the classification step, the classifier predicts a label for the data it has seen and compares it to the ground truth value. Depending on whether the prediction was correct or not, the classifier makes adjustments to its parameters to better predict the samples in the future. This training process can be effective if the problem is not ill-posed, e.g. there is sufficient training data, the computational requirements of the classifier are met, and the classifier's hyperparameters are favourably set [28], [5].

When the classifier is in the test mode, the testing data goes through feature extraction and classification, similarly to the training data in the learning mode. Here, however, the ground truth labels are not provided to the classifier, and therefore the model is not learning; only the predictions are generated. These predictions can be compared to their ground truth values externally to measure the system performance on unseen data. This is desired since the performance of the model is usually measured as the prediction accuracy for unseen data, not for the data which has already been shown [16]. The test data is something the system has not encountered in training, as in a real use case.
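The following is a toy sketch of this train/test process in scikit-learn; the data is synthetic and purely illustrative, not the thesis code:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic feature vectors and ground truth labels for 10 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
y = rng.integers(0, 10, size=200)

# Disjoint train and test sets, as in figure 2.1.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X_train, y_train)            # learning stage: ground truth provided
print(clf.score(X_test, y_test))     # testing stage: accuracy on unseen data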

2.3 Transfer learning

Transfer learning, or domain adaptation, is the idea of applying intelligence acquired from one task to solve related ones; in other words, not solving a problem in isolation when accomplished information from a similar domain already exists. As a machine learning method, a model developed for a source task is transferred to be used as the starting point of another model on a related target task. However, the tasks do not explicitly need to be related [10], [30]. If the second task is similar to the original task for which the model was developed, a range of benefits can be achieved in generating as good a model as possible, though the benefits are not evident until the model is developed and evaluated. The importance of transfer learning as the next driver for ML success has been highlighted recently [25].

Transfer learning is to be considered when training of the model requires computationally expensive and time-consuming calculations. This is why the method is widespread especially in deep learning, in which the computational complexity is high. According to [27], the possible benefits of transfer learning are:

1. A head-start in the initial accuracy of the model
2. A faster rate of improvement in the accuracy of the model
3. Improved converging accuracy of the model

Another consideration that might promote the method of transfer learning is data abundance in a related source domain or data scarcity in the target domain. Data scarcity could lead the prediction accuracy to never converge to an acceptable level, since there is not enough data to learn the underlying phenomenon that caused the data [25]. In this case, the third possible benefit listed is emphasized.

Whether to apply transfer learning, and from which source model, is a challenging problem that usually requires domain knowledge and intuition developed via expertise [25]. The problem presented in this thesis uses transfer learning, where the transferred knowledge is collected from a deep neural network (DNN). In DNNs, transfer learning has better prospects if the model features learned from the first task are general. The further the original domain is from the target domain, the less transferability remains. However, transfer learning for even distant domains can lead to improvements in performance compared to initially random features [31].

2.4 Embeddings

Look, Listen and Learn, later referred to as L3, is a proposed solution to an audio-visual correspondence learning task [3]. The L3 solution introduced a combination of three deep neural networks: one for audio data, one for visual data and one for the fusion of the previous two. These networks were trained with self-supervised learning of audio-visual correspondence in videos, where self-supervised means that the learning data provided the supervision (see section 2.2 for more details on supervised learning). The model was trained with 400 000 instances of 10-second videos.

In the study Look, Listen and Learn More the authors introduced a transfer learning method that uses the last audio layer of the deep neural network presented in the L3 project, combined with audio files, to acquire deep audio embeddings [6]. In the context of ML, embeddings are "low-dimensional, learned continuous vector representations of discrete variables" [15] which can be used as feature vectors in training the ML model. The purpose of the embeddings was to allow training of other systems with training data that especially suffers from data scarcity, which was also considered in section 2.3. The feature vectors contain elements that are a very compressed (dense) representation of the original (sparse) audio data. This ensures that the learning of the downstream model is effective.

2.5 Classifiers

It is part of human nature to classify things into different categories; for example, different living beings can be classified into animals, plants, mushrooms and bacteria. This thesis proposes a solution which classifies environmental sounds into different categories. Statistical classification is the problem of identifying to which of a set of categories an observation belongs, based on a training set of previous observations whose category membership is known. Classical classification methods refer here to methods whose foundations were formed around the 1950s for statistical and mathematical tasks, and they include such algorithms as k-nearest neighbors (k-NN) and support vector machines (SVM) [4]. Although modern methods are considered most important by today's standards, the classical methods are still used in some applications.

Modern classification methods are referred to in this thesis as methods that use neural networks. Though the fundamental theories of perceptrons were also formed in the 50s, these methods have become popular in recent years thanks to their increased performance on various ML problems. This, in turn, has been due to the increased availability of computational power for deeper model architectures [10].

Neural networks can be further divided into feed-forward neural networks (FNN), convolutional neural networks (CNN) and generative adversarial networks (GAN), among others. The L3-embeddings were tested on k-NN and FNN since both methods are capable enough for demonstrating the L3-embeddings in action, and each is among the simplest of its kind (classical, modern). Additionally, both of them performed well in initial tests.

2.5.1 k-nearest neighbors

In machine learning, the k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression. It stores all training data and classifies test data based on a similarity measure (e.g., distance functions). Non-parametric methods belong to the branch of non-parametric statistics, where the data is not modelled with a probability distribution [2]. Instead, the shape of the distribution is estimated in the form of statistical measurements. This rejection of the probability distribution assumption makes non-parametric methods more applicable in situations where little is known about the application. However, non-parametric methods suffer from a lower degree of confidence than parametric methods in cases where the distribution of the data is known [11].

The k-NN is an example-based algorithm, meaning that the predictions it makes for test data are based solely on the exact properties of the training data. In supervised learning, for a classification problem, this classifier design is straightforward to implement. In the training phase, the classifier is trained with all the training samples, and in the testing phase, each test sample is compared against all training samples. The predicted label is the most common label occurring among the k nearest training samples. The most common label is usually resolved via majority voting, which means that if the ground truth labels of the k nearest training neighbors differ, the most popular ground truth among the neighbors is picked as the prediction. This leads to jagged decision boundaries [8], as illustrated in figure 2.2 for randomly generated data points plotted in 2D space.


Figure 2.2. 1-NN for 4-class data, generated using the tool at vision.stanford.edu/teaching/cs231n-demos/knn/.

The nearest neighbors of a given test sample are the samples that have the smallest distance to it. The distance can be computed with different metrics; in this thesis, the Euclidean and Minkowski metrics are presented. When the distance between a test sample and a train sample is measured, the samples can be presented as vectors a and b. When the samples are in d dimensions, the Euclidean formula can be presented as

D(\mathbf{a}, \mathbf{b}) = \left( \sum_{k=1}^{d} (a_k - b_k)^2 \right)^{1/2}    (2.1)

In this thesis, scikit-learn's k-NN default metric, the Minkowski distance, is used instead. For d-dimensional patterns it can be presented as

L_k(\mathbf{a}, \mathbf{b}) = \left( \sum_{i=1}^{d} |a_i - b_i|^k \right)^{1/k}    (2.2)
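As a concrete check of equations 2.1 and 2.2, the following NumPy sketch (not part of the thesis code) computes the Minkowski distance; with k = 2 it reduces to the Euclidean distance:

import numpy as np

def minkowski(a, b, k):
    # Equation 2.2: L_k(a, b) = (sum_i |a_i - b_i|^k)^(1/k)
    return float(np.sum(np.abs(a - b) ** k) ** (1.0 / k))

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])
print(minkowski(a, b, 2))       # 5.0, the Euclidean distance of equation 2.1
print(np.linalg.norm(a - b))    # 5.0, NumPy's built-in Euclidean norm agrees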

Though the k-NN classifier has a simple implementation, the computational burden of its algorithm grows massively as the amount and volume of the training data grow. Suppose there are n training samples in d dimensions; each distance calculation is asymptotically O(d), and thus the search is O(dn^2) [8].

k-NN was chosen to represent the classical ML classifiers since the embeddings already compress the data into dense feature vectors, making overfitting less likely (than for more complex models).

2.5.2 Feed-forward neural network

Artificial neural networks (ANN; NN for "neural network") were inspired by biological neural networks that loosely mimic animal brains. The architecture of an ANN consists of neural units, or perceptrons, that are connected, meaning that they can transmit signals to other perceptrons. They have different properties, usually consisting of weights for different inputs, a bias and an activation function. These properties modify the signal received from earlier neural units [32]. A feed-forward neural network (FNN) is an artificial neural network wherein connections between the neuron units do not form cycles. In an FNN, signals travel through the layers in only one direction: from the first (input) layer to the middle (hidden) layers to the last (output) layer [10]. It should be noted that an FNN with more than one layer is often called a multi-layer perceptron (MLP); in this thesis, I stick to the term FNN.

A general learning process of an FNN is illustrated in figure 2.3 and is briefly discussed in this section.


Figure 2.3. FNN learning process.


First, the weights of the perceptrons are initialized. There are many ways of doing this, zero and random initialization being the most basic methods. The weights could also be inherited from another model; this method is related to transfer learning [26]. Second, forward propagation is conducted: the NN predicts a label for given input data. The perceptrons forward signals, modified with the activation function used, to the next layer of perceptrons, until the final layer is reached, which makes predictions with the modified signals. The predictions are then compared against the ground truth labels for the data, and a loss function is calculated based on the differences between the predictions and the ground truths. Next, the derivative of the loss function is backpropagated in reverse through the layers, based on gradient descent. The weights of the perceptrons are updated with the derivative of the loss function, according to the learning rate. The whole process is iterated until convergence of the results is achieved, or the process is manually interrupted. In this thesis, convergence is determined via early stopping, which terminates the iteration when the accuracy of the predictions on validation data has not increased for a certain number of iterations.

The activation function is explained here in more detail. To train deep neural networks that can learn non-linear decision boundaries, a non-linear activation function is needed; this allows complex relationships in the data to be learned [8]. In essence, the activation function converts the input signal to an output signal [13]. The activation function can be a sigmoid, a rectified linear unit (ReLU) or softmax, among others.

The FNN in this thesis uses the ReLU activation function for connections between perceptrons and the softmax activation function for the final layer. ReLU is a piece-wise linear hidden unit [10]: a calculation that returns the input value if it is greater than zero, and zero otherwise. Softmax is used in multi-class single-label classification problems in neural networks, such as the one considered in this thesis, to map the output of a network to a probability distribution over the possible output classes; it is the activation function of the final layer of the neural network. In [8] it is defined as

z_k = \frac{e^{net_k}}{\sum_{m=1}^{c} e^{net_m}}    (2.3)

The result is then transformed to 1 for the maximum output class, and 0 for the other classes. This encoding follows a "winner takes all" principle: the most probable class is chosen as the prediction. The software implementation of the model requires a one-hot encoding, where the result is encoded to a vector of zeros and a single one for the most probable class.
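The sketch below illustrates equation 2.3 and the winner-takes-all one-hot encoding with NumPy (illustrative values, not the thesis code):

import numpy as np

def softmax(net):
    # Equation 2.3: z_k = exp(net_k) / sum_m exp(net_m)
    e = np.exp(net - net.max())  # subtracting the max improves numerical stability
    return e / e.sum()

net = np.array([1.2, 0.3, 2.5])       # raw outputs of the final layer
z = softmax(net)                      # probability distribution over the classes
one_hot = (z == z.max()).astype(int)  # winner takes all: 1 for the most probable class
print(z.round(2), one_hot)            # [0.2 0.08 0.72] [0 0 1]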

FNN was chosen to represent the modern (deep) classifiers since it has a relatively simple design for a neural network. In turn, the study [6] used it as well in transfer learning for a similar problem and obtained state-of-the-art results. It also gives perspective when comparing against the k-NN classifier, since training a deep model such as an FNN, like any neural network, is computationally more expensive than the training of most of the classical methods. Benefits of choosing FNN as the classifier are apparent if the resulting [...] This idea is not addressed any further in this thesis.


3 EXPERIMENTS

3.1 System overview

The complete system implemented in this thesis, simplified, is presented in figure 3.1.


Figure 3.1. The complete system implemented, simplified.

First, feature extraction was carried out. Audio samples were given as an input to the audio subnetwork of L3, which produced deep audio embeddings. The dimensionality of these embeddings was reduced if k-NN was used. The embeddings were given as an input to the k-NN and FNN classifiers. Having trained the classifiers, the classification accuracy for novel audio embeddings was tested with them. The test results were used to generate confusion matrices to help judge how the system performed. Finally, a submission file for the Kaggle competition was generated from the predictions.
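A sketch of the evaluation step using scikit-learn's confusion_matrix (the label arrays here are illustrative placeholders, not results from the thesis):

import numpy as np
from sklearn.metrics import confusion_matrix

classes = ["airport", "bus", "metro", "metro_station", "park",
           "public_square", "shopping_mall", "street_pedestrian",
           "street_traffic", "tram"]

# y_true: ground truth class indices of the test clips,
# y_pred: the classifier's predictions for the same clips.
y_true = np.array([0, 1, 2, 2, 3])
y_pred = np.array([0, 1, 3, 2, 3])
cm = confusion_matrix(y_true, y_pred, labels=range(len(classes)))
print(cm)   # rows are true classes, columns are predicted classes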

A Python program was written to implement all the functionality necessary to conduct the experiments described in this thesis. The program uses Python version 3.7 with the NumPy, scikit-learn and Keras libraries, and the source code can be found at https://gitlab.com/habbisify/audio-environment-classification-with-l3-embeddings.

3.2 Dataset

The dataset consists of recordings from ten different acoustic scenes in ten populous European cities in various locations; 5-6 minutes of audio was recorded in each location. The acoustic scenes are airport, shopping mall, metro station, pedestrian street, public square, street with an intermediate level of traffic, travelling by tram, bus or underground metro, and urban park. The data was recorded using electret binaural microphones, and the recordings were split into 10-second segments provided as separate files, totalling 40 hours of audio in 14400 segments. The segments were divided 10215/4185 into training and testing using a provided fold for uniform reporting of results among researchers using this data [20].

3.3 Feature extraction

The feature extraction step generates a compact representation, the embeddings, for classification. Well-chosen features help discriminate between classes, achieving small intra-class variability and large inter-class variability.

In the feature extraction step (see figure 3.1), audio samples are given as an input to OpenL3, which is an open-source Python library for computing deep audio embeddings [6] and acts as the audio sub-network. Using OpenL3's API, all the 10-second sound clips were transformed into NumPy arrays; these arrays are called the embeddings. OpenL3's input representation was set to linear, the content type was set to environmental and an embedding size of 512 was used. As stated in [6], mel spectrograms probably perform better than linear spectrograms; this choice of hyperparameters overlooked that detail. An embedding size of 6144 could have been selected instead of 512; this would probably have resulted in slightly better features, but the trade-off would have been much longer training times and storage problems.
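A sketch of this extraction step through the OpenL3 API with the settings named above (the file name is a placeholder):

import openl3
import soundfile as sf

# Load one 10-second clip and compute its deep audio embedding.
audio, sr = sf.read("airport-example.wav")   # placeholder file name
emb, timestamps = openl3.get_audio_embedding(
    audio, sr,
    input_repr="linear",    # linear spectrogram input, as chosen in this thesis
    content_type="env",     # environmental content type
    embedding_size=512)     # 512-dimensional embeddings
print(emb.shape)            # (n_frames, 512): one embedding per analysis frame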

For k-NN, feature extraction was followed by vectorization by average and standard deviation over the time axis to compress the data. This was done because, as mentioned in section 2.5.1, the k-NN is computationally very demanding for more extensive data; since the raw data was so plentiful, the computational complexity of k-NN would have resulted in the program (practically) never finishing, so the vectorization was needed. For FNN, no operations were done to reduce the data dimensions: the FNN could benefit from more complex data to reach better classification accuracy, and therefore the raw data was used.
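A sketch of the k-NN vectorization, assuming emb is an (n_frames, 512) embedding array; concatenating the mean and standard deviation is one plausible reading of the description above, not a confirmed detail of the thesis code:

import numpy as np

def vectorize(emb):
    # Compress the time axis into a fixed-length vector:
    # per-dimension mean and standard deviation, concatenated.
    return np.concatenate([emb.mean(axis=0), emb.std(axis=0)])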

3.4 k-Nearest Neighbors

The k-NN classifier was trained with the number of neighbors set to 1 (making it 1-NN), uniform weights, automatic selection of the search algorithm and the Minkowski metric.
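These settings map directly onto scikit-learn's KNeighborsClassifier (a sketch; X_train and y_train are assumed to hold the vectorized embeddings and their labels):

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(
    n_neighbors=1,         # 1-NN
    weights="uniform",
    algorithm="auto",
    metric="minkowski")    # scikit-learn's default metric, equation 2.2
# knn.fit(X_train, y_train)
# y_pred = knn.predict(X_test)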

Since the results varied slightly with each program execution, the program was run five times and the results were averaged. The average overall accuracy was 76 %. The confusion matrix for the average accuracies is shown in figure 3.2.


Figure 3.2. Confusion matrix for the average k-NN predictions.

The k-NN learned to distinguish most of the categories quite well. Some of the categories were easier to learn than others: airport, park, shopping_mall and street_traffic had a classification accuracy of at least 90 %, while metro, street_pedestrian and tram had an accuracy of less than 70 %. Particularly metro was often confused with metro_station, which seems logical since their environments overlap. It is also interesting to point out that the confusion did not happen in the opposite direction as often; that is, metro_station was not confused with metro. Metro and tram were also confused in more than 10 % of the samples; again this seems rational, as the sound of a metro with passengers could resemble that of a tram. An average performance of 70-90 % was achieved for the categories bus, metro_station and public_square.

3.5 Feed-forward Neural Network

An FNN architecture similar to that of [6] was used, because the simple design of the network was observed to work in that study for a similar problem. The network consists of 3 dense layers of 512, 128 and 10 perceptrons, respectively, with ReLU activation for the first two layers and softmax activation for the final layer, as shown in figure 3.3. This amounts to 67,064 parameters, which is very low in the domain of deep learning. A uniform kernel initializer was used, and the bias was initialized to zero. The model uses binary cross-entropy loss, the Adam optimizer and accuracy as the metric, based on which early stopping is conducted.


Figure 3.3. FNN layer configuration.
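A sketch of this configuration in Keras, following the prose above; whether the 512-unit stage is a trainable layer or the input embedding itself affects the exact parameter count, so treat this as an assumption rather than the thesis code:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(512, activation="relu", kernel_initializer="uniform",
          bias_initializer="zeros", input_shape=(512,)),
    Dense(128, activation="relu", kernel_initializer="uniform",
          bias_initializer="zeros"),
    Dense(10, activation="softmax"),  # one output per acoustic scene class
])
# Binary cross-entropy loss, Adam optimizer and accuracy metric, as stated above.
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])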

The train set was split into train and validation sets by randomly splitting the data in a proportion of 70/30, with the constraint that no samples with a group identification label belonging to the training set should be included in the validation set. This is important since the label defines the location where the audio sample was recorded; the validation is thus performed on audio data from disjoint locations. Based on the validation set, it was possible to recognize if the FNN was starting to overfit. The FNN was trained with a batch size of 512 for an arbitrary number of epochs until early stopping would commence. The early stopping parameters used were a minimum delta of 0, a patience of 20 epochs and a baseline requirement of 60 %. In practice this resulted in training lasting for between 250 and 700 epochs. Obviously, the chosen hyperparameters could be optimized further. The chosen values were ad hoc, based on a few iterations with different values; a proper optimization would use more advanced methods such as grid search.
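A sketch of the group-constrained split and the early stopping settings described above, using scikit-learn's GroupShuffleSplit and Keras's EarlyStopping (X, y and groups are assumed to hold the embeddings, labels and location identifiers):

from sklearn.model_selection import GroupShuffleSplit
from tensorflow.keras.callbacks import EarlyStopping

# No recording location (group) may appear in both training and validation sets.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, val_idx = next(splitter.split(X, y, groups=groups))

early_stop = EarlyStopping(monitor="val_accuracy", min_delta=0,
                           patience=20, baseline=0.6)
# model.fit(X[train_idx], y[train_idx], batch_size=512, epochs=1000,
#           validation_data=(X[val_idx], y[val_idx]), callbacks=[early_stop])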

Since the embeddings were of raw data, meaning that each embedding represents one frame of the 10-second audio clip, a prediction was made for each frame. The final class for the audio clip was then decided by a majority vote among the frame-level predictions.
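A sketch of this clip-level majority vote, assuming frame_preds holds the predicted class index for each frame of one clip:

import numpy as np

frame_preds = np.array([2, 2, 3, 2, 9, 2])    # per-frame class predictions
clip_pred = np.bincount(frame_preds, minlength=10).argmax()
print(clip_pred)                              # 2: the most frequent class wins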

The accuracy of the FNN model during training is shown in figure 3.4.

Figure 3.4. FNN model accuracy over epochs trained, randomly selected plot from the five runs.

The graphical representation of the training process shown is for an individual execution, but it represents all executions well. The number of epochs trained grows from left to right, and the accuracy of the model grows from bottom to top. The blue line represents the model accuracy on the training data; the orange line on the validation data. The model accuracy on the training data improves logarithmically as the number of epochs increases, but the accuracy on the validation data seemingly saturates in the earlier epochs.

As with the k-NN, the results varied slightly between program executions, and the program was run five times. The results were then averaged, resulting in an average overall accuracy of 81 %. The model accuracy in figure 3.4 therefore seems unusually high and does not represent the actual classification accuracy of the system on the test data, given that the actual results at the end of the training are much lower. Possibly, the train/validation split resulted in a combination that allows this high performance. The confusion matrix for the average accuracies is shown in figure 3.5.

The FNN learned to discriminate the categories very well. The categories airport, bus, park, shopping_mall and street_traffic scored above 90 % accuracy, with the accuracy of bus being the highest among all classes, at 97 %. Below 70 % accuracy was scored by metro, public_square and street_pedestrian. The FNN also often confused metro with metro_station and tram, which, as stated in section 3.4, seems logical. An average performance of 70-90 % was achieved for the categories metro_station and tram. It seems that the FNN tended to learn some categories well while a few categories' classification accuracies remained low.


Figure 3.5. Confusion matrix for the average FNN predictions.


4 DISCUSSION

The classification accuracies of the baseline system (provided in the DCASE challenge), k-NN and FNN are presented in table 4.1.

Table 4.1. Summary of the results for comparison between the baseline, k-NN and FNN.

Class               baseline (ACC [%])   k-NN (ACC [%])   FNN (ACC [%])
airport                      50                 91               94
bus                          61                 81               97
metro                        68                 38               49
metro_station                52                 77               86
park                         91                 92               92
public_square                35                 72               67
shopping_mall                69                 92               91
street_pedestrian            53                 58               65
street_traffic               86                 90               94
tram                         69                 67               77
Overall                      63                 76               81

Both classifiers that used the OpenL3 embeddings had a significantly better overall average accuracy than the baseline system. The FNN classifier also outperformed the k-NN. The class-wise performances of k-NN and FNN were quite similar, although the FNN scored higher or similar accuracies for all the categories except public_square. Interestingly, the baseline system got a much higher accuracy in the metro category compared to the k-NN and the FNN.

The baseline system used log mel-band energies of 40 ms windows with 50 % hop size as the features, trained with a network consisting of two convolutional neural network (CNN) layers and one fully connected layer [12]. These are not discussed in this thesis, but it suffices to say that log mel-band energies are a feature representation that approximates the human auditory system's response more closely than linearly spaced frequency bands, and therefore could allow for a better representation of sound. In contrast, a linear representation of audio was used with the L3-embeddings. The disparity between the features of the baseline system and those of the k-NN or FNN may explain why they also perform very differently; for example, the baseline system seems to work better for the metro class, but it struggles to discriminate airport, where the k-NN and FNN seem to excel.

The FNN classifier, in contrast, comes close to the state-of-the-art performance on the task at hand and is also very simple, with only 67 K parameters. In comparison, the top system is an ensemble of 7 CNNs with 48 M parameters. Based on these results, it seems plausible that the OpenL3 embeddings can be used in transfer learning to achieve high performance in audio-related tasks, which is the same observation that the authors in [6] made.

It is beneficial to compare the performance of the classifiers against a human equivalent to better understand how good the system actually is at recognizing different auditory scenes. Unfortunately, at the time of writing this thesis, there were no statistics on human performance on the task's data set. However, human performance was tested with somewhat similar data in the TUT Acoustic Scenes 2016 dataset, where there were 30-second samples of 15 different categories [19]. The subjects were first briefly familiarized with some examples of the different categories, and then they undertook the tests. The obtained confusion matrix of the participants' accuracies is presented in figure 4.1.

Figure 4.1. Human performance on the audio environment classification task in the TUT Acoustic Scenes 2016 dataset.

As can be seen, the TUT Acoustic Scenes 2016 dataset has a few categories that are the same as in TAU Urban Acoustic Scenes 2019. However, a straightforward comparison between the performances in these two challenges is not sensible because they operate on different data. The subjects had difficulties in discriminating many of the acoustic scenes, and on average did not score over 90 % in any of the categories. On average they had a total accuracy of 54 %. It is to be noted that the study also featured an expert listener, who had trained explicitly with the data before testing; they had an average accuracy of 77 %. In comparison, the baseline and state-of-the-art implementations for that challenge scored average accuracies of 77 % and 90 %, respectively [18]. It can be safely stated that machine learning approaches usually outclass human performance on audio environment classification in these closed experiments.


5 CONCLUSIONS

This thesis proposed a transfer learning method using L3 embeddings for the TAU Urban Acoustic Scenes 2019 audio environment classification problem. A system that used k-NN and FNN classifiers was implemented for this task. Over five runs, the k-NN and the FNN models achieved average accuracies of 76 % and 81 %, respectively. These performances were achieved without extensive optimization of the hyperparameters or, in the latter case, the layer configuration. They outperform the baseline system, which had an average accuracy of 63 %, and come close to the state-of-the-art implementation, which had an accuracy of 85 %, with a much simpler model; in the case of the FNN, with only 0.14 % of the weights.

Based on the obtained results, it seems that the L3-embeddings generalize well to various downstream tasks in the audio domain. Models that utilize deep learning can achieve better results with these embeddings than simpler ones; here the FNN model reached a 5 percentage point higher classification accuracy than the k-NN model. Transfer learning with the L3-embeddings could yield even better results with implementations optimized for the downstream task, perhaps competing with the state-of-the-art systems or even outperforming them. This task is left open for future research.


REFERENCES

[1] E. Alpaydin. Introduction to Machine Learning. MIT Press, 2020.

[2] N. S. Altman. An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician 46.3 (1992), 175–185.

[3] R. Arandjelovic and A. Zisserman. Look, listen and learn. Proceedings of the IEEE International Conference on Computer Vision. 2017, 609–617.

[4] J. Boelaert and É. Ollion. The Great Regression. Revue française de sociologie 59.3 (2018), 475–506.

[5] M. Claesen and B. De Moor. Hyperparameter search in machine learning. arXiv preprint arXiv:1502.02127 (2015).

[6] J. Cramer, H.-H. Wu, J. Salamon and J. P. Bello. Look, listen, and learn more: Design choices for deep audio embeddings. ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, 3852–3856.

[7] J. W. Dennis. Sound event recognition and classification in unstructured environments. PhD thesis. Nanyang Technological University, 2011.

[8] R. O. Duda, P. E. Hart and D. G. Stork. Pattern Classification. John Wiley & Sons, 2012.

[9] D. P. Ellis. A history and overview of machine listening. 2010.

[10] I. Goodfellow, Y. Bengio and A. Courville. Deep Learning. MIT Press, 2016.

[11] M. Grant. Nonparametric Statistics: Overview. 2019.

[12] T. Heittola. DCASE2019 Challenge Task 1 baseline system. 2019. URL: https://github.com/toni-heittola/dcase2019_task1_baseline.

[13] K. Hinkelmann. Neural Networks, p. 7. 2018.

[14] G. Hinton. Advanced Machine Learning, Lecture 10: Recurrent neural networks. 2013.

[15] W. Koehrsen. Neural network embeddings explained. Towards Data Science, via Medium, October 2 (2018).

[16] M. Kuhn and K. Johnson. Applied Predictive Modeling. Vol. 26. Springer, 2013.

[17] N. D. Lane, P. Georgiev and L. Qendro. DeepEar: robust smartphone audio sensing in unconstrained acoustic environments using deep learning. Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing. 2015, 283–294.

[18] A. Mesaros, T. Heittola, E. Benetos, P. Foster, M. Lagrange, T. Virtanen and M. D. Plumbley. Detection and Classification of Acoustic Scenes and Events: Outcome of the DCASE 2016 Challenge. IEEE/ACM Transactions on Audio, Speech, and Language Processing 26.2 (Feb. 2018), 379–393. ISSN: 2329-9290. DOI: 10.1109/TASLP.2017.2778423.

[19] A. Mesaros, T. Heittola and T. Virtanen. TUT database for acoustic scene classification and sound event detection. 2016 24th European Signal Processing Conference (EUSIPCO). IEEE, 2016, 1128–1132.

[20] A. Mesaros, T. Heittola and T. Virtanen. A multi-device dataset for urban acoustic scene classification. Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018). Nov. 2018, 9–13. URL: https://arxiv.org/abs/1807.09840.

[21] A. Pillos, K. Alghamidi, N. Alzamel, V. Pavlov and S. Machanavajhala. A real-time environmental sound recognition system for the Android OS. Proceedings of Detection and Classification of Acoustic Scenes and Events (2016).

[22] H. Purwins, B. Li, T. Virtanen, J. Schlüter, S.-Y. Chang and T. Sainath. Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing 13.2 (2019), 206–219.

[23] S. Ravindran and D. V. Anderson. Audio classification and scene recognition and for hearing aids. 2005 IEEE International Symposium on Circuits and Systems. IEEE, 2005, 860–863.

[24] S. Russell, P. Norvig et al. Artificial Intelligence: A Modern Approach. Pearson Education Limited, 2013.

[25] D. Sarkar. A Comprehensive Hands-on Guide to Transfer Learning with Real-World Applications in Deep Learning. 2018.

[26] H. Tang, A. M. Scaife and J. Leahy. Transfer learning for radio galaxy classification. Monthly Notices of the Royal Astronomical Society 488.3 (2019), 3358–3375.

[27] L. Torrey and J. Shavlik. Transfer learning. Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques. IGI Global, 2010, 242–264.

[28] T. Virtanen, M. D. Plumbley and D. Ellis. Computational Analysis of Sound Scenes and Events. Springer, 2018.

[29] W. Wang. Machine Audition: Principles, Algorithms, and Systems. IGI Global, 2011.

[30] J. West, D. Ventura and S. Warnick. Spring research presentation: A theoretical foundation for inductive transfer. Brigham Young University, College of Physical and Mathematical Sciences 1.08 (2007).

[31] J. Yosinski, J. Clune, Y. Bengio and H. Lipson. How transferable are features in deep neural networks? Advances in Neural Information Processing Systems. 2014, 3320–3328.

[32] J. Zupan. Introduction to artificial neural network (ANN) methods: what they are and how to use them. Acta Chimica Slovenica 41 (1994), 327–327.
