
Jan Berg

CONTINUAL LEARNING IN AUTOMATED AUDIO CAPTIONING

Faculty of Information Technology and Communication Sciences
M.Sc. Thesis
November 2021


Jan Berg: Continual Learning in Automated Audio Captioning
M.Sc. Thesis
Tampere University
Master’s Degree Programme in Computer Science
November 2021

Teaching neural network models to classify new tasks, or old tasks on new domains, is a process where a common problem is the forgetting of previously learned tasks and/or domains. This problem is referred to as catastrophic forgetting. Continual Learning, sometimes also called Lifelong or Incremental Learning, is a research field that aims to find solutions to catastrophic forgetting in order to create models which can learn new tasks sequentially without the need for full retraining. As neural networks and machine learning in general have gained plenty of interest during the last decade, the need for such continual learning methods has become apparent to many developers and researchers. Thus, continual learning has become quite a hot topic within neural network research, and many new methods have been introduced during the past five years.

Audio Captioning is a problem domain where the goal is to generate a textual representation of audio data, effectively describing what is heard in the audio. Most current state-of-the-art Audio Captioning models are based on encoder-decoder structures, with more recent models built on the attention-based Transformer model, which has gained huge popularity and achieved state-of-the-art results in many different task domains, such as Neural Machine Translation.

This thesis presents the basic concepts of neural networks as well as of Continual Learning. Different types of Continual Learning methods are presented, with a short overview of the most popular methods.

Furthermore, this thesis presents the first study of Continual Learning in Audio Captioning, using the Learning Without Forgetting approach. The approach alleviated catastrophic forgetting to some degree while further training an Audio Captioning model, Wavetransformer.

Key words and terms: machine learning, neural networks, continual learning, lifelong learning, automated audio captioning

The originality of this thesis has been checked using the Turnitin OriginalityCheck service.


1 Introduction ... 1

2 Artificial Neural Networks ... 3

2.1 Artificial Neurons 3

2.2 Activation Functions 4

2.3 Feed-forward neural networks 6

2.4 Training, backpropagation and loss functions 8

2.5 Convolutional Neural Networks 9

2.6 Recurrent Neural Networks 15

2.7 Sequence to Sequence, Attention and Transformers 16

2.8 Network initialization, data pre-processing and regularization 19

3 Continual Learning ... 21

3.1 Definition and motivation 21

3.2 Catastrophic Forgetting 22

3.3 Continual Learning methods 24

3.3.1 Regularization based approaches 25

3.3.2 Rehearsal based approaches 27

3.3.3 Dynamic Architectures 28

3.3.4 Combination-based approaches 29

3.4 Evaluation of Continual Learning 30

4 Automated Audio Captioning ... 32

4.1 Audio data pre-processing and feature extraction 33

4.2 Evaluation of Audio Captioning 36

5 Regularization based Continual Learning method for Audio Captioning ... 38

5.1 Learning Without Forgetting 38

5.1.1 Knowledge Distillation 38

5.1.2 Learning Without Forgetting for Audio Captioning 40

6 Evaluation ... 45

6.1 Wavetransformer model 45

6.2 Datasets and pre-processing 47

6.3 Training, hyper-parameters and evaluation 48

7 Results ... 50

8 Conclusion ... 53


1 Introduction

In recent years, research on and usage of machine learning and neural networks have grown into a hot topic. While the methods in use have been progressing fast, providing widely used models and methods for many different problem domains, one common issue with neural networks has been the degradation of accuracy on older tasks when a previously trained model is further trained on new data. This phenomenon is called catastrophic forgetting [30][57][23]. However, as in real-world scenarios the distribution of data in most domains changes over time, the need to continually adapt to these changes becomes apparent. In many cases the current go-to approach for circumventing catastrophic forgetting is to simply retrain the whole model from scratch using more up-to-date and complete datasets accumulated over time [23]. There are, however, a few apparent problems with this approach. First, training neural network models can be extremely time consuming, with ever-expanding datasets only making the problem bigger over time. Second, it is not given that the dataset(s) used to retrain a model have samples from all the domains/tasks the previous version of the model had knowledge of. This can lead to loss of knowledge after retraining, depending on the datasets used. Considering these facts, the ability to adapt autonomously to changes without complete retraining becomes a very desirable trait for a model. This need forms the main motivation of Continual Learning.

During recent years Continual Learning has gained a lot of interest among researchers, and a plethora of methods have been presented that try to alleviate the effects of catastrophic forgetting. While the research on Continual Learning has progressed and methods that are able to combat catastrophic forgetting to some degree have been proposed, the field could still be considered to be in its infancy. This becomes apparent when looking at many of the most popular ways of training neural network models, where Continual Learning methods are still rarely used, thus leaving Continual Learning and its methods mostly to the context of research specifically targeted at Continual Learning. One reason might be that the widely used frameworks like PyTorch and TensorFlow do not currently seem to provide Continual Learning methods to developers. There are some signs of this changing, though: for example, this year (2021) the first library specifically aimed at providing easy-to-use implementations of Continual Learning methods, Avalanche [54], was released. Thus, in the near future Continual Learning might become more accessible and less intimidating for a wide range of developers.


This thesis provides basic insight into Continual Learning and related methods, and presents the first case study of Continual Learning in the context of Automated Audio Captioning (AAC) [9].

Automated Audio Captioning [26] is a fairly recently presented task for neural networks, where the goal is to generate a textual representation (i.e. a caption) of an input audio. The task is highly related to Neural Machine Translation (NMT) [8] and Image Captioning [82], and the state-of-the-art models follow encoder-decoder architectures similar to many of the NMT and Image Captioning models. A caption is simply a generated sentence that tries to explain what is present and happening in a given audio sequence. Thus, the main difference to speech-to-text is that the goal is not to render a spoken sentence in text, but to describe the scene instead. In this sense, given a spoken sentence as input, the output of an AAC model could be something like “a person is speaking”, instead of the spoken sentence in textual form.

The structure of the thesis is as follows. First, in Chapter 2, the basic theory of neural networks is presented. Chapter 3 describes the theory as well as the motivation behind Continual Learning; some Continual Learning methods are also presented. Automated Audio Captioning and the related concepts of audio features used in neural networks for audio processing are presented in Chapter 4. Chapters 5 through 7 present a continual learning method called Learning Without Forgetting [49] applied to an AAC model, Wavetransformer [79], and the related results evaluated using two different AAC datasets, Clotho [25] and AudioCaps [42]. Finally, conclusions are presented in Chapter 8.


2 Artificial Neural Networks

Artificial Neural Networks (ANN) are a type of machine learning method drawing inspiration from the learning abilities of human brains and neurons [34]. The basic principle of ANNs is based on computational units often called neurons or nodes [34][77][21]. In its simplest form an ANN model can consist of only a single node [68]; however, modern ANN models usually consist of multiple nodes grouped in multiple layers, where the connections between nodes are called weights [34][77]. This effectively results in a potentially huge and complex graph-like structure consisting of multiple input-output layers, which forms the ANN model. Often these multi-layered models are referred to as deep neural networks (DNN), which account for many of the current state-of-the-art ANN models [70].

Ultimately, the neural network model learns to approximate a function that represents a given task [21]. This learning is achieved by iteratively finding weight values (using gradient descent together with backpropagation) that form the desired function approximation (i.e. accuracy of classification) [34][21].

As ANNs excel at learning patterns within data, during recent years ANN-based machine learning models have become extremely popular and are often used in different kinds of classification tasks that depend on a high number of variables and their relationships (i.e. patterns) [21]. A few fields where ANN applications have been very successful are medical sciences [5], natural language processing (e.g. machine translation) [32] and image classification [19].

This chapter presents the basics of neural networks and their training, as well as a few popular types of neural networks.

2.1 Artificial Neurons

To understand how ANNs work, it is imperative to first understand how the basic units of ANNs, artificial neurons or nodes, work. Simply put, a neuron is a unit that can be considered to consist of three distinct processes: multiplication of inputs (with weights), summing, and activation [44]. As shown in Figure 1, a neuron takes a set of N inputs (and often an additional bias term) and calculates the weighted sum of said inputs using the weight parameters, which is then fed as input to the activation function that generates the node’s output [34].


Figure 1: A single neuron, where X are the inputs and w the corresponding weight values. The weighted sum of the inputs is fed to an activation function that produces the neuron’s output Y.

Considering the points above, given inputs $x_1, x_2, x_3, \ldots, x_n$ with corresponding weight values $w_1, w_2, w_3, \ldots, w_n$, bias $b$, and an activation function $f$, the output of a neuron is as follows [34][21]:

$$f(x, w) = f\left(\sum_{i=1}^{n} x_i w_i + b\right) \tag{2.1}$$
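To make Equation 2.1 concrete, the snippet below computes a single neuron’s output in Python; a minimal sketch where the input, weight and bias values are arbitrary examples, and a sigmoid (introduced in Section 2.2) stands in for a generic activation f.

```python
import numpy as np

def neuron(x, w, b, f):
    """Output of a single neuron: activation of the weighted input sum plus bias."""
    return f(np.dot(x, w) + b)

x = np.array([0.5, -1.0, 2.0])   # example inputs x_1..x_n
w = np.array([0.1, 0.4, -0.2])   # corresponding weights w_1..w_n
b = 0.05                         # bias term

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
print(neuron(x, w, b, sigmoid))  # a single scalar output
```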

2.2 Activation Functions

The previous section explained that a neuron’s output is the result of applying the activation function to the weighted sum of the inputs; one might then wonder what the activation function itself looks like.

Starting from the earliest Perceptron networks [68] (essentially single-node networks), a simple step function producing a binary output, usually either (0, 1) or (-1, 1), was often used as the activation function [34][21]. This means that a node has two possible outputs, determined by whether the weighted sum of the node’s inputs exceeds a given threshold (usually 0). Equation 2.2 shows a step function with outputs (-1, 1) and a threshold of 0. [34]

$$f(x) = \begin{cases} 1, & \text{if } x \geq 0 \\ -1, & \text{if } x < 0 \end{cases} \tag{2.2}$$


However, when it comes to modern ANN models, threshold functions are rarely used.

Instead, in order to add non-linearity to the function that the ANN ultimately tries to approximate, other non-linear activation functions are used. Moreover, the process of backpropagation (see Section 2.4) used to update weight values during training requires that the activation functions used are differentiable. One example of such an activation function is the sigmoid function (Equation 2.3), which limits the output of the neuron to the range of 0 to 1. [34][77]

$$f(x) = \frac{1}{1 + e^{-x}} \tag{2.3}$$

While the sigmoid was quite widely used in ANN models before, it has now mostly been substituted with ReLU (Rectified Linear Unit, Equation 2.4), which has been shown to achieve better results than the sigmoid while also being computationally more lightweight. Thus, many current ANN models rely on ReLU as their activation function. As shown in Figure 3, ReLU produces a nearly linear activation where the output is equal to the total input, except for negative inputs, which are set to output 0. While popular and effective, ReLU also has its share of problems. Namely, ReLU is only differentiable when x is larger than zero. Because of this, there is a possibility of a vanishing gradient when the input to the activation is x ≤ 0. To combat this issue, a method called Leaky-ReLU (Equation 2.5) has been used, which produces extremely small outputs for negative inputs instead of zero, enabling further training of the node even when the input is negative. [62]

Figure 2: Sigmoid function.


$$f(x) = \max(0, x) \tag{2.4}$$

$$f(x) = \begin{cases} x, & \text{if } x > 0 \\ 0.01x, & \text{otherwise} \end{cases} \tag{2.5}$$
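The four activation functions above can be written out directly; a small sketch with NumPy, evaluated on a handful of arbitrary example inputs.

```python
import numpy as np

def step(x):          # Equation 2.2: binary threshold at 0
    return np.where(x >= 0, 1.0, -1.0)

def sigmoid(x):       # Equation 2.3: squashes the output into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):          # Equation 2.4: identity for positive inputs, zero otherwise
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):  # Equation 2.5: small negative slope avoids dead units
    return np.where(x > 0, x, slope * x)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (step, sigmoid, relu, leaky_relu):
    print(f.__name__, f(z))
```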

Besides the activation functions presented above, there exists a plethora of other activation functions, such as Tanh, Softsign and further variants of ReLU [62].

Furthermore, during recent years there has been some interest in the development of learnable activation functions. A learnable activation function essentially means that instead of using a pre-defined activation function with a clear, known shape, the activation function is learned and “formed” during training. In some studies, parametrized learnable activation functions have shown improvement over classic activation functions. Moreover, Apicella [7] argues that learnable activation functions can have similar effects to adding layers to a network, thus potentially making learnable activation functions a way to reduce the amount of needed neural resources.

2.3 Feed-forward neural networks

While a single node is itself a computational unit capable of some degree of linearly separable classification, modern ANN models usually consist of a (sometimes huge [51]) number of connected nodes separated into multiple layers, referred to as feed-forward neural networks (FFNN). FFNNs are sometimes also referred to as multilayer perceptrons (MLP). [34][77]

Figure 3: ReLU activation function
Figure 4: Leaky-ReLU activation function


FFNNs consist of three different types of layers, namely the input layer, the output layer and the hidden layers, where data moves in sequential fashion through the layers and their activations. As the name suggests, the input layer handles the initial input to the network, thus being the first layer in the network. The output layer is the last layer of the network and generates the final output of the network. This means that most of the layers in an ANN model are usually the intermediate hidden layers. Figure 5 shows a simple (fully connected) FFNN with an input layer, two hidden layers and an output layer. [34][75]

The hidden layers are vital to the performance of modern ANN models, as they are used to extract detailed features present within the input data, which are ultimately used by the output layer to determine the output. As hidden layers extract features, this naturally means that consecutive layers further extract features from extracted features, leading to the network capturing finer details within the data in the later layers. The number of hidden layers and nodes needed to build a good model is dependent on the task. [21]

Figure 5: FFNN with an input layer followed by two hidden layers and an output layer. The forward pass moves sequentially, layer after layer from left to right.

The FFNN is formed by connecting the layers sequentially, with the weights representing the strengths of the connections. Each layer takes as its input the output of the previous layer. More specifically, in a fully connected network (most ANN models fall into this category), every node in a layer receives the activation output of every single node in the previous layer, with a separate weight for each of those nodes. This forms a series of sequential activations that flow through the network, where the final activations in the output layer are considered the output of the network. This flow of input through the network to form the output is called the forward pass. [34][77]
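As an illustration, a fully connected FFNN with the structure of Figure 5 can be written in a few lines of PyTorch; the layer sizes below are arbitrary example values.

```python
import torch
import torch.nn as nn

# A fully connected FFNN matching Figure 5:
# an input layer, two hidden layers and an output layer.
model = nn.Sequential(
    nn.Linear(4, 8),   # input layer -> first hidden layer
    nn.ReLU(),
    nn.Linear(8, 8),   # first hidden layer -> second hidden layer
    nn.ReLU(),
    nn.Linear(8, 3),   # second hidden layer -> output layer
)

x = torch.randn(1, 4)  # a single input sample with 4 features
y = model(x)           # forward pass: activations flow layer by layer
print(y.shape)         # torch.Size([1, 3])
```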

It is worth noting that FFNNs consisting of multiple hidden layers are often referred to as deep neural networks (DNN) [27]. In this thesis, however, the more general term ANN is used, although DNN is another widely used term in the related literature, as most modern ANN models fall into the category of DNNs.

2.4 Training, backpropagation and loss functions

During training, the goal is to adjust the weight values so that the network generates output that follows the desired distribution (a correct classification, for example). In the case of teaching a model in a supervised manner, the desired output is the annotated ground-truth value, i.e. the pre-determined correct output value for the input, which is part of the training dataset as the label of the data sample. [34]

To find such weight values, a process called backpropagation is used, which essentially tries to minimize the cost function [34]. The cost function is often referred to as the loss function [40], and as such this thesis will use the latter term. The loss function calculates the difference between the ground-truth value and the output of the network; thus its output, the loss, represents the error in the generated output. As the loss function calculates the error, the optimal performance of the model is achieved when the loss, and thus the error in the output, is minimized. Minimizing a function can be done using differentiation, by calculating the gradients and using gradient descent to find the lowest possible value of the function. This is the core of backpropagation. [34][21]

During the backpropagation phase of the training process, weights that produce an output with the lowest possible loss are searched for by calculating partial derivatives of the loss function with respect to each weight parameter in the network. This way the contribution of each weight to the loss can be evaluated, and by shifting the weight values towards the calculated negative gradient, the performance of the network moves closer to the loss function’s minima (i.e. it produces more accurate outputs) [34][21]. As the goal is to obtain weights with which the network can correctly predict outputs for inputs with widely varying distributions (this is called generalization), training is usually done in batches, so that the outputs for multiple different input samples are used to calculate the loss at the same time, optimally resulting in the network’s performance growing more accurate on every domain present in the input data samples [34][21]. This can be done, for example, by calculating the average loss for a collection of samples and using that for the backpropagation. Furthermore, in order to achieve a more stable learning process, a hyper-parameter called the learning rate is utilized, which regularizes the strength of the weight adjustment, usually by using some fraction of the calculated gradient instead of the raw gradient value [34][21]. This is mainly because changing weight values too drastically can cause the loss to diverge (i.e. move further away from the minima) instead of gradually moving closer to it [34][21].

In order to reach the final acceptable level of accuracy, training is usually done iteratively over multiple epochs. This means that the training dataset is fed through the network multiple times, where a single epoch consists of going through all of the samples once. Every training epoch consists of several forward passes followed by calculation of the loss and backpropagation of the loss in order to adjust the weights. [21]

As with the activation functions for nodes, the loss functions used are pre-defined functions, and there exist many different functions for calculating the loss. One example of a well-known loss function is the Mean-Squared Error function, shown below:

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left(Y_i - \hat{Y}_i\right)^2$$
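The training loop below sketches these steps in PyTorch on random toy data: a forward pass over a batch, an MSE loss against the ground truth, backpropagation, and a gradient-descent weight update scaled by the learning rate. All sizes and hyper-parameter values are arbitrary examples.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
loss_fn = nn.MSELoss()                                     # Mean-Squared Error from above
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # learning rate scales each step

# toy dataset: 32 samples with 4 features and a scalar target each
inputs = torch.randn(32, 4)
targets = torch.randn(32, 1)

for epoch in range(10):                  # one epoch = one pass over all samples
    optimizer.zero_grad()                # clear gradients from the previous step
    outputs = model(inputs)              # forward pass (here: a single batch of 32)
    loss = loss_fn(outputs, targets)     # average error between output and ground truth
    loss.backward()                      # backpropagation: d(loss)/d(weight) for every weight
    optimizer.step()                     # shift weights against the gradient
```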

2.5 Convolutional Neural Networks

Convolutional Neural Networks (CNN) are considered to be a subtype of FFNNs [63]. CNNs have seen a lot of success during recent years and have proven to be especially good in the field of image processing, so much so that most of the current state-of-the-art image classification models are based on CNN architectures (combined with Transformers) [20]. However, CNNs are nowadays used not only for image classification, but for example also in audio classification, by convolving over graphical representations of the data (i.e. spectrograms) [35].

While traditional FFNNs consist (often only) of fully connected layers of nodes, CNNs instead usually take advantage of three different types of layers: convolutional layers, pooling layers and fully connected layers [33]. Compared to FFNNs, CNNs reduce the number of parameters within the network by utilizing a learnable process in the form of so-called kernels (sometimes referred to as filters) [63]. Instead of having layers fully connected to each point in the input, the kernels are “slid” over the data, resulting in multiple activations over sets of input values instead of a single activation over the weighted sum of all input values as in FFNNs [63]. This process is called convolution.


Take for example a coloured image with a height and width of 128 pixels: the resulting input matrix would be of size 128 × 128 × 3, assuming the standard three colour channels used in images (i.e. red-green-blue). Using the standard fully connected layers of FFNNs, a single neuron within a layer would require a total of 49 152 connected weights, as each value in the input matrix would be separately connected to said neuron. As layers in ANNs consist of more than one neuron, this would result in a huge number of parameters; even a single layer might require hundreds of thousands of parameters.

With a CNN, however, each layer consists of multiple filters, each of which is of size H × W × D, where H and W correspond to the selected kernel height and width (for example 5 × 5) and D is the depth of the input (i.e. in the case of a 128 × 128 × 3 input, D = 3). This results in a total of H × W × D parameters per filter instead; thus the number of parameters does not depend on the size of the input, but rather on the size of the kernel. As in a CNN the weights are within the filters, the network learns filters each of which focuses on extracting different kinds of characteristics within a picture (e.g. detecting edges). This way, layer by layer, the network extracts a collection of features from the input, which is usually finally fed into an FFNN that handles the final classification based on these features. [63][33]

More specifically, the convolution process consists of the filter sliding over the input matrix, calculating the sum of the element-wise multiplication between the filter and the input at each position. This results in a matrix of convolved values, to each of which an activation function is then applied. The result is always a 2D matrix, which is called a feature map and is the output of the filter [63]. This process is shown in Figure 7.

Figure 6: One of the earliest CNN models, LeNet-5, proposed by LeCun et al. (Gradient-Based Learning Applied to Document Recognition)


Figure 7: Convolution applied to a 4 × 4 × 1 input using a 3 × 3 × 1 kernel, with stride = 1. Note how the output has reduced dimensions, 2 × 2 × 1, because no zero-padding is applied. The kernel multiplies the values in its current position and sums them to determine the output value for that position.

Each convolutional layer consists of N filters, and the output of each layer is a set of N feature maps. When these feature maps are stacked, this can be thought of as an image-like output with depth equal to the number of filters. The subsequent layers apply their convolutions over the stack of feature maps of the previous layer. This results in each layer (and filter) learning to extract different features present within the input data, with lower layers (layers closer to the input) of the CNN learning more general information (e.g. edges and curves), while deeper layers learn increasingly abstract features [33].

There are a few other variables in the convolution process regarding the kernel size as well as the way the kernel is slid over the matrix. Namely, the stride determines how the kernel is moved over the matrix; more specifically, the stride determines how many “steps” the kernel takes each time it is moved. Using a larger stride results in less overlap between kernel applications and also reduces the dimensions of the resulting feature maps. Furthermore, sometimes zero-padding is utilized, where the border of the input matrix is padded in order to control the size of the feature map. In many cases this is used to produce feature maps that preserve the original height and width of the input. Figure 8 presents the application of zero-padding, and in Figure 9 stride is shown combined with zero-padding. [63]


Figure 8: Grey zeroes represent the zero-padding. By utilizing zero-padding, the dimensions of the output can be adjusted. In this case the original 4 × 4 dimensions are preserved.

Figure 9: Zero-padding and stride = 2 applied to the convolution. The red dots represent the positions the kernel is centered on during convolution; effectively, stride = 2 moves the kernel two steps each time. Stride > 1 results in reduced dimensions in the output feature map.

Furthermore, in order to increase the receptive field of the kernels (i.e. the size of the area the filter is applied to), dilation is sometimes introduced to the filter. Dilation creates a kind of kernel with “holes”, as it introduces steps within the kernel. For example, using a 3 × 3 kernel with a dilation of 1 results in a receptive field of size 3 × 3, whereas dilation = 2 results in a receptive field of 5 × 5. This has been found to be an effective way to increase the accuracy of the network in some cases. [33][48]
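The effect of kernel size, zero-padding, stride and dilation on the output dimensions can be verified directly with PyTorch; the channel counts below are arbitrary examples, and the shapes in the comments follow the standard convolution size formula.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 128, 128)  # one 128 x 128 RGB image (depth D = 3)

# 5 x 5 kernel, no padding: the output shrinks to 124 x 124
print(nn.Conv2d(3, 16, kernel_size=5)(x).shape)                       # [1, 16, 124, 124]

# zero-padding of 2 preserves the 128 x 128 spatial size
print(nn.Conv2d(3, 16, kernel_size=5, padding=2)(x).shape)            # [1, 16, 128, 128]

# stride = 2 halves the spatial dimensions
print(nn.Conv2d(3, 16, kernel_size=5, padding=2, stride=2)(x).shape)  # [1, 16, 64, 64]

# dilation = 2 widens a 3 x 3 kernel's receptive field to 5 x 5
print(nn.Conv2d(3, 16, kernel_size=3, dilation=2)(x).shape)           # [1, 16, 124, 124]
```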


Figure 10: Dilated convolution and the receptive field (blue) size using a 3 × 3 kernel with a dilation factor of 2 (red).

In CNN architectures, pooling layers are also commonly used between convolutional layers [63][33]. Pooling reduces the size of feature maps by combining the neighbouring inputs within a given area, where the result is used to represent that specific area. This way, pooling reduces computational complexity (as the resulting feature maps have reduced dimensions) and has also been found to be beneficial for regularizing the network. Pooling also introduces shift-invariance (i.e. it reduces the effect of slight changes, for example in the position of the input) [33][84]. Common ways to perform the pooling operation are max-pooling, where the maximum value inside the given area is taken, and average-pooling, where the average inside the area is taken [33]. Furthermore, the size of the pooling filter is often 2 × 2, but other window sizes can also be utilized. The stride is also usually set so that the pooling windows do not overlap [63].
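A small PyTorch illustration of both pooling variants, applied to a hand-written 4 × 4 feature map with non-overlapping 2 × 2 windows:

```python
import torch
import torch.nn as nn

fmap = torch.tensor([[[[1., 3., 2., 0.],
                       [4., 6., 1., 2.],
                       [0., 2., 5., 7.],
                       [1., 1., 3., 4.]]]])  # one 4 x 4 feature map

# 2 x 2 windows; PyTorch defaults the stride to the kernel size, so windows don't overlap
print(nn.MaxPool2d(kernel_size=2)(fmap))     # [[6., 2.], [2., 7.]]  - maximum per window
print(nn.AvgPool2d(kernel_size=2)(fmap))     # [[3.5, 1.25], [1., 4.75]] - average per window
```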


After the convolutional layers, a number of fully connected layers are often used to detect patterns within the extracted features and to finally generate the output of the whole network. This is similar to the pattern recognition within general FFNNs. While fully connected layers are usually used as the final layers of the network, Gu et al. [33] point out that they can be substituted by convolutional layers using 1 × 1 kernels.

Besides the general design of CNNs as described above, there exist some variations of the convolution. Namely, in the context of this thesis, it is important to know the concepts of 1D CNNs and depthwise separable convolutions. The major difference of a 1D CNN compared to usual CNNs is that, as the name implies, a 1D CNN deals with one-dimensional data. In 1D CNNs, instead of using a sliding window of height × width, a filter is only given a width, and the filter is slid over one dimension. 1D CNNs can be useful when dealing with time-series data (e.g. sound), where the convolution can capture patterns within time frames. In a sense, one-dimensional convolution over a 2D representation of 1D data can be done by applying a filter whose height equals the height of the 2D input. [38]

Figure 11: Pooling using a 2 × 2 kernel indicated by the red square. The upper result demonstrates the more common max-pooling and the lower shows average-pooling.

With depthwise separable convolutions, the convolution process is essentially split into two separate steps. Instead of performing the convolution with a filter that spans the channels (depth) of the data, the convolution is first done separately on each channel, using D filters with the same height and width. Afterwards a pointwise convolution is done, which means that a 1 × 1 × D convolutional filter is applied, effectively applying the convolution across the channels (depth). Thus, depthwise separable convolution resembles normal convolution, but is done in two separate parts. [38]
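A sketch of a depthwise separable convolution in PyTorch; the depthwise step is expressed here with the groups argument (one filter per input channel), and the channel counts are arbitrary examples.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 128, 128)  # input with D = 3 channels

# depthwise step: groups=3 applies one 3 x 3 filter to each input channel separately
depthwise = nn.Conv2d(3, 3, kernel_size=3, padding=1, groups=3)
# pointwise step: a 1 x 1 x D convolution mixes information across the channels
pointwise = nn.Conv2d(3, 16, kernel_size=1)

out = pointwise(depthwise(x))
print(out.shape)  # [1, 16, 128, 128]
```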

2.6 Recurrent Neural Networks

Whereas CNNs can be considered a subtype of FFNNs, recurrent neural networks (RNN) are considered their own type of ANN method [69]. RNN architectures are designed to analyse data in which the order of the data is of interest (i.e. time series). A fine example of a domain where RNNs have proven to be effective is natural language processing (NLP), as in language the time frame in which a word is encountered (i.e. the order in which the words appear) is of high importance for understanding the language [50].

Typically, RNN models achieve this ability to analyse sequential data by introducing recurrent, or recursive, connections between neurons. Often these recurrent connections are made so that a neuron has a connection to itself, thus forwarding its output to itself. These recurrent connections are used to preserve data from previous points within a time series, as the input to a given neuron is both an input from the time series and the output of the neuron given the previous input of the time series. The hidden state $h_t$ of a node in an RNN can be expressed as a function $f$ of the input $x_t$ and the previous hidden state $h_{t-1}$ [50]:

$$h_t = \begin{cases} 0, & \text{when } t = 0 \\ f(h_{t-1}, x_t), & \text{otherwise} \end{cases} \tag{2.6}$$
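Equation 2.6 corresponds directly to the loop below, sketched with PyTorch's built-in RNN cell; the sequence length and feature sizes are arbitrary examples.

```python
import torch
import torch.nn as nn

cell = nn.RNNCell(input_size=4, hidden_size=8)  # f(h_{t-1}, x_t) with learned weights
sequence = torch.randn(10, 4)                   # 10 time steps, 4 features each

h = torch.zeros(1, 8)                           # h_0 = 0, as in Equation 2.6
for x_t in sequence:
    h = cell(x_t.unsqueeze(0), h)               # each step sees the input and the previous state
print(h.shape)                                  # the final hidden state summarizes the sequence
```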

While generally effective, the main issue with RNN models is that they can be quite hard to train effectively. This has mainly been seen as an issue resulting from two related phenomena, called the vanishing gradient and the exploding gradient, that can occur when the input sequence gets longer [29][65]. With a vanishing gradient, the gradient gets closer to zero as backpropagation through time (BPTT) progresses backwards within the sequence. Similarly, with an exploding gradient, the gradients can grow exponentially and quickly reach extremely high values during BPTT.

Figure 12: Recurrent connections within an RNN. On the left, the self-connection is shown. On the right, the same network is “rolled open” given inputs X_t.

Some variant RNN architectures, such as Long Short-Term Memory (LSTM) [37] and the Gated Recurrent Unit (GRU) [17], have been introduced in an attempt to alleviate the vanishing/exploding gradient problem, and both have been found to be quite effective. Thus, many modern RNN models prefer methods like LSTM and GRU instead of the most general RNN architecture.

2.7 Sequence to Sequence, Attention and Transformers

Google introduced a sequence-to-sequence learning model in 2014 [74] that was used to map an input sequence to another sequence. The main idea in the architecture is to use two LSTM networks, where the first one encodes the input sequence (i.e. extracts features) and the latter decodes the encoded output of the first network (i.e. constructs a sequence from the encoded data) [74]. This kind of architecture is also often called an encoder-decoder, as per the structure of the network. One of the most prominent usages of this kind of architecture is machine translation, where the input is a sentence in some language, and the network outputs the translation of the sentence into another language [18].

Figure 13: Encoder-Decoder model by Sutskever et al. [74].

However, as the encoder sequentially processes the inputs of a sequence, each combined with the activation output for the previous inputs in the common RNN fashion, the model was not very efficient at translating long sequences. This can be explained by the fact that as the input sequence gets longer, the later activations carry less and less information about the earliest activations, as the whole sequence is sequentially encoded into a single output [8][13]. This also results in the later steps of the encoder having more effect on the final encoding. A solution to these problems was introduced later in [8] in the form of attention.

Attention mechanisms in ANN models are methods used to introduce context to models and the ability to focus on more relevant parts of the input [13]. To address the problem explained above, attention considers every single hidden state of the encoder during the final encoding of the sequence [8]. Furthermore, during the decoding phase the network uses additional learned attention weights to generate a context vector, which imposes an additional weighting on the decoding process, assigning different amounts of importance (hence, attention) to different parts of the encoded sequence [13].

Take for example the sentence “dog is big”. In this case, the more informative words for any given classifier would be the word “dog” and the adjective “big” describing the dog. Thus, giving more weight to these two words (paying attention to them) would be preferable. This way attention can be used to a) retain more information throughout the input sequence (as all hidden states are considered) and b) have contextual weight values that pay more attention to the more relevant parts of the input sequence [13].

Figure 14: Attention as demonstrated in [8]. Here a bidirectional RNN is used (i.e. the input sequence is read twice, once from each direction). For each input, the concatenation of the forward and backward hidden states is regarded as the encoder’s output for the respective input. As can be seen, all hidden states (h_1, h_2 ... h_t) contribute to the output of the encoder, weighted by (a_{t,1}, a_{t,2} ... a_{t,T}). Without attention, only the hidden state of the last activation within the RNN is considered, as seen in Figure 13.


Later, the Transformer architecture was introduced by Vaswani et al. [80]. Transformers are a type of encoder-decoder style ANN model that has provided state-of-the-art performance in translation tasks [86]. What makes Transformers especially interesting is that while they are effective in NLP, a field highly dependent on time series, they do not employ RNN structures at all. Instead, Transformers rely on self-attention mechanisms, more specifically on multi-head self-attention [80]. Self-attention is a type of attention that is used to capture relationships within the same sequence, thus “attending to itself” [80]. Transformers use this concept to capture the relevance of each input in a sequence in relation to the other inputs. In addition, during the self-attention process masking is used in order to prevent a given input in a sequence from “peeking” ahead during training of the decoder (i.e. all inputs to the right of a given input in the sequence are ignored) [80].

The multi-head part of the self-attention means that multiple self-attention functions are performed in parallel, and their concatenation is used as the total attention. This allows the attention layer to attend to different informational representations of the inputs, and this process was found to improve the performance of the model [80].

Furthermore, as no RNN layers are used, [80] added positional information to the input sequence in the form of positional encoding, where a positional information vector is summed to each initial input of an input sequence. This way, the order of the sequence can be taken into account even without using RNNs.
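A minimal sketch of single-head masked (causal) self-attention as described above; the projection matrices here are random stand-ins for the learned weights, and multi-head attention would run several such functions in parallel and concatenate their outputs.

```python
import torch
import torch.nn.functional as F

def masked_self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a sequence x of shape (T, d).
    The causal mask prevents each position from "peeking" at positions to its right."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / (k.shape[-1] ** 0.5)   # relevance of every position to every other
    mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()
    scores = scores.masked_fill(mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)       # attention weights sum to 1 per position
    return weights @ v                        # attention-weighted sum of the values

T, d = 5, 16
x = torch.randn(T, d)                         # a toy input sequence
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
print(masked_self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([5, 16])
```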

Figure 15: Transformer [80]. Encoder-decoder network for sequential data, using only FFNN and Multi-Head Attention. Sequential information within input is retained using Positional Encoding.


The Transformer is perhaps one of the most important developments of the last few years, showing promising results in many tasks including image captioning [20], neural machine translation [86] and audio captioning [79].

2.8 Network initialization, data pre-processing and regularization

When it comes to using the concepts explained above, there are other variables to consider when designing efficient ANN models, especially when it comes to training them. A few important things to note are the initialization of weights, the pre-processing of data and additional regularization during training. These three are not the only approaches to more robust training of a model; there are other methods often utilized in training, as well as different approaches to determining suitable architectures for a network, for example. However, as more advanced usage of ANNs would go beyond the scope of this thesis, these are not considered here.

As explained, the learning within ANN models depends on the weight values that are adjusted during training. One might correctly assume that there needs to be some way to set the initial state of the network (i.e. the weight values) before starting the training. In fact, the initial weight values affect the final performance of the model and the time it takes to train it, making proper weight initialization a very important task when training ANN models [59]. There exist many different ways to achieve this (Gaussian, Xavier and He initializations, to name a few) [59][45], most of which are based on different kinds of random initialization, with different upper and lower limits on the set weight values.

Furthermore, it is possible to essentially “initialize” a network using previously learned weight values, thus using a pre-trained model as the base model. Initializing a model this way can result in improvements in accuracy as well as reduced training time, depending on the task and domains at hand. Training over a pre-trained model is often referred to as fine-tuning or transfer learning (e.g. [76]).

As ANN models learn patterns from within the given training data, a common problem when training networks is overfitting. This happens when the network effectively learns to memorize the dataset instead of learning general patterns and their corresponding labels, leading to good results on the training data while underperforming on other data. [34][46]

An example of a commonly used way to reduce overfitting during training is regularization. In the context of neural networks, “regularization” usually refers to slightly altering the learning algorithm of the model, usually in order to help the model generalize better. One commonly used regularization technique is dropout. Dropout simply adds noise to the network by temporarily removing nodes during training. Often the nodes to be “dropped” during training are selected at random, with a parameter controlling the number of nodes to be ignored in a layer (e.g. 50% of the nodes in a layer are temporarily dropped). [73]
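A short PyTorch illustration of dropout; note that PyTorch scales the surviving activations by 1/(1-p) during training so that the expected output stays unchanged, and disables dropout entirely in evaluation mode.

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)  # each node's output is zeroed with probability 0.5
x = torch.ones(1, 8)

drop.train()              # training mode: random nodes are "dropped"
print(drop(x))            # roughly half the entries are 0, the rest scaled to 2.0

drop.eval()               # evaluation mode: dropout is disabled
print(drop(x))            # all ones again
```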

Another commonly used technique when training ANN models is data pre-processing, which has been found to help a model generalize better, and sometimes also to reduce the number of required parameters by reducing the input dimensions, resulting in more accurate models as well as less computational cost during training. Pre-processing data is one of the most important tasks in enhancing the quality of networks, and there are many different approaches to pre-processing any given kind of data depending on the situation. In this thesis, pre-processing will not be discussed in depth; instead, Section 4.1 presents the pre-processing, or feature extraction, process related to audio features. [10][61]


3 Continual Learning

3.1 Definition and motivation

While the concepts, methods and development of Continual Learning have gained much more attention during the last few years, the field of Continual Learning is still in its infancy. While ANN models excel at adapting to existing datasets and detecting patterns, in real-life scenarios the probability distributions for any given domain or task are rarely, if ever, static. Furthermore, in classification tasks it is not rare that new classes for a given task become available over time. The traditional way of training ANN models, batch training using complete datasets for a task, suffers from degradation of performance on old tasks and domains when tasks are trained sequentially. This phenomenon is called catastrophic forgetting [30][57], and it is a major challenge whose solution is the main objective of continual learning [23][60][64][16]. In order to implement systems that are more autonomous and can adapt to domain shift and the introduction of new tasks, the traditional training paradigm is not sufficient (due to catastrophic forgetting); thus the need for continual learning methods is apparent.

Defining what constitutes Continual Learning can in some cases be difficult, and as the field itself is rather recent, no formal definition exists. However, one good example of a definition of continual learning is presented in the book “Lifelong Machine Learning” by Chen and Liu [16]:

Lifelong machine learning (LML) is a continuous learning process. At any time point, the learner has performed a sequence of N learning tasks, $T_1, T_2, T_3, \ldots, T_N$. These tasks, which are also called the previous tasks, have their corresponding datasets $D_1, D_2, D_3, \ldots, D_N$. The tasks can be of the same type or different types and from the same domain or different domains. When faced with the (N+1)th task $T_{N+1}$ (which is called the new or current task) with its data $D_{N+1}$, the learner can leverage the past knowledge in the knowledge base (KB) to help learn $T_{N+1}$. The objective of LML is usually to optimize the performance on the new task $T_{N+1}$, but it can optimize on any task by treating the rest of the tasks as the previous tasks. KB maintains the knowledge learned and accumulated from learning the previous tasks. After the completion of learning $T_{N+1}$, KB is updated with the knowledge (e.g., intermediate as well as the final results) gained from learning $T_{N+1}$. The updating can involve consistency checking, reasoning, and metamining of additional higher-level knowledge.

Furthermore, they define five key characteristics for Lifelong Learning:


1. Continuous learning process

2. Knowledge accumulation and maintenance in the KB

3. The ability to use the accumulated past knowledge to help future learning

4. The ability to discover new tasks

5. The ability to learn while working or to learn on the job

Looking at the definition and the five points above, we can see that Continual Learning is strongly defined by having a process that is able to teach a model new tasks or domains in sequential fashion. Effectively, three out of the five points refer to this ability (1, 2 and 5). An interesting thing to note is that, by this definition, not only is Continual Learning a process of learning new tasks sequentially, it is also characterized by learning new tasks faster/better (see point 3) by using previously learned concepts to enhance the learning of new ones. This is similar to the concept called transfer learning, where previously trained knowledge is used to help the learning of a new task.

The definition of Continual Learning itself is not the only ambiguous definition encountered when delving into the published research papers and books related to the field. As we talk about domains and tasks, and any model’s ability to approximate a given domain and/or task, the terminology of domain and task is often loosely defined, or in too many cases not defined at all. One challenge that anyone will most probably run into while researching Continual Learning is that domain and task are in many cases used interchangeably [60]. Thus, one problem we face when talking about Continual Learning is, in fact, the lack of mutual understanding and well-defined terminology. In this thesis, domain and task are also mostly used synonymously, with the usage defined by the context of a given example.

3.2 Catastrophic Forgetting

Catastrophic forgetting is a phenomenon where, when a model is trained on new data to learn new tasks or domains, the performance on the previous tasks degrades; the network thus “forgets” previously learned knowledge. This has been documented in multiple different papers and publications [30][57][23][60][64][16]. As established in 3.1, learning new tasks in succession is part of the definition of continual learning, so we can easily see why solving catastrophic forgetting is one of the main challenges in continual learning.

As explained in Chapter 2, training ANN models consists of feeding data to the model while the weight values are updated depending on the outputs and the ground-truth values of the data. When training a new model, the weight values are usually initialized with random values, and the performance of the original model using randomized weights (i.e. the untrained model) is not considered important. However, if we need to further train a previously trained model, depending on the case we might need to take into account the performance of the pretrained model on any given domain or task it has been trained on.

This leads to a problem where, if no effort is made to preserve the performance of the model (i.e. the state of the model at the beginning of the training), the shifting weight values during training will override the original weight values, which results in significant accuracy losses on the original domain or task the model was trained on, thus forgetting previously known concepts (catastrophic forgetting). One example where this behaviour can be seen is transfer learning, as it only concerns itself with using previous knowledge to help learn new tasks, while not considering the performance on the original tasks as important, effectively treating the original network as a special case of weight initialization [60].

When talking about catastrophic forgetting, it is worth noting that even during initial training “forgetting” occurs, as the initial (though random) performance will not be preserved in any way. This is usually not considered catastrophic forgetting, but in order to have a consistent definition of the phenomenon, the shift that occurs in the randomly initialized weight values should be considered a special, beneficial, case of catastrophic forgetting. This also highlights the fact that in order to adapt to any distribution of data, the ability to forget is a necessity. Thus, forgetting is actually an integral part of learning. Even though this seems obvious, the need for forgetting is rarely explicitly discussed in continual learning research. Allred and Roy briefly discuss this and present a method called “Controlled Forgetting” [4]. This is noteworthy, as when talking about catastrophic forgetting, the true issue is not the forgetting itself, but how, when and what the network forgets during training. Thus, when the goal of researchers is to solve catastrophic forgetting, what probably describes it better is, in fact, the goal of gaining a better understanding of controlled forgetting.

The need for controlled forgetting is, however, implicitly present in many publications in the form of the stability-plasticity dilemma. The stability-plasticity dilemma concerns the problem of determining to what degree a model should be plastic, in order to be able to adjust to changes, and stable, so as not to run into the problem of catastrophic forgetting. This is very close to the problem of controlled forgetting, with the difference that no positive connotations are assigned to forgetting itself. [23][60][64]

Considering the nature of knowledge in real-life domains, in many cases we can witness this beneficial side of neural networks’ ability to forget. As mentioned before, it is rare that a domain “in the wild” is static. That is, the probability distribution of data in many domains changes over time. Take for example a recommendation system for an online shopping service, and assume that the service has been around for 15 years. During that time popular trends have changed, the available items have changed, and it is even possible that a single customer’s interests have changed as well; thus domain shift has occurred. If a single model has been trained to predict customers’ interests in a continual fashion, then in order to stay accurate, forgetting old, irrelevant data becomes mandatory.

Presenting the beneficial side of forgetting is imperative in order to understand that even with Continual Learning, forgetting past knowledge to some degree is desirable. This is because in many real scenarios, retaining all of the past knowledge might sometimes be detrimental to the present form of the task due to domain shift. This, of course, is heavily dependent on the needs of the task and domain itself.

3.3 Continual Learning methods

In order to solve the issue of catastrophic forgetting, the traditional approach has often been to retrain models using all of the available data for every task or domain that we want the model to evaluate; this is sometimes called Joint Training (though the definition of Joint Training seems to vary between publications) [23][16]. While this approach has been shown to be effective, it poses a plethora of issues [23]. First, in order to train an accurate model even for a single task, it is not uncommon to use datasets consisting of thousands of samples in order to have a diverse dataset. When dealing with a set of multiple tasks, this naturally results in a need for data for all of these tasks, and all of this data has to be available during training. This quickly leads to the need for enormous datasets, the storing of which takes a huge amount of resources. Second, in real-life situations the availability of any previous data is not given [23][64][2]. This can lead to situations where there is a need to further train a model, but no data used to train the base model is accessible, making it impossible to retrain the model from scratch using data for all of the required tasks. Finally, training ANNs can require a lot of time. Even training a model for a single task can take multiple days and is highly dependent on the size of the dataset used. This makes fully retraining a model, using an ever-expanding dataset, extremely time consuming.

While the above-mentioned Joint Training is a common way to “solve” catastrophic forgetting, many scholars do not regard it as a continual learning method [16]. This is because, looking at the case of Joint Training, we can see that even if the model is learning data from multiple domains and/or tasks, as the process is not done in sequential fashion it cannot be considered a case of Continual Learning (as per the definition in 3.1). In fact, as with Joint Training the model is trained with all the data, the model is always effectively trained on a single compound task that might consist of different tasks and domains. This means that instead of solving the issue of catastrophic forgetting, Joint Training tries to go around it by combining datasets.

Advancement in the research of Continual Learning has led to a multitude of different approaches to tackling the challenge of catastrophic forgetting. In recent literature, attempts have been made to categorize these different kinds of methods based on the type of implementation presented. The most common way is to divide the methods into three distinct groups: regularizing, dynamic architectures and rehearsal. However, even though the division is often made into these three groups, the terminology still varies. Most publications share the definition of the first group, regularization, though some exceptions apply. For the remaining two groups, while the logical presentation of these groups is in many cases consistent across the literature, the terminology does seem to vary. For example, dynamic architectures have been referred to as parameter isolation, and rehearsal as memory replay (although memory replay is actually a subcategory of rehearsal, as explained later). Furthermore, it has been suggested that a fourth group exists, consisting of other methods, which in many cases incorporate characteristics of several of the presented groups. This thesis will refer to these groups as: regularization, dynamic architectures, rehearsal and combination approaches. [23][60][64][2]

3.3.1 Regularization based approaches

Regularization approaches focus on alleviating catastrophic forgetting by introducing regularizing factors to the algorithm during the weight update [23][60]. Thus, the goal is to achieve a regularizing effect that keeps the updating of weights sufficiently plastic in order to learn novel information, while also staying stable and forgetting old information as little as possible. Mundt et al. [60] further divide this category into two subcategories: structural and functional. Structural regularizing methods, such as Elastic Weight Consolidation (EWC) and Synaptic Intelligence (SI), focus on regularization in all parts of the network, whereas functional methods, such as Learning without Forgetting (LwF), focus on keeping the output probability distribution close to the original values of the trained model [60]. The regularization is often based on adding additional regularizing terms to the training algorithm. These can be additional loss terms, often present in functional methods, or terms penalizing the update function on specific weight values in structural methods.

Elastic Weight Consolidation (EWC) is a method introduced by Kirkpatrick et al. [43] that is inspired by the human brain’s ability to preserve information by reducing the plasticity of synapses that are important to a task. To achieve continual learning in ANNs, this is simulated by regularizing the learning of a new task: weight updates are slowed down depending on the importance of each weight to the previous task(s). This means that updates to weights more important for old tasks are effectively given a lower learning rate, simulating the reduced plasticity. The importance of the weights is determined using the Fisher information matrix, calculated between the training of separate tasks. [43]
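The core of EWC can be sketched as a quadratic penalty added to the new task's loss; the helper below is a simplified illustration (the names old_params, fisher and lam are placeholders, and the Fisher values are faked), not the full method of Kirkpatrick et al. [43].

```python
import torch
import torch.nn as nn

def ewc_penalty(model, old_params, fisher, lam):
    """Quadratic EWC-style penalty: slows updates on weights deemed important
    for earlier tasks. In EWC the fisher values are diagonal Fisher information
    estimates computed after training the old task."""
    penalty = torch.zeros(())
    for name, p in model.named_parameters():
        penalty = penalty + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return 0.5 * lam * penalty

model = nn.Linear(4, 2)
# snapshot the "old task" solution; the importance estimates here are faked as ones
old_params = {n: p.detach().clone() for n, p in model.named_parameters()}
fisher = {n: torch.ones_like(p) for n, p in model.named_parameters()}

# the total loss for the new task would then be: task_loss + ewc_penalty(...)
print(ewc_penalty(model, old_params, fisher, lam=0.4))  # zero before any weight update
```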

The Synaptic Intelligence (SI) [85] approach is fairly similar to EWC in the sense that it is also based on plasticity reduction on important parameters. The strength of SI compared to EWC lies in the way SI calculates this importance measure, $w_k^\mu$, in an online fashion during training, which makes it less computationally heavy than EWC. The importance measure of a parameter corresponds to the contribution of that specific parameter to the total loss, measured at task $T_{t-1}$. When training $T_t$, this information is used to impose a penalty on parameters that had a high impact on the loss during the learning of the previous task. According to Zenke et al. [85], EWC and SI have shown similar performance on the MNIST dataset.

While the two examples explained above belong to the subgroup of structural regularizers, Learning without Forgetting [49] is a functional regularizing method. Thus, the basic idea behind LwF is to impose a regularizing factor on the loss function, instead of changes to specific weight parameters. This is based on the idea of knowledge distillation [36], where the output of the forward pass (of a larger network) for data is stored as soft labels, which are used to transfer knowledge to a smaller network by using these soft labels as target values (like ground-truth values) when training it. LwF uses a similar kind of approach, where the output of the forward pass for the new task’s data is stored as soft labels, which are effectively used as secondary ground-truth values during training of the new task, by adding an additional loss term to the loss function [49]. The goal of this is to have the output distribution change as little as possible compared to the original state of the model, thus preserving the knowledge of the original task. However, this comes with the downside that the performance depends on the relevance between tasks. This method is explained in depth in Chapter 5.
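The idea can be sketched as a loss with two terms: the usual loss on the new task plus a distillation term that keeps the new model's outputs close to the recorded soft labels. The function below is a simplified classification-style illustration (LwF as applied to captioning in Chapter 5 differs in detail); the temperature T and the balancing weight alpha are assumed hyper-parameters.

```python
import torch
import torch.nn.functional as F

def lwf_loss(new_logits, old_logits, targets, T=2.0, alpha=1.0):
    """LwF-style loss: new-task cross-entropy plus a distillation term that keeps
    the new outputs close to the original model's recorded soft labels."""
    task_loss = F.cross_entropy(new_logits, targets)
    distill_loss = F.kl_div(
        F.log_softmax(new_logits / T, dim=-1),  # current model, softened by temperature
        F.softmax(old_logits / T, dim=-1),      # frozen original model, softened
        reduction="batchmean",
    ) * (T * T)                                 # standard temperature scaling
    return task_loss + alpha * distill_loss

# toy usage: 8 samples, 5 classes; old_logits would come from the frozen original model
new_logits, old_logits = torch.randn(8, 5), torch.randn(8, 5)
targets = torch.randint(0, 5, (8,))
print(lwf_loss(new_logits, old_logits, targets))
```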

As the regularizing methods are usually based on adding loss terms that focus on preserving previous tasks, the downside in many cases is the computational complexity of generating the data needed to calculate these terms. Furthermore, as the loss on the new task and the added loss terms are usually balanced, typically via a weighting parameter, this introduces a trade-off between learning the new task and forgetting the previous ones, in some cases preventing optimal performance on the new task because the old knowledge needs to be preserved. [64]

3.3.2 Rehearsal based approaches

Rehearsal (and pseudo-rehearsal) based methods are based on interleaving data with a distribution similar to the old data into a new dataset during training. This is usually done either by finding and storing a suitable subset of a previously used dataset or by using a generator to create samples that resemble the old dataset. The generator-based approach is sometimes also called pseudo-rehearsal. As the whole dataset can be considered a subset of itself, the most naïve rehearsal-based approach is to store all of the used datasets for future use. As explained before, this Joint Training is an effective way to negate catastrophic forgetting but poses a multitude of issues that make it unfeasible to implement. [23][60][64]

A fine example of a rehearsal method is a generative method introduced by Shin et al. [72], Deep Generative Replay (DGR). This method utilizes a generative model that is trained along with the actual classifying model. When encountering a new task, this generative model, or scholar, is used during training to generate samples that resemble the previously encountered data, and these are interleaved with the new dataset. While training the scholar on a new task, the previously trained scholar can also be used to generate samples in order to retain the knowledge in the generative model. By adjusting the ratio of new and generated data, the weight placed on retaining old knowledge can be tuned. As shown in the paper, DGR is also compatible at least with regularizing methods: a combination of LwF and generated samples (DGR+LwF) proved to be an effective way to alleviate catastrophic forgetting.
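A minimal sketch of the solver's training step under generative replay is given below, assuming PyTorch; `old_generator`, `old_solver`, `solver`, `criterion`, and the replay ratio `r` are hypothetical names, with frozen copies of the previous generator and solver providing the replayed inputs and their targets.

```python
r = 0.5  # weight of real new-task data vs. replayed data

for x_new, y_new in new_task_loader:
    # Sample inputs that resemble old data and label them with the
    # frozen copy of the previous solver.
    x_replay = old_generator.sample(x_new.size(0))
    y_replay = old_solver(x_replay).argmax(dim=-1)

    loss = r * criterion(solver(x_new), y_new) \
        + (1 - r) * criterion(solver(x_replay), y_replay)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```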

As mentioned above, using a subset of a dataset is another example of rehearsal-based continual learning. These subsets are often also referred to as exemplars [60]. For these methods, there are two main ways to select the subset to be used:

1) Selecting the samples randomly from the dataset (e.g. [55]).

2) Selecting a subset of samples that approximates the distribution of the whole dataset as closely as possible (e.g. [67]).

Gradient Episodic Memory (GEM) [55] is one example where an effectively random selection of exemplars is used. With GEM, the model is allocated a memory budget of M for storing samples of previous task(s). The number of samples stored for each task, m, depends on the number of total tasks as per the equation m = M/T, where T is the number of known tasks. In case T is not known in advance, m is dynamically decreased as new tasks are encountered. As mentioned, the way GEM manages its exemplar selection (at least in its original paper) is random, as the last m samples of each task are stored. During training, GEM monitors the loss on these stored samples, allowing the loss to decrease while preventing it from increasing. This means that GEM supports positive transfer, that is, improvement of performance on old tasks, which for example the presented regularizing methods are not capable of. An improved version of GEM, called A-GEM [14], also exists, which is less expensive than GEM both computationally and in memory usage.
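The constraint of not increasing the loss on stored samples is enforced by projecting the proposed gradient. The sketch below, assuming PyTorch and flattened gradient vectors, shows the single-constraint (A-GEM-style) simplification; full GEM instead solves a small quadratic program with one constraint per previous task.

```python
import torch

def project_gradient(g, g_ref):
    # g:     proposed gradient on the new-task batch (flattened)
    # g_ref: gradient computed on the stored exemplars (flattened)
    dot = torch.dot(g, g_ref)
    if dot < 0:  # the update would increase the loss on stored samples
        g = g - (dot / torch.dot(g_ref, g_ref)) * g_ref
    return g
```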

An example of a method that uses exemplars selected in the second way is iCaRL [67]. iCaRL allocates a memory of size K to store exemplars. The set of exemplars is updated when new tasks are encountered, replacing some of the old exemplars with ones from the new dataset while never using more than K samples of memory. iCaRL also uses knowledge distillation during training to retain old knowledge.
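iCaRL selects the exemplars of each class with a "herding" procedure, greedily picking samples whose running feature mean stays closest to the class mean. A minimal sketch of that selection, assuming PyTorch and an [N, D] tensor of L2-normalized feature vectors for one class, could look as follows.

```python
import torch

def herding_selection(features, m):
    # Greedily pick m exemplar indices whose running mean of features
    # best approximates the mean feature vector of the class.
    class_mean = features.mean(dim=0)
    selected = []
    running_sum = torch.zeros_like(class_mean)
    for k in range(1, m + 1):
        # Distance of each candidate's hypothetical running mean
        # to the class mean.
        dists = ((class_mean - (running_sum + features) / k) ** 2).sum(dim=1)
        if selected:  # never pick the same sample twice
            dists[torch.tensor(selected)] = float("inf")
        idx = int(dists.argmin())
        selected.append(idx)
        running_sum += features[idx]
    return selected
```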

3.3.3 Dynamic Architectures

One of the more intuitive ways to address catastrophic forgetting is to increase the neural resources of the network as new tasks are introduced; approaches built on this idea are called Dynamic Architectures [64]. In some papers this is considered a subtype of methods referred to as parameter isolation [23], a category of methods that allocate different tasks to different parameters. However, this thesis treats dynamically changing network architectures as a group of its own.

One such method is neurogenesis deep learning (NDL) [24], which (as the name suggests) is inspired by the biological concept of neurogenesis, where new neurons are formed in the brain. NDL tries to achieve the goal of continual learning by adding new neural resources as more information becomes available and the existing resources are deemed insufficient for the task. To this end, the NDL architecture includes an encoder-decoder structure that tries to generate an output close to the input, in other words, reconstructing the input. To decide whether new neural resources should be allocated, a metric called reconstruction error (RE) is tracked, measuring the error between the original and reconstructed input. When this error exceeds a given threshold, new neurons are added. While training the network with the added neurons, NDL uses rehearsal, either with stored samples of old data or by using the generative power of the decoder to produce samples, thus resembling pseudo-rehearsal. The authors call this generator-based rehearsal intrinsic replay (IR).
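The triggering logic can be summarized in a few lines; the sketch below is only illustrative, assuming PyTorch, a hypothetical `add_neurons` method on the autoencoder, and a tunable threshold `re_threshold`.

```python
# After a forward pass of the autoencoder on incoming data x:
recon = autoencoder(x)
re = torch.mean((recon - x) ** 2)  # reconstruction error

if re > re_threshold:
    # Existing capacity is deemed insufficient: widen the chosen layer,
    # then retrain with stored samples and/or decoder-generated
    # "intrinsic replay" samples.
    autoencoder.add_neurons(layer_index, n_new_neurons)
```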


Another example of dynamic architectures is Dynamically Expandable Networks (DEN) [83]. The training process of DEN consists of three phases, not all of which are necessarily needed when training each task:

1) Selective retraining, where only the parameters that are important for the given task are retrained. If the desired performance is achieved by this alone, no further actions are needed.

2) If no loss lower than a selected threshold can be reached during selective retraining, DEN expands the network with new neurons. During training with the added neurons, sparsity regularization is used to determine which of them are important to the network while dropping the rest, which helps keep the model less resource-hungry.

3) If a significant shift between tasks is detected by observing the magnitude of change in the weights, the neurons that the change affected the most are duplicated and added to the corresponding layer. The method describes this as being followed by retraining the model, with weights initialized from the weights of the model at that time. While referred to as retraining, this more closely resembles fine-tuning or transfer learning.

As we can see, like NDL, DEN does not rely solely on the expanding architecture, but rather treats it as a "last resort" if the used regularization techniques do not yield a sufficiently powerful model. The sketch below illustrates this three-phase control flow.
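This is only a high-level illustration of the decision logic, not an implementation; all function names (`selective_retrain`, `expand_and_train`, `weight_drift`, `duplicate_and_finetune`) are hypothetical.

```python
def train_task_den(model, task_data, loss_threshold, drift_threshold):
    # 1) Retrain only the parameters relevant to the new task.
    loss = selective_retrain(model, task_data)
    if loss <= loss_threshold:
        return model

    # 2) Expand with new neurons; sparsity regularization prunes the
    #    added units that turn out to be unimportant.
    model = expand_and_train(model, task_data)

    # 3) If the weights drifted too much, duplicate the most affected
    #    neurons and fine-tune starting from the current weights.
    if weight_drift(model) > drift_threshold:
        model = duplicate_and_finetune(model, task_data)
    return model
```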

3.3.4 Combination-based approaches

A few scholars have pointed out that a fourth category of continual learning methods exists: combination-based approaches [60]. As seen in some of the presented approaches, like DEN and NDL, a combination of different approaches is often used to achieve better performance as well as to save resources. This is in line with Chen and Liu [16], who point out that it is very likely that a robust continual learning method cannot be implemented without incorporating multiple learning algorithms.

One might argue that when categorizing methods, we are in fact categorizing the individual approaches that any given method consists of. For this reason, the category "combination approaches" serves little to no purpose when categorizing different methods. Especially under the assumption that a combination of multiple approaches is needed to achieve truly autonomous continually learning systems, categorizing a method as a "combination method" would do very little to describe it. Given how new state-of-the-art approaches usually utilize several methods, describing such systems as "combination-based continual learning approaches" carries very little information, as no description of what kinds of approaches are actually used in combination is provided.

The examples presented here are among the most common and well-known continual learning algorithms. Thus, I want to remind the reader that a plethora of methods exists that are not presented in this thesis (e.g. [41][56]). As we can see, most of these methods have been presented fairly recently, with many first published between 2016 and 2020, which indicates the increasing interest in and need for solving catastrophic forgetting and achieving robust, autonomous continually learning systems.

A major focus of this thesis is on the presented regularizing method, LwF, which is a simple yet effective algorithm for continual learning. In Chapter 5, a more in-depth look into the method is presented, applied to a type of audio classification problem called Audio Captioning, which will be presented in Chapter 4.

3.4 Evaluation of Continual Learning

As Continual Learning is still arguably in its infancy, the evaluation of methods is not yet very consistent between the experiments shown in various research papers [64]. Furthermore, the number of datasets specifically tailored to evaluating Continual Learning is still small. One example of such a dataset is CORe50, which includes 50 different classes of everyday objects in 10 different categories [53].

While a few continual learning datasets do exist, most continual learning research has been using the same datasets that are commonly used to evaluate ANNs in general, especially common benchmarks like MNIST [64][28]. However, with a continual learning approach there are slight changes to how these datasets are used. For example, MNIST is often used as "Split MNIST", where the dataset is split into five different subsets, each of which consists of samples of only 2 of the 10 classes present in the whole MNIST dataset [28][39]. Thus, for example, the first subset consists of samples with annotated classes of 0 or 1. This Split MNIST has been utilized to evaluate continual learning in at least the following two ways [28][39]:

1) Multi-headed, where the model is essentially taught five different binary classifiers and the task identifier is given to the model during inference. In other words, the model knows which of the five tasks it is performing and only needs to discriminate between the two classes of that task.

2) Single-headed, where the model has a single output layer covering all ten classes and no task identifier is given during inference.

A minimal sketch of how such a split can be constructed is shown below.
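The split itself is straightforward to construct; this sketch assumes PyTorch and torchvision.

```python
import torch
from torch.utils.data import Subset
from torchvision import datasets

mnist = datasets.MNIST(root="data", train=True, download=True)

# Five binary tasks: (0,1), (2,3), (4,5), (6,7), (8,9).
tasks = []
for a, b in [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]:
    mask = (mnist.targets == a) | (mnist.targets == b)
    tasks.append(Subset(mnist, torch.where(mask)[0].tolist()))
```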
