3 Continual Learning

3.3 Continual Learning methods

In order to solve the issue of catastrophic forgetting, the traditional approach has been to retrain models using all of the available data for every task or domain the model is meant to handle. This is sometimes called Joint Training (though the definition of Joint Training seems to vary between publications) [23][16]. While this approach has been shown to be effective, it poses a plethora of issues [23]. First, in order to train an accurate model even for a single task, it is not uncommon to use datasets that consist of thousands of samples so that the data is sufficiently diverse. Dealing with a set of multiple tasks naturally requires data for all of these tasks, and all of this data has to be available during training. This quickly leads to enormous datasets, the storage of which takes a huge amount of resources. Second, in real-life situations the availability of any previous data is not a given [23][64][2]. This can lead to situations where a model needs further training, but the data used to train the base model is no longer accessible, making it impossible to retrain the model from scratch using data for all of the required tasks. Finally, training ANNs can require a lot of time. Even training a model for a single task can take multiple days and is highly dependent on the size of the dataset used. This makes fully retraining a model on an ever-expanding dataset extremely time-consuming.

While the above-mentioned Joint Training is a common way to “solve” catastrophic forgetting, many scholars do not regard it as a continual learning method [16]. This is because, even though the model learns data from multiple domains and/or tasks, the process is not done in a sequential fashion, so it cannot be considered a case of Continual Learning (as per the definition in 3.2). In fact, since with Joint Training the model is trained with all the data at once, the model is always effectively trained on a single compound task that might consist of different tasks and domains. This means that instead of solving the issue of catastrophic forgetting, Joint Training goes around it by combining datasets.

Advancement in the research of Continual Learning has led to a multitude of different approaches to tackle the challenge of catastrophic forgetting. In recent literature, attempts have been made to categorize these methods based on the type of implementation presented. The most common way is to divide the methods into three distinct groups: regularization, dynamic architectures and rehearsal. However, even though the division is often made into these three groups, the terminology still varies. Most publications share a common definition of the first group, regularization, though some exceptions apply. For the remaining two groups, while the logical presentation is in many cases consistent between publications, the terminology does vary. For example, dynamic architectures have been referred to as parameter isolation, and rehearsal as memory replay (although memory replay is actually a subcategory of rehearsal, as explained later). Furthermore, it has been suggested that a fourth group exists, consisting of other methods that in many cases incorporate characteristics of multiple of the presented groups. This thesis will refer to these groups as: regularization, dynamic architectures, rehearsal and combination approaches. [23][60][64][2]

3.3.1 Regularization based approaches

Regularization approaches focus on alleviating catastrophic forgetting by introducing regularizing factors to the algorithm during the weight update [23][60]. The goal is a regularizing effect that keeps the weight updates sufficiently plastic to learn novel information, while also staying stable and forgetting old information as little as possible. Mundt et al. [60] further divide this category into two subcategories: structural and functional. Structural regularizing methods, such as Elastic Weight Consolidation (EWC) and Synaptic Intelligence (SI), focus on regularization in all parts of the network, whereas functional methods, such as Learning without Forgetting (LwF), focus on keeping the output probability distribution close to the original values of the trained model [60]. The regularization is often based on adding additional regularizing terms to the training algorithm. These can be additional terms in the loss, as is common in functional methods, or terms that penalize the update of specific weight values, as in structural methods.

Elastic Weight Consolidation (EWC) is a method introduced by Kirkpatrick et al. [43] that is inspired by the human brain’s ability to preserve information by reducing the plasticity of synapses that are important to a task. To achieve continual learning in an ANN, this is simulated by regularizing the learning of a new task: weight updates are slowed down depending on the importance of each weight to previous task(s). In effect, updates on weights that are more important for old tasks are given a lower learning rate, simulating the reduced plasticity. The importance of weights is determined using the Fisher information matrix, which is calculated between the training of separate tasks. [43]
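The penalty described above can be written out as a quadratic regularization term, following Kirkpatrick et al. [43]. In the sketch below, $\mathcal{L}_B$ is the loss of the new task $B$, $\theta^*_{A,i}$ are the weights learned for the old task $A$, $F_i$ is the (diagonal of the) Fisher information matrix and $\lambda$ is a hyperparameter balancing old knowledge against new:

```latex
\mathcal{L}(\theta) \;=\; \mathcal{L}_B(\theta)
  \;+\; \sum_i \frac{\lambda}{2}\, F_i \left(\theta_i - \theta^*_{A,i}\right)^2
```

The larger $F_i$ is for a given weight, the more that weight contributed to the old task, and the stronger the pull back towards its old value $\theta^*_{A,i}$.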

The Synaptic Intelligence (SI) [85] approach is fairly similar to EWC in the sense that it is also based on reducing the plasticity of important parameters. The strength of SI compared to EWC lies in the way SI calculates this importance measure, 𝑤𝑘𝜇, in an online fashion during training, which makes it computationally lighter than EWC. The importance measure of a parameter corresponds to the contribution of that parameter to the total loss, measured at task 𝑇𝑡−1. When training task 𝑇𝑡, this information is used to impose a penalty on parameters that had a high impact on the loss during learning of the previous task. According to Zenke et al. [85], EWC and SI have shown similar performance on the MNIST dataset.
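A sketch of the SI surrogate loss, following the notation of Zenke et al. [85] (the exact symbols here are a reconstruction from that paper, not from this thesis): the per-parameter contributions $w_k^\nu$ are accumulated into an importance weight $\Omega_k^\mu$, normalized by the total parameter displacement $\Delta_k^\nu$ during each task (with a damping term $\xi$), and a quadratic penalty anchored at the previous parameter values $\tilde{\theta}_k$ is added to the task loss $L_\mu$ with strength $c$:

```latex
\Omega_k^\mu \;=\; \sum_{\nu < \mu} \frac{w_k^\nu}{(\Delta_k^\nu)^2 + \xi},
\qquad
\tilde{L}_\mu \;=\; L_\mu \;+\; c \sum_k \Omega_k^\mu \left(\tilde{\theta}_k - \theta_k\right)^2
```

Because $w_k^\nu$ is accumulated along the training trajectory itself, no separate pass over the data (as with the Fisher matrix in EWC) is required.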

While the two examples explained above belong to the subgroup of structural regularizers, Learning without Forgetting (LwF) [49] is a functional regularizing method. The basic idea behind LwF is to impose a regularizing factor on the loss function instead of changing specific weight parameters. This is based on the idea of knowledge distillation [36], where the output of a forward pass of a larger network is stored as soft labels, which are then used as target values (like ground-truth values) to transfer knowledge to a smaller network. LwF uses a similar approach: the output of a forward pass on the new task’s data is stored as soft labels, which are effectively used as secondary ground-truth values during training of the new task by adding an additional loss term to the loss function [49]. The goal is to keep the output distribution as close as possible to the original state of the model, thus preserving the knowledge of the original task. However, this comes with the downside that the performance is dependent on the relevance between tasks. This method will be explained in depth in Chapter 5.
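To make the combined loss concrete, the following NumPy sketch adds a distillation term, computed against the recorded soft labels, to an ordinary cross-entropy loss on the new task. The function names, the weighting parameter `lam` and the temperature `T` are illustrative choices for this sketch, not values prescribed by the LwF paper:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax; a higher T yields softer distributions."""
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(p_target, p_pred, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted one."""
    return -np.sum(p_target * np.log(p_pred + eps))

def lwf_loss(logits_new_head, y_true, logits_old_head, soft_labels,
             lam=1.0, T=2.0):
    """LwF-style combined loss: hard loss on the new task plus a
    distillation term keeping the old head close to the recorded
    soft labels (hypothetical sketch)."""
    hard = cross_entropy(y_true, softmax(logits_new_head))
    distill = cross_entropy(softmax(soft_labels, T),
                            softmax(logits_old_head, T))
    return hard + lam * distill
```

Setting `lam` higher shifts the balance towards preserving the old task's output distribution, which is exactly the stability-plasticity trade-off discussed below.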

As regularizing methods are usually based on adding additional loss terms that focus on preserving previous tasks, their downside in many cases seems to be the computational complexity of generating the relevant data to calculate the loss with. Furthermore, as the loss of the new task and the added loss terms are usually balanced, typically with a weighting parameter, this often introduces a trade-off between learning a new task and forgetting the previous one, in some cases preventing optimal performance on the new task because the old knowledge needs to be preserved. [64]

3.3.2 Rehearsal based approaches

Rehearsal (and pseudo-rehearsal) based methods are based on interleaving data with a similar distribution to old data into the new dataset during training. This is usually done either by finding and storing a suitable subset of a previously used dataset, or by using a generator to create samples that resemble the old dataset. The generator-based approach is sometimes also called pseudo-rehearsal. As a whole dataset can be considered a subset of itself, the most naïve rehearsal-based approach is to store all of the used datasets for future use. As explained before, this Joint Training is an effective way to negate catastrophic forgetting but poses a multitude of issues that make it infeasible to implement. [23][60][64]

A fine example of a rehearsal method is the generative method introduced by Shin et al. [72], Deep Generative Replay (DGR). This method utilizes a generative model, or scholar, that is trained along with the actual classifying model. When encountering a new task, the scholar is used during training to generate samples that resemble the previously encountered data, and these samples are interleaved with the new dataset. While training the scholar on a new task, the previously trained scholar can also be used to generate samples in order to retain the knowledge in the generative model. By adjusting the ratio of new to generated data, the weight placed on retaining old knowledge can be tuned. As shown in the paper, DGR is also compatible at least with regularizing methods: a method combining LwF with generated samples (DGR+LwF) proved to be an effective way to alleviate catastrophic forgetting.
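The interleaving of new and generated data can be illustrated with a minimal sketch. Here the `generator` callable stands in for the trained scholar, and all names and the `replay_ratio` parameter are hypothetical choices of this illustration, not the DGR implementation:

```python
import random

def make_replay_batch(new_samples, generator, replay_ratio=0.5):
    """Build a training batch that interleaves fresh task data with
    pseudo-samples drawn from a generator (the 'scholar' in DGR terms).
    replay_ratio controls how many generated samples are mixed in,
    relative to the number of new samples."""
    n_replay = int(len(new_samples) * replay_ratio)
    batch = list(new_samples) + [generator() for _ in range(n_replay)]
    random.shuffle(batch)  # interleave old-like and new data
    return batch
```

Raising `replay_ratio` places more weight on retaining old knowledge, at the cost of fewer gradient updates devoted to the new task per batch.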

As mentioned above, using a subset of a dataset is another way to implement rehearsal-based continual learning. These subsets are often referred to as exemplars [60]. For these methods, there seem to be two main ways to select the subset to be used:

1) Selecting the samples randomly from the dataset (e.g. [55]).

2) Selecting a subset of samples that approximates the distribution of the whole dataset as closely as possible (e.g. [67]).

Gradient Episodic Memory (GEM) is one example where an effectively random selection of exemplars is used [55]. With GEM, the model is allocated a memory budget M for storing samples of previous task(s). The number of samples stored for each task, m, depends on the total number of tasks as per the equation m = M/T, where T is the number of known tasks. In case T is not known a priori, m is dynamically decreased as new tasks are encountered. As mentioned, the way GEM manages its exemplar selection (at least in its original paper) is random, as the last m samples of each task are stored. During training, GEM monitors the loss changes on these stored samples, allowing the loss to decrease while preventing it from increasing. This means that GEM supports positive transfer, that is, improvement of performance on old tasks, which for example the presented regularizing methods are not capable of. An improved version of GEM, called A-GEM [14], also exists that is less expensive than GEM both computationally and in memory usage.
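The budget rule m = M/T, with m shrinking as new tasks arrive, can be sketched as follows. The class and method names are hypothetical, written only to illustrate the memory management, not taken from the GEM code:

```python
class EpisodicMemory:
    """Fixed total budget M split evenly across the tasks seen so far,
    mirroring m = M / T; each task keeps its last m samples."""

    def __init__(self, budget_m):
        self.M = budget_m
        self.tasks = []  # one list of stored samples per task

    def add_task(self, samples):
        """Register a new task and rebalance the per-task quota m."""
        self.tasks.append(list(samples))
        m = self.M // len(self.tasks)  # quota shrinks as tasks arrive
        # keep only the last m samples of every task (effectively random)
        self.tasks = [t[-m:] for t in self.tasks]

    def stored(self):
        """All exemplars currently held, across every task."""
        return [s for t in self.tasks for s in t]
```

With a budget of M = 6, one task stores six exemplars; after a second task arrives, each task is trimmed to three, so the total never exceeds M.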

An example of a method that selects exemplars with the second strategy is iCaRL [67]. iCaRL allocates a memory of size K to store exemplars. The set of exemplars is then updated when new tasks are encountered, replacing some of the old exemplars with ones from the new dataset while never using more than K memory to store samples. iCaRL also uses knowledge distillation during training to retain old knowledge.
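The distribution-approximating selection can be sketched as a greedy, herding-style procedure in the spirit of iCaRL: repeatedly pick the sample that keeps the running mean of the chosen exemplars closest to the mean of the whole set. The function name is hypothetical, and the sketch assumes feature vectors have already been extracted:

```python
import numpy as np

def select_exemplars(features, k):
    """Greedily choose k indices so that the mean of the chosen feature
    vectors stays as close as possible to the mean of all features
    (herding-style sketch; not the authors' implementation)."""
    features = np.asarray(features, dtype=float)
    class_mean = features.mean(axis=0)
    chosen = []
    total = np.zeros_like(class_mean)  # sum of chosen feature vectors
    for _ in range(k):
        # candidate running mean if each sample were added next
        cand_means = (total + features) / (len(chosen) + 1)
        dists = np.linalg.norm(class_mean - cand_means, axis=1)
        dists[chosen] = np.inf  # never pick the same index twice
        idx = int(np.argmin(dists))
        chosen.append(idx)
        total += features[idx]
    return chosen
```

Under this criterion, a sample lying exactly at the class mean is selected first, since it alone already approximates the distribution's center.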

3.3.3 Dynamic Architectures

One of the more intuitive ways to solve catastrophic forgetting is to increase the neural resources of the network as new tasks are introduced; this is called Dynamic Architectures [64]. As the name implies, this idea works as the basis for these kinds of approaches. In some papers this is considered a subtype of a category referred to as parameter isolation [23], which covers methods that allocate different tasks to different parameters. However, this thesis refers to dynamically changing network architectures as a group of its own.

One such method is neurogenesis deep learning (NDL) [24], which (as the name suggests) is inspired by the biological concept of neurogenesis, where new neurons are formed in the brain. NDL tries to achieve the goal of continual learning by adding new neural resources as more information becomes available and the existing resources are deemed insufficient for the task. To achieve this, the NDL architecture includes an encoder-decoder structure that tries to generate an output close to the input, in other words, reconstructing the input. To decide whether new neural resources should be allocated, a metric called reconstruction error (RE) is tracked; it measures the error between the original and the reconstructed input. When this error reaches a given threshold, new neurons are generated. When training the network with added neurons, NDL uses rehearsal, either replaying stored samples of old data or using the generative power of the decoder to create samples, thus resembling pseudo-rehearsal. The authors call this generator intrinsic replay (IR).
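The RE-triggered growth decision can be sketched as follows, assuming a callable autoencoder that maps a batch of inputs to their reconstructions. The function name, the mean-squared-error choice for RE and the threshold parameter are assumptions of this sketch, not details from the NDL paper:

```python
import numpy as np

def needs_new_neurons(autoencoder, batch, threshold):
    """NDL-style capacity check: compute the reconstruction error (RE)
    of the encoder-decoder on incoming data and compare it against a
    threshold that triggers neuron growth."""
    batch = np.asarray(batch, dtype=float)
    recon = autoencoder(batch)  # forward pass: reconstruct the input
    re = float(np.mean((batch - recon) ** 2))
    return re > threshold, re
```

A perfect reconstruction yields RE = 0 and no growth; once incoming data differs enough from what the network can reconstruct, the threshold is crossed and new neurons would be added.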

Another example of dynamic architectures is Dynamically Expandable Networks (DEN) [83]. The training process of DEN consists of three phases, not all of which are necessarily needed when training each task:

1) Selective retraining trains only the parameters that are of importance to a given task. If the desired performance can be achieved by this alone, no further actions are needed.

2) If no loss lower than the selected threshold can be reached during selective retraining, DEN appends new neurons to the network. During training with the added neurons, sparsity regularization is used to determine which added neurons are important to the network while dropping the rest, which helps make the model less resource-hungry.

3) If a significant shift between tasks is detected by observing the magnitude of change in the weights, the neurons that the change affected the most are duplicated and added to the corresponding layer. The method describes this as being followed by retraining the model, with the weights initialized using the weights of the model at that time. While referred to as retraining, this more closely resembles fine-tuning or transfer learning.

As we can see, like NDL, DEN does not rely on the expanding architecture alone; rather, it treats expansion as a “last resort” if the used regularization techniques do not yield a sufficiently powerful model.

3.3.4 Combination-based approaches

A few scholars have pointed out that a fourth category of continual learning methods exists: combination-based approaches [60]. As seen in some of the presented approaches, such as DEN and NDL, a combination of different approaches is often used to achieve better performance as well as to save resources. This is in line with Chen and Liu [16], who point out that it is very likely that a robust continual learning method cannot be implemented without incorporating multiple learning algorithms.

One might argue that when categorizing methods, we are in fact categorizing the individual approaches that any given method consists of. For this reason, the category “combination approaches” serves little to no purpose when categorizing different methods. Especially under the assumption that a combination of multiple approaches is needed to achieve truly autonomous continually learning systems, categorizing a method as a “combination method” would do very little to describe it.

Given how new state-of-the-art approaches usually utilize several methods, describing such systems as “combination-based continual learning approaches” carries very little information, as no description of what kinds of approaches are actually used in combination is provided.

The examples presented here are among the most common and well-known continual learning algorithms. Thus, I want to remind the reader that there exists a plethora of methods that are not presented in this thesis (e.g. [41][56]). As we can see, most of these methods have been presented fairly recently, with many first published between 2016 and 2020, which indicates the increased interest in and need for solving catastrophic forgetting and achieving robust, autonomous, continually learning systems.

A major focus of this thesis is on the presented regularizing method, LwF, which is a simple yet effective algorithm for continual learning. Chapter 5 presents a more in-depth look into the method, applied to a type of Audio Classification problem called Audio Captioning, which is introduced in Chapter 4.