


6.3 Training, hyper-parameters and evaluation

To evaluate the effectiveness of the proposed continual learning method, three distinct baseline scenarios are considered. For the first two, the performance on 𝔻_new of a model trained on 𝔻_ori and the performance on 𝔻_ori of a model optimized on 𝔻_new are both needed, in order to establish the baseline performance of each model without any further training applied. Furthermore, further training the model pre-trained on 𝔻_ori by naïve fine-tuning on 𝔻_new gives the baseline for how the model behaves when it is further trained. Arguably, the fine-tuning result is the most important reference when considering whether the applied continual learning method is effective, as catastrophic forgetting in neural networks occurs during naïve fine-tuning. To sum up, this research considers the following three baseline scenarios:

1. WT pre-optimized on the Clotho dataset and evaluated on both Clotho and AudioCaps
2. WT pre-optimized on AudioCaps and evaluated on both Clotho and AudioCaps
3. WT pre-optimized on Clotho, fine-tuned with AudioCaps, and evaluated on both Clotho and AudioCaps

The above cases are termed WT_cl-au, WT_au-cl and WT_cl-ft, respectively.

During the pre-training of M_base, the model is trained using the training split of the corresponding dataset (e.g. WT_cl-au is pre-trained on Clotho's training split). The training process monitors the validation split of the same dataset, and the final optimized M_base is considered reached when no improvement in the SPIDEr score is detected for 10 consecutive epochs, after which early stopping is applied. Adam is used as the optimizer, with the hyper-parameters proposed in [79]. Furthermore, a temperature hyper-parameter is used for the final softmax classifier during the training of M_new; in this thesis, a temperature value of 2 is used.
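
As an illustration, the sketch below shows how a temperature-scaled softmax with T = 2 could be implemented in PyTorch. This is a minimal sketch; the function and tensor names are illustrative and do not come from the thesis code.

    import torch
    import torch.nn.functional as F

    def softened_probabilities(logits: torch.Tensor, temperature: float = 2.0) -> torch.Tensor:
        """Temperature-scaled softmax over the last (vocabulary) dimension.

        A temperature above 1 softens the output distribution, which is the usual
        choice when the outputs are used as soft targets for regularization.
        """
        return F.softmax(logits / temperature, dim=-1)

    # Illustrative usage with logits over a small vocabulary.
    logits = torch.tensor([[2.0, 1.0, 0.5, 0.1, -1.0]])
    print(softened_probabilities(logits))  # softer than F.softmax(logits, dim=-1)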

Additionally, the hyper-parameters λ and B, related to the method presented in Section 5.1.2, are considered during evaluation. The values used are λ = 0.70, 0.75, …, 0.95, 1.0 and B = 4, 8, 12. The experiments cover the full grid of combinations of these two parameters, resulting in a total of 21 experiment cases.
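
For concreteness, the sketch below enumerates this grid; the dictionary keys are illustrative and not part of the thesis code.

    from itertools import product

    # Hyper-parameter grid: 7 values of lambda times 3 batch sizes = 21 experiment cases.
    lambdas = [0.70, 0.75, 0.80, 0.85, 0.90, 0.95, 1.00]
    batch_sizes = [4, 8, 12]

    experiment_cases = [{"lambda": lam, "B": b} for lam, b in product(lambdas, batch_sizes)]
    assert len(experiment_cases) == 21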

The model pre-trained with Clotho uses the pre-optimized weight values as given in [79].

In order to simulate the incoming data stream S, mini-batches of size B are sampled from AudioCaps. Furthermore, in order to simulate continual learning from a data stream, each AudioCaps sample is used for training only once (i.e. training lasts a single epoch). The performance of M_new is evaluated after every 50th, 75th and 150th weight update for B = 12, 8 and 4, respectively; effectively, the model is evaluated after every 600 individual samples regardless of the batch size. This keeps the number of evaluations during training the same for every batch size and also reduces the time required to train a single epoch, given that evaluating the model in between training batches takes a fair amount of time. A final evaluation is done after training with all the samples in AudioCaps. SPIDEr scores are used as the evaluation metric.
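
A minimal sketch of such a single-epoch pass over the simulated stream, with the evaluation cadence derived from the batch size, is given below. The `train_step` and `evaluate` callables and the dataset object are placeholders, not the thesis implementation.

    from torch.utils.data import DataLoader

    SAMPLES_PER_EVALUATION = 600  # evaluate after every 600 processed samples

    def train_one_epoch_on_stream(model, new_dataset, batch_size, train_step, evaluate):
        """Single pass over the new dataset, simulating a data stream.

        `train_step` performs one weight update on a mini-batch and `evaluate`
        computes the SPIDEr scores; both are placeholders here.
        """
        loader = DataLoader(new_dataset, batch_size=batch_size, shuffle=True)
        eval_every = SAMPLES_PER_EVALUATION // batch_size  # 150, 75 or 50 updates for B = 4, 8, 12
        for step, batch in enumerate(loader, start=1):
            train_step(model, batch)
            if step % eval_every == 0:
                evaluate(model)
        evaluate(model)  # final evaluation after the whole epoch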

7 Results

In Table 1, the SPIDEr scores on 𝔻_ori and 𝔻_new are shown for the three presented baseline scenarios. As can be seen from Table 1, for both WT_cl-au and WT_au-cl the performance on 𝔻_ori is significantly better than on 𝔻_new. This is a clear indication that a model trained on only one of the datasets does not perform satisfactorily on both. Furthermore, the third baseline scenario of fine-tuning the model, WT_cl-ft, shows clear signs of catastrophic forgetting: while the performance on 𝔻_new improves, the performance on 𝔻_ori decreases significantly. Thus, the need for applying a continual learning method to reduce the magnitude of catastrophic forgetting becomes apparent.

Table 1: Baseline scenarios.

Table 2: Results. The highest performance on 𝔻_ori and 𝔻_new is indicated in bold.

Table 2 shows the corresponding SPIDEr scores for each pair of λ and B. First, the batch size used during training appears to have some impact on the results. More specifically, lower values of B result in lower performance on 𝔻_ori: with B = 4, the 𝔻_ori score is lower regardless of the value of λ, with the exception of λ = 1. The same trend mostly holds when comparing B = 8 and B = 12. This indicates that the batch size used for the data stream S introduces some amount of regularization to the training process and affects the model's ability to preserve old information from 𝔻_ori. This can most likely be explained by the fact that the batch size, by design, determines the frequency of weight updates applied to the model: lower batch sizes result in more frequent, and thus faster, changes to the model's parameters. In this experiment, instead of training until convergence, training is done in a single epoch, so with higher values of B the model undergoes fewer parameter updates than with lower values; this differs from training until convergence, where as many epochs as needed can be used. This effect can be seen in Table 2 as a faster shift towards higher performance on 𝔻_new, which in turn decreases the performance on 𝔻_ori.

Looking at Table 2, the impact of λ can also be observed, and it mostly follows the trend expected from Equation 5.6 in Section 5.1.2. Since λ introduces the factors 1 − λ and λ on ℒ_new and ℒ_reg, respectively, using higher values of λ gradually reduces the contribution of ℒ_new to the total loss towards 0, while giving more weight to ℒ_reg. The expected behaviour is therefore that the training process ignores ℒ_new completely when λ = 1, focusing entirely on keeping the output similar to that of the pre-trained model. Table 2 shows that this is mostly what happens, as with λ = 1 only a small change in the SPIDEr scores is observed.

An interesting thing to note, however, is that some change still occurs when using only the loss ℒ_reg: the performance still improves slightly on 𝔻_new while decreasing slightly on 𝔻_ori. The values of λ < 1 also follow the expected results. Table 2 clearly shows that when lower values of λ are used, the SPIDEr score on 𝔻_new generally becomes higher, while the score on 𝔻_ori becomes lower.
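
To make the weighting concrete, the sketch below shows one way the total loss could be combined in PyTorch. The exact form of ℒ_reg used here (a temperature-scaled distillation term against the frozen pre-trained model's outputs) is an assumption for illustration and does not necessarily match Equation 5.6 verbatim.

    import torch.nn.functional as F

    def combined_loss(new_logits, old_model_logits, targets, lam=0.8, temperature=2.0):
        """Total loss = (1 - lambda) * L_new + lambda * L_reg.

        L_new: cross-entropy against the new-task targets.
        L_reg: distillation-style term keeping the softened outputs close to those
               of the frozen pre-trained model (assumed form, for illustration).
        """
        loss_new = F.cross_entropy(new_logits, targets)
        log_p_new = F.log_softmax(new_logits / temperature, dim=-1)
        p_old = F.softmax(old_model_logits / temperature, dim=-1)
        loss_reg = F.kl_div(log_p_new, p_old, reduction="batchmean") * temperature ** 2
        return (1.0 - lam) * loss_new + lam * loss_reg

With lam = 1.0 the first term vanishes and training relies only on keeping the outputs close to those of the pre-trained model, matching the λ = 1 behaviour discussed above.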

Table 2 shows that the best performance on 𝔻_ori is achieved using B = 12 and λ = 0.8, while the highest SPIDEr score on 𝔻_new is achieved with B = 4 and λ = 0.7. This follows the previously described behaviour related to B and λ, especially with regards to B. Considering the above expectations, the best 𝔻_new model, with SPIDEr = 0.239, follows the trend that combining the highest frequency of updates with the least weight on ℒ_reg results in the best performance in the new domain. However, the hyper-parameter setup that produces the best 𝔻_ori model differs slightly from the expected case of achieving the best model by ignoring ℒ_new completely with λ = 1. As shown in Table 2, the highest performance on 𝔻_ori is instead achieved with λ = 0.8 and B = 12. This is rather surprising, as it indicates some degree of improvement even compared to the baseline WT_cl-au, with SPIDEr = 0.186, while also improving significantly on 𝔻_new, with an achieved SPIDEr of 0.157.

The overall results show that applying the suggested method to an AAC model achieves some degree of continual learning, balancing the performance between 𝔻_new and 𝔻_ori.

8 Conclusion

In this thesis and the related paper, the first case study of a continual learning method applied to an AAC model was introduced. The method used was Learning without Forgetting (LwF), a simple yet effective regularization-based continual learning method. The method was applied to an AAC model called WaveTransformer, which has previously achieved state-of-the-art performance in AAC tasks.

In order to evaluate the performance of the presented method, two freely available AAC datasets were utilized: Clotho and AudioCaps. The base model was originally trained using Clotho, while a simulated data stream from AudioCaps was used to further train the model. The goal was to obtain a model that learns knowledge from the new dataset while minimizing the catastrophic forgetting that occurs during training. The evaluation was done by comparing the results obtained with the presented method to three baseline scenarios consisting of the original results of the model as well as naïvely fine-tuned models. The evaluation score used for the comparison is SPIDEr, a metric commonly used in AAC research.

The experiment presented in this thesis showed that applying LwF to an AAC model does yield a continually learning model, one that learns to perform better in the new data domain while degrading less in the original domain than with simple fine-tuning.

This thesis can be considered a proof of concept with regards to the effectiveness of continual learning in AAC. While the LwF method does achieve some degree of continual learning, it is quite simple, and more sophisticated methods introduced during the last few years could provide even better results. Furthermore, although the presented case uses two different datasets, their domains overlap. Thus, there are a few distinct points that could be further researched in the future. First, as there are plenty of more recent continual learning methods available, applying different methods to AAC models could prove to be effective. Second, in order to evaluate the effectiveness of continual learning more thoroughly, datasets from more distinct domains could be used. Finally, this thesis covered a case where only the words common to both datasets were considered, leaving a need for a method that can also learn completely new classes in a continual manner.

References

[1] Adavanne, Sharath, Pasi Pertilä, and Tuomas Virtanen. "Sound event detection using spatial features and convolutional recurrent neural network." 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017.

[2] Aljundi, Rahaf, et al. "Gradient based sample selection for online continual learning." arXiv preprint arXiv:1903.08671 (2019).

[3] Allen, Jont B., and Lawrence R. Rabiner. "A unified approach to short-time Fourier analysis and synthesis." Proceedings of the IEEE 65.11 (1977): 1558-1564.

[4] Allred, Jason M., and Kaushik Roy. "Controlled forgetting: Targeted stimulation and dopaminergic plasticity modulation for unsupervised lifelong learning in spiking neural networks." Frontiers in neuroscience 14 (2020): 7.

[5] Amato, Filippo, et al. "Artificial neural networks in medical diagnosis." (2013): 47-58.

[6] Anderson, Peter, et al. "Spice: Semantic propositional image caption evaluation." European conference on computer vision. Springer, Cham, 2016.

[7] Apicella, Andrea, et al. "A survey on modern trainable activation functions." Neural Networks (2021).

[8] Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).

[9] Berg, Jan, and Konstantinos Drossos. "Continual Learning for Automated Audio Captioning Using The Learning Without Forgetting Approach." arXiv preprint arXiv:2107.08028 (2021).

[10] Bishop, Christopher M. Neural networks for pattern recognition. Oxford university press, 1995.

[11] Bracewell, Ronald Newbold, and Ronald N. Bracewell. The Fourier transform and its applications. Vol. 31999. New York: McGraw-Hill, 1986.

[12] Buciluǎ, Cristian, Rich Caruana, and Alexandru Niculescu-Mizil. "Model compression." Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. 2006.

[13] Chaudhari, Sneha, et al. "An attentive survey of attention models." ACM Transactions on Intelligent Systems and Technology (TIST) 12.5 (2021): 1-32.

[14] Chaudhry, Arslan, et al. "Efficient lifelong learning with a-gem." arXiv preprint arXiv:1812.00420 (2018).

[15] Chaudhry, Arslan, et al. "Riemannian walk for incremental learning: Understanding forgetting and intransigence." Proceedings of the European Conference on Computer Vision (ECCV). 2018.

[16] Chen, Zhiyuan, and Bing Liu. "Lifelong machine learning." Synthesis Lectures on Artificial Intelligence and Machine Learning 12.3 (2018): 1-207.

[17] Cho, Kyunghyun, et al. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." arXiv preprint arXiv:1406.1078 (2014).

[18] Cho, Kyunghyun, et al. "On the properties of neural machine translation: Encoder-decoder approaches." arXiv preprint arXiv:1409.1259 (2014).

[19] Ciregan, Dan, Ueli Meier, and Jürgen Schmidhuber. "Multi-column deep neural networks for image classification." 2012 IEEE conference on computer vision and pattern recognition. IEEE, 2012.

[20] Dai, Zihang, et al. "CoAtNet: Marrying Convolution and Attention for All Data Sizes." arXiv preprint arXiv:2106.04803 (2021).

[21] da Silva, Ivan Nunes, et al. Artificial Neural Networks: A Practical Course. 1st ed. Cham: Springer International Publishing, 2017. Web.

[22] Davis, Steven, and Paul Mermelstein. "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences." IEEE transactions on acoustics, speech, and signal processing 28.4 (1980): 357-366.

[23] Delange, Matthias, et al. "A continual learning survey: Defying forgetting in classification tasks." IEEE Transactions on Pattern Analysis and Machine Intelligence (2021).

[24] Draelos, Timothy J., et al. "Neurogenesis deep learning: Extending deep networks to accommodate new classes." 2017 International Joint Conference on Neural Networks (IJCNN). IEEE, 2017.

[25] Drossos, Konstantinos, Samuel Lipping, and Tuomas Virtanen. "Clotho: An audio captioning dataset." ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020.

[26] Drossos, Konstantinos, Sharath Adavanne, and Tuomas Virtanen. "Automated audio captioning with recurrent neural networks." 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2017.

[27] Emmert-Streib, Frank, et al. "An introductory review of deep learning for prediction models with big data." Frontiers in Artificial Intelligence 3 (2020): 4.

[28] Farquhar, Sebastian, and Yarin Gal. "Towards robust evaluations of continual learning." arXiv preprint arXiv:1805.09733 (2018).

[29] Fawaz, Hassan Ismail, et al. "Deep learning for time series classification: a review." Data mining and knowledge discovery 33.4 (2019): 917-963.

[30] French, Robert M. "Catastrophic forgetting in connectionist networks." Trends in cognitive sciences 3.4 (1999): 128-135.

[31] Gemmeke, Jort F., et al. "Audio set: An ontology and human-labeled dataset for audio events." 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017.

[32] Goldberg, Y., and G. Hirst. Neural Network Methods in Natural Language Processing. Morgan & Claypool Publishers, San Rafael (2017).

[33] Gu, Jiuxiang, et al. "Recent advances in convolutional neural networks." Pattern Recognition 77 (2018): 354-377.

[34] Haykin, Simon. Neural networks - A comprehensive foundation, 2nd edition, Prentice-Hall (1999).

[35] Hershey, Shawn, et al. "CNN architectures for large-scale audio classification." 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017.

[36] Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the knowledge in a neural network." arXiv preprint arXiv:1503.02531 (2015).

[37] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural computation 9.8 (1997): 1735-1780.

[38] Howard, Andrew G., et al. "Mobilenets: Efficient convolutional neural networks for mobile vision applications." arXiv preprint arXiv:1704.04861 (2017).

[39] Hsu, Yen-Chang, et al. "Re-evaluating continual learning scenarios: A categorization and case for strong baselines." arXiv preprint arXiv:1810.12488 (2018).

[40] Janocha, Katarzyna, and Wojciech Marian Czarnecki. "On loss functions for deep neural networks in classification." arXiv preprint arXiv:1702.05659 (2017).

[41] Kemker, Ronald, and Christopher Kanan. "Fearnet: Brain-inspired model for incremental learning." arXiv preprint arXiv:1711.10563 (2017).

[42] Kim, Chris Dongjoo, et al. "AudioCaps: Generating captions for audios in the wild." Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019.

[43] Kirkpatrick, James, et al. "Overcoming catastrophic forgetting in neural networks." Proceedings of the national academy of sciences 114.13 (2017): 3521-3526.

[44] Krenker, Andrej, Janez Bešter, and Andrej Kos. "Introduction to the artificial neural networks." Artificial Neural Networks: Methodological Advances and Biomedical Applications. InTech (2011): 1-18.

[45] Kumar, Siddharth Krishna. "On weight initialization in deep neural networks." arXiv preprint arXiv:1704.08863 (2017).

[46] Lawrence, Steve, C. Lee Giles, and Ah Chung Tsoi. "Lessons in neural network training: Overfitting may be harder than expected." AAAI/IAAI. 1997.

[47] LeCun, Yann, et al. "Gradient-based learning applied to document recognition." Proceedings of the IEEE 86.11 (1998): 2278-2324.

[48] Li, Yuhong, Xiaofan Zhang, and Deming Chen. "Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.

[49] Li, Zhizhong, and Derek Hoiem. "Learning without forgetting." IEEE transactions on pattern analysis and machine intelligence 40.12 (2017): 2935-2947.

[50] Liu, Pengfei, Xipeng Qiu, and Xuanjing Huang. "Recurrent neural network for text classification with multi-task learning." arXiv preprint arXiv:1605.05101 (2016).

[51] Liu, Shiwei, et al. "Sparse evolutionary deep learning with over one million artificial neurons on commodity hardware." Neural Computing and Applications 33.7 (2021): 2589-2604.

[52] Liu, Siqi, et al. "Improved image captioning via policy gradient optimization of SPIDEr." Proceedings of the IEEE international conference on computer vision. 2017.

[53] Lomonaco, Vincenzo, and Davide Maltoni. "Core50: a new dataset and benchmark for continuous object recognition." Conference on Robot Learning. PMLR, 2017.

[54] Lomonaco, Vincenzo, et al. "Avalanche: an End-to-End Library for Continual Learning." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.

[55] Lopez-Paz, David, and Marc'Aurelio Ranzato. "Gradient episodic memory for continual learning." Advances in neural information processing systems 30 (2017): 6467-6476.

[56] Maltoni, Davide, and Vincenzo Lomonaco. "Continuous learning in single-incremental-task scenarios." Neural Networks 116 (2019): 56-73.

[57] McCloskey, Michael, and Neal J. Cohen. "Catastrophic interference in connectionist networks: The sequential learning problem." Psychology of learning and motivation. Vol. 24. Academic Press, 1989. 109-165.

[58] Brian McFee, Alexandros Metsai, Matt McVicar, Stefan Balke, Carl Thomé, Colin Raffel, Frank Zalkow, Ayoub Malek, Dana, Kyungyun Lee, Oriol Nieto, Dan Ellis, Jack Mason, Eric Battenberg, Scott Seyfarth, Ryuichi Yamamoto, viktorandreevichmorozov, Keunwoo Choi, Josh Moore, … Thassilo. (2021). librosa/librosa: 0.8.1rc2 (0.8.1rc2). Zenodo. https://doi.org/10.5281/zenodo.4792298

[59] Mishkin, Dmytro, and Jiri Matas. "All you need is a good init." arXiv preprint arXiv:1511.06422 (2015).

[60] Mundt, Martin, et al. "A wholistic view of continual learning with deep neural networks: Forgotten lessons and the bridge to active and open world learning." arXiv preprint arXiv:2009.01797 (2020).

[61] Nawi, Nazri Mohd, Walid Hasen Atomi, and Mohammad Zubair Rehman. "The effect of data pre-processing on optimized training of artificial neural networks." Procedia Technology 11 (2013): 32-39.

[62] Nwankpa, Chigozie, et al. "Activation functions: Comparison of trends in practice and research for deep learning." arXiv preprint arXiv:1811.03378 (2018).

[63] O'Shea, Keiron, and Ryan Nash. "An introduction to convolutional neural networks." arXiv preprint arXiv:1511.08458 (2015).

[64] Parisi, German I., et al. "Continual lifelong learning with neural networks: A review." Neural Networks 113 (2019): 54-71.

[65] Pascanu, Razvan, Tomas Mikolov, and Yoshua Bengio. "On the difficulty of training recurrent neural networks." International conference on machine learning. PMLR, 2013.

[66] Purwins, Hendrik, et al. "Deep learning for audio signal processing." IEEE Journal of Selected Topics in Signal Processing 13.2 (2019): 206-219.

[67] Rebuffi, Sylvestre-Alvise, et al. "icarl: Incremental classifier and representation learning." Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 2017.

[68] Rosenblatt, Frank. "The perceptron: a probabilistic model for information storage and organization in the brain." Psychological review 65.6 (1958): 386.

[69] Sak, Hasim, Andrew W. Senior, and Françoise Beaufays. "Long short-term memory recurrent neural network architectures for large scale acoustic modeling." (2014).

[70] Schmidhuber, Jürgen. "Deep learning in neural networks: An overview." Neural networks 61 (2015): 85-117.

[71] Shen, Jonathan, et al. "Natural tts synthesis by conditioning wavenet on mel spectrogram predictions." 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.

[72] Shin, Hanul, et al. "Continual learning with deep generative replay." arXiv preprint arXiv:1705.08690 (2017).

[73] Srivastava, Nitish, et al. "Dropout: a simple way to prevent neural networks from overfitting." The journal of machine learning research 15.1 (2014): 1929-1958.

[74] Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in neural information processing systems. 2014.

[75] Svozil, Daniel, Vladimir Kvasnicka, and Jiri Pospichal. "Introduction to multi-layer