

5.2.4 Regularization Method-3: Kernel Regularization

For kernel regularization we used L1 and L2 penalty coefficients of 0.001. Figure 15 shows the graphical results for the train and validation accuracies and categorical cross entropies over 20 epochs on the GPU. The mean test error for kernel regularization is 1.2, as shown in Table 5.
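As a concrete illustration, the sketch below shows how such a penalty can be attached to convolutional and dense layers in a Keras/TensorFlow setup. Only the penalty coefficients (0.001) correspond to our configuration; the layer sizes and overall architecture are placeholders rather than the exact model used in the experiments.

# Minimal sketch: combined L1/L2 (elastic-net) kernel regularization in Keras.
# The architecture is illustrative; only the penalty coefficients match our setup.
from tensorflow.keras import layers, models, regularizers

reg = regularizers.l1_l2(l1=0.001, l2=0.001)

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", padding="same",
                  kernel_regularizer=reg, input_shape=(32, 32, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu", kernel_regularizer=reg),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])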

Figure 16 shows a box plot comparing the percentage of test errors over 10-fold CV for the different regularization methods applied to dataset-2 (CIFAR-10) and for the baseline model. The box plot shows clearly that the test errors for the baseline model and for batch normalization are close, with little difference in model performance. However, the dropout and kernel regularization methods applied to the CIFAR-10 dataset show more variation in their results. It is evident from the box plot that the dropout method performed better than the other methods.

Figure 14. Train accuracy (top left), train CE (top right), validation accuracy (bottom left), validation CE (bottom right) for 10-fold cross validation for dropout

Figure 15. Train accuracy (top left), train CE (top right), validation accuracy (bottom left), validation CE (bottom right) for 10-fold cross validation for kernel regularization

Table 5. Results of the experiments for dataset-2 (CIFAR-10); the mean values of validation accuracy, validation cross entropy, and test error are taken over the 10 models created for each method with 10-fold cross validation

Method                          Validation accuracy    Validation cross entropy    Test error
Baseline (no regularization)    76%                    1.4                         1.5
CNN + batch normalization       78%                    1.15                        1.49
CNN + dropout                   82%                    0.58                        0.59
CNN + kernel regularization     78.5%                  1.18                        1.2

Figure 16. Comparison of test errors for different methods of regularization for dataset-2 (CIFAR-10)

The computation time taken by each method was also measured. Table 6 compares the running times of the different regularization methods for dataset-2. Kernel regularization took the longest time to compute, while batch normalization took the shortest. This is in accordance with the literature, which reports that the computation time for dropout is longer than for a normal standard neural network (Srivastava et al. 2014). One main reason for this time increase is that the parameter updates are very noisy.

Table 6. Comparison of training times for different methods of regularization over 25 epochs, with 10-fold CV, for dataset-2 (CIFAR-10)

Method                  Total training time
Baseline                1 h (18 s per epoch/fold)
Batch normalization     1 h 15 min (22.5 s per epoch/fold)
Dropout                 1 h 30 min (27 s per epoch/fold)
Kernel regularization   1 h 40 min (30 s per epoch/fold)

6 Discussion

Convolutional neural networks are excellent deep learning systems with a considerable number of parameters. Multiple nonlinear hidden layers make them efficient models that learn the complicated relationships between inputs and outputs (Goodfellow, Bengio, and Courville 2016). But if the training data is limited, these networks start to overfit: they do well on the training data but worse on the test data. So deep learning models are powerful in representing complex functions but difficult to train. Many techniques are used in deep learning to reduce the test error, possibly at the cost of increased training error; these techniques are termed regularization (Goodfellow, Bengio, and Courville 2016). There are many methods to reduce overfitting in convolutional neural networks, such as dropout (G.E. Hinton et al. 2012), dropconnect (Wan et al. 2013), dropall (Frazão and Alexandre 2014), curriculum dropout (Morerio et al. 2017), stochastic pooling (Zeiler 2012), batch normalization (Ioffe and Szegedy 2015), weight penalties (Tibshirani 1996), and cutout (DeVries and Taylor 2017), as explained in chapter 2.

In deep learning, regularization techniques are based on increased bias and reduced variance. In the case of overfitting, the variance dominates the bias, so the goal of regularization for a complex deep model is to minimize the generalization error so that the model fits practical deep learning scenarios. We have looked deeper into the existing methods for regularizing convolutional neural networks and observed that regularization is a crucial step towards the generalization of deep models. Regularization can be carried out in many ways. There are methods of regularization through the data, e.g. random dropout (Bouthillier et al. 2015), Bayesian dropout (Maeda 2014), and batch normalization (Ioffe and Szegedy 2015). Similarly, there are methods of regularization through the network architecture, like stochastic pooling (Zeiler and Fergus 2013), and through optimization, like fast dropout (Wang and JaJa 2013). Regularization is thus a vast subarea of research in the field of deep learning. With the rising popularity of convolutional neural networks, the need for better regularization methods is inevitable.

We have focused on regularization methods for convolutional neural networks, presented a picture of recently developed methods, and compared them. We also experimentally compared three regularization methods on the binary and categorical image datasets described in chapter 3. On both datasets we tested batch normalization, kernel regularization, and dropout. Dropout performed well, as indicated in the literature, which describes dropout as a strong regularization technique for deep learning; our experiments confirm the literature study. For model performance estimation and generalization of test errors, we cross-validated our data with 10-fold cross validation. We observed in our results great variation in the sensitivity of the data when different regularization techniques were applied. With kernel regularization, both datasets showed great variation in sensitivity.
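The evaluation protocol can be summarized by the sketch below. It assumes a scikit-learn/Keras setup, and build_model() is a hypothetical factory standing in for whichever regularized architecture is being evaluated, so the code illustrates the cross-validation loop rather than reproducing our exact scripts.

# Sketch of the 10-fold evaluation loop: train one fresh model per fold and
# average the validation metrics, as done for each regularization method.
import numpy as np
from sklearn.model_selection import KFold

def evaluate_10_fold(build_model, x, y, epochs=20, batch_size=64):
    # build_model() is assumed to return a compiled Keras model with an
    # accuracy metric, so evaluate() yields (loss, accuracy).
    kfold = KFold(n_splits=10, shuffle=True, random_state=42)
    accuracies, cross_entropies = [], []
    for train_idx, val_idx in kfold.split(x):
        model = build_model()  # fresh model for every fold
        model.fit(x[train_idx], y[train_idx],
                  epochs=epochs, batch_size=batch_size, verbose=0)
        loss, acc = model.evaluate(x[val_idx], y[val_idx], verbose=0)
        cross_entropies.append(loss)
        accuracies.append(acc)
    return np.mean(accuracies), np.mean(cross_entropies)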

We observed peaks in the curves, as shown in figure 10 (bottom left and bottom right), for kernel regularization on dataset-1, whereas for dataset-2 the curves are more stable, as shown in figure 15 (bottom left and bottom right). Model performance is measured in terms of validation accuracies and cross entropies, as discussed in chapter 5. The models with different methods are compared by plotting the test errors for each dataset, and the computation times for the different methods are measured. The results show that BN is the fastest in computation and results in a fast training process (Ioffe and Szegedy 2015), as shown in table 4 and table 6 in chapter 5.

We do not aim to improve on or compare against the state-of-the-art results on any of the data, but to study the impact of existing regularization methods on convolutional neural networks and to compare some common methods through their practical implementation on different image datasets. To the best of our knowledge, there is not much work done similar to ours. There is one comparison study of different regularization methods for ImageNet classification using a deep convolutional neural network (Smirnov, Timoshenko, and Andrianov 2014), which implemented dropout, dropconnect, and data augmentation for regularization. The results of that study show that dropout performs better than dropconnect, which is also in accordance with our results (that dropout performs well). However, in the field of regularization we have noticed that much of the work is devoted to developing new methods of regularization for deep learning techniques, as reported in chapter 2.

Batch normalization has proved to be a strong regularization method; however, it depends on the mini-batch size. We have observed that BN addresses the problem of covariate shift by normalizing the features with the mean and variance computed within a mini-batch, and it allows high learning rates. It is also found that applying batch normalization before the nonlinear activations is of key importance in order to accelerate training and achieve higher accuracies.
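The layer ordering referred to here can be expressed as follows. This is a minimal Keras sketch of a small illustrative network, not the exact architecture from chapter 4; the filter counts are placeholders.

# Sketch: BatchNormalization placed between the convolution and the
# nonlinear activation, the ordering we found to accelerate training.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), padding="same", use_bias=False,
                  input_shape=(32, 32, 3)),   # bias is redundant when BN follows
    layers.BatchNormalization(),              # normalize pre-activations per mini-batch
    layers.Activation("relu"),                # nonlinearity applied after BN
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])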

On the other hand, dropout is the most effective of all and produced the best results in our experimental setup. Although the results are close and there is no huge difference among the different methods of regularization, the achieved results provide empirical confirmation of the literature study, which shows that dropout is an effective and strong regularization method for convolutional neural networks and outperforms the other methods. These results could be improved further by fine-tuning the hyperparameters.
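For comparison, a corresponding dropout configuration might look like the sketch below; the dropout rates shown are common illustrative values, not the tuned rates from our experiments.

# Sketch: Dropout layers randomly deactivate a fraction of units during training only.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", padding="same",
                  input_shape=(32, 32, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),                 # drop 25% of the pooled feature-map activations
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),                  # heavier dropout before the classifier layer
    layers.Dense(10, activation="softmax"),
])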

7 Conclusion

On the basis of the literature study and the empirical work presented in this thesis, we can draw some conclusions and provide some practical recommendations along with future research directions.

From the literature study it is concluded that convolutional neural networks (CNNs) are a state-of-the-art deep learning technique, producing astonishing results in artificial intelligence (AI) research. Researchers in the field are now more inclined towards the development of regularization methods to enhance the performance of CNNs. For that reason, the study of existing regularization methods and the development of new ones has emerged as a key research focus in AI in recent years.

It is evident from the literature that dropout and batch normalization are two potent regularization techniques, which have been used effectively for regularizing convolutional neural networks in recent years. Plenty of new regularization methods have been developed from dropout and batch normalization, as discussed in detail in chapter 2 of the thesis. However, the experimental comparison of the above-mentioned methods concludes that dropout performs better than batch normalization. Our experimental results show that dropout (Geoffrey Hinton et al. 2012) is a powerful and effective method of regularization in convolutional neural networks compared to kernel regularization and batch normalization, as shown in figure 11 and figure 16 in chapter 5. The results presented in chapter 5 are in accordance with previous research, as the literature shows that dropout is the most effective way to reduce overfitting in neural networks (Srivastava et al. 2014) and (Wager, Wang, and Liang 2013). We have observed that dropout is not only computationally cheap but also does not limit the type of model. We experimented with both binary classification and categorical classification and saw that dropout worked better in both setups. We have also observed that regularization is needed for good model performance, especially when we use high-dimensional data with a large number of parameters; regularization helps in fine-tuning the model so that it performs well on unseen data and prevents the model from being biased.

The computation times for the different methods (dropout, batch normalization and kernel regularization), as given in table 4 and table 6 in chapter 5 of the thesis, clearly show that kernel regularization took the longest time to compute and batch normalization the shortest. In the case of dropout, the main drawback is the increased time consumption: a network with dropout takes 2-3 times longer to train than a normal standard neural network (Srivastava et al. 2014). One main reason for this increase is that the parameter updates are very noisy. The gradients being computed are not the gradients of the architecture that will be used at test time, so training takes a long time; if the noise is reduced, the training time can be reduced. Therefore, with high dropout, we can reduce overfitting at the cost of longer training time. We may conclude that BN is faster at training CNNs than dropout and kernel regularization.

Convolutional neural networks have gained noticeable popularity because of their ability to handle big and complex data (Krizhevsky and Hinton 2012), which is why most of the research focuses on methods and techniques to reduce the overfitting problem when training big models (Geoffrey Hinton et al. 2012). By reviewing the literature, it is observed that some work has been done on ImageNet classification to compare different regularization methods (dropout, dropconnect, advanced data augmentation) (Smirnov, Timoshenko, and Andrianov 2014), which also showed that among the three methods investigated, dropout performed best. The CIFAR-10 dataset has been used to study weight decay regularization in Adam (Loshchilov and Hutter 2017). Other studies have investigated the improvement of pooling methods for regularized convolutional neural networks (Zhang 2017) and dropout training for convolutional neural networks (Wu and Gu 2015). This encourages our work and provides a direction for future work. As not many comparison studies have been done among different methods of regularization, for future work it would be interesting to compare methods like dropconnect (Wan et al. 2013), drop part (Tomczak 2013), dropall (Frazão and Alexandre 2014), Shakeout (Kang, Li, and Tao 2016), and Shake-Shake regularization (Gastaldi 2017) with the dropout method.

Practical recommendation: We ran the experiments on two different datasets, one for binary classification and the other for categorical classification. We used both CPU and GPU for our work and found that working with the GPU saves much more time than working with the CPU. After spending many days on the cats vs dogs experiments on CPU, we decided to save time and ran the second set of experiments, on the CIFAR-10 dataset, on GPU. So, for deep learning research, the graphical processing unit (GPU) is the wiser choice for running computations because of its power, memory, and speed.

Bibliography

Bouthillier, X., K. Konda, P. Vincent, and R. Memisevic. 2015. “Dropout as data augmentation”. ArXiv e-prints. arXiv:1506.08700 [stat.ML].

Cireşan, D., U. Meier, and J. Schmidhuber. 2012. “Multi-column Deep Neural Networks for Image Classification”. ArXiv e-prints. arXiv:1202.2745 [cs.CV].

Collobert, R., J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. 2011. “Natural Language Processing (almost) from Scratch”. ArXiv e-prints. arXiv:1103.0398 [cs.LG].

Danaee, Padideh, Reza Ghaeini, and David Hendrix. 2017. “A Deep Learning Approach for Cancer Detection and Relevant Gene Identification”. Pacific Symposium on Biocomputing 22:219–229.

DeVries, T., and G. W. Taylor. 2017. “Improved Regularization of Convolutional Neural Networks with Cutout”. ArXiv e-prints. arXiv:1708.04552 [cs.CV].

Du, Juan. 2018. “Understanding of Object Detection Based on CNN Family and YOLO”. Journal of Physics: Conference Series.

Frazão, X. F., and L. A. Alexandre. 2014. “DropAll: Generalization of Two Convolutional Neural Network Regularization Methods”. In International Conference on Image Analysis and Recognition, volume LNCS 8814, 282–289.

Gal, Y., and Z. Ghahramani. 2016. “Dropout as a Bayesian approximation: Representing model uncertainty in deep learning”. In Proceedings of the International Conference on Machine Learning (ICML) 48:1050–1059.

Gastaldi, X. 2017. “Shake-Shake regularization”. ArXiv e-prints. arXiv:1705.07485 [cs.LG].

Gitman, I., and B. Ginsburg. 2017. “Comparison of batch and weight normalization algorithms for large-scale image classification”. ArXiv e-prints. arXiv:1709.08145.

Goodfellow, I. J., Y. Bengio, and A. Courville. 2016. “Deep Learning”. MIT Press.

Graham, B. 2014. “Fractional Max-Pooling”. CoRR.

He, K., X. Zhang, S. Ren, and J. Sun. 2015. “Deep Residual Learning for Image Recognition”. ArXiv e-prints. arXiv:1512.03385 [cs.CV].

Hinton, G. E., N. Srivastava, A. Krizhevsky, and I. Sutskever. 2012. “Improving Neural Networks by preventing co-adaptation of feature detectors”. CoRR. arXiv:1207.0580.

Hinton, Geoffrey, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, et al. 2012. “Deep Neural Networks for Acoustic Modeling in Speech Recognition”. Signal Processing Magazine.

Huang, L., X. Liu, B. Lang, A. W. Yu, W. Wang, and B. Li. 2017. “Orthogonal weight normalization: Solution to optimization over multiple dependent Stiefel manifolds in deep neural networks”. CoRR. arXiv:1709.06079.

Ioffe, S., and C. Szegedy. 2015. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”. CoRR: 448–456. arXiv:1502.03167.

Jaitly, N., and G. E. Hinton. 2013. “Vocal tract length perturbation (VTLP) improves speech recognition”.

Jiang, Guo-Qing, Jing Xu, and Jun Wei. 2018. “A deep learning algorithm of neural network for the parameterization of Typhoon-Ocean-Feedback in Typhoon Forecast Models”. Geophysical Research Letters. Advancing Earth and Space Science.

Kamilaris, A., X. Francesc, and P. Boldu. 2018. “Deep learning in agriculture: A survey”. Computers and Electronics in Agriculture 147:70–90.

Kang, G., J. Li, and D. Tao. 2016. “Shakeout: A new regularized deep neural network training scheme”. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16): 1751–1757.

Kriesel, D. 2017. “A brief introduction to neural networks”. Edition zeta2.

Krizhevsky, A., and G. E. Hinton. 2012. “Learning Multiple layers of features from tiny images”, 1.

Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. 2012. “ImageNet Classification with Deep Convolutional Neural Networks”. In Advances in Neural Information Processing Systems 25, edited by F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, 1097–1105. Curran Associates, Inc. http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.

Laarhoven, T. V. 2017. “L2 regularization versus batch and weight normalization”. ArXiv e-prints. arXiv:1706.05350.

Lawrence, S., C. L. Giles, A. C. Tsoi, and A. D. Back. 1997. “Face Recognition: A Convolutional Neural-Network Approach”. IEEE Transactions on Neural Networks 8 (1): 98–113.

LeCun, Y., L. Bottou, Y. Bengio, and P. Haffner. 1998. “Gradient-Based Learning Applied to Document Recognition”. Proceedings of the IEEE 86:2278–2324.

Lei Ba, J., J. R. Kiros, and G. E. Hinton. 2016. “Layer Normalization”. ArXiv e-prints. arXiv:1607.06450 [stat.ML].

Liao, Z., and G. Carneiro. 2015. “On the Importance of Normalisation Layers in Deep Learning with Piecewise Linear Activation Units”. ArXiv e-prints. arXiv:1508.00330 [cs.CV].

Liu, B., Y. Liu, and K. Zhou. 2014. “Image classification for dogs and cats”.

Loshchilov, I., and F. Hutter. 2017. “Fixing weight decay regularization in Adam”. CoRR. arXiv:1711.05101.

Lu, D., and Q. Weng. 2005. “Survey of image classification methods and techniques for improving image classification performance”. International Journal of Remote Sensing 28:823–870.

Ma, X., Z. Dai, Z. He, J. Na, Y. Wang, and Y. Wang. 2017. “Learning Traffic as Images: A Deep Convolutional Neural Network for Large-Scale Transportation Network Speed Prediction”. ArXiv e-prints. arXiv:1701.04245 [cs.LG].

Maeda, S.-i. 2014. “A Bayesian encourages dropout”. ArXiv e-prints. arXiv:1412.7003 [cs.LG].

Marsolek, C. J., and E. D. Burgund. 1997. “Cerebral Asymmetries in sensory and perceptual processing”. Advances in Psychology, 1st Edition.

Min, S., B. Lee, and S. Yoon. 2017. “Deep learning for bioinformatics”. PMID: 27473064.

Morerio, P., J. Cavazza, R. Volpi, R. Vidal, and V. Murino. 2017. “Curriculum Dropout”. CoRR abs/1703.06229. arXiv:1703.06229. http://arxiv.org/abs/1703.06229.

Peng, M., C. Wang, T. Chen, and G. Liu. 2016. “NIRFaceNet: A Convolutional Neural Network for Near-Infrared Face Identification”. Information.

Pierson, H. A., and M. S. Gashler. 2017. “Deep learning in robotics: A review of recent research”. Advanced Robotics 31:821–835.

Polson, N., and V. Sokolov. 2017. “Deep learning for short term traffic flow prediction”. ArXiv e-prints. arXiv:1604.04527v3.

Redmon, J., S. K. Divvala, R. B. Girshick, and A. Farhadi. 2015. “You Only Look Once: Unified, Real-Time Object Detection”. CoRR.

Rosenblatt, F. 1958. “The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain”. Psychological Review 65:386–408.

Salimans, T., and D. P. Kingma. 2016. “Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks”. ArXiv e-prints. arXiv:1602.07868 [cs.LG].

Santurkar, S., D. Tsipras, A. Ilyas, and A. Madry. 2018. “How Does Batch Normalization Help Optimization? (No, It Is Not About Internal Covariate Shift)”. ArXiv e-prints. arXiv:1805.11604.

Schmidhuber, J. 2015. “Deep learning in neural networks: An Overview”. Neural Networks 61:85–117.

Schuurmans, D., and M. Zinkevich. 2016. “Deep learning games”. NIPS.

Sietsma, J., and R. Dow. 1991. “Creating artificial neural networks that generalize”. Neural Networks 4 (1): 67–79.

Simonyan, K., and A. Zisserman. 2015. “Very deep convolutional networks for large-scale image recognition”. ICLR. arXiv:1409.1556.

Smirnov, E. A., D. M. Timoshenko, and S. N. Andrianov. 2014. “Comparison of regularization methods for ImageNet classification with deep convolutional neural networks”. AASRI Procedia: 89–94.

Srivastava, N., G. E. Hinton, A. Krizhevsky, and I. Sutskever. 2014. “Dropout: A simple way to prevent neural networks from overfitting”. Journal of Machine Learning Research 15:1929–1958.

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2014. “Going Deeper with Convolutions”. CoRR abs/1409.4842.

Tibshirani, R. 1996. “Regression shrinkage and selection via the lasso”. Journal of the Royal Statistical Society, Series B (Methodological): 267–288.

Tomczak, J. M. 2013. “Prediction of breast cancer recurrence using Classification Restricted Boltzmann Machine with Dropping”. arXiv:1308.6324.

Tripathy, N., and A. Jadeja. 2015. “Stochastically reducing overfitting in deep neural networks using dropout”. International Journal of Innovative Science, Engineering and Technology (IJISET) 2 (5).

Wager, S., S. Wang, and P. S. Liang. 2013. “Dropout training as adaptive regularization”. In Advances in Neural Information Processing Systems (NIPS): 351–359.

Wan, L., M. Zeiler, S. Zhang, Y. LeCun, and R. Fergus. 2013. “Regularization of neural networks using DropConnect”. In Proceedings of the 30th International Conference on Machine Learning (ICML) 28:1058–1066.

Wang, Q., and J. JaJa. 2013. “From maxout to Channel-Out: Encoding information on sparse pathways”. CoRR. arXiv:1312.1909.

Wang, Sida, and Christopher D. Manning. 2012. “Baselines and Bigrams: Simple, Good Sentiment and Topic Classification”. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2, 90–94. ACL ’12. Jeju Island, Korea: Association for Computational Linguistics. http://dl.acm.org/citation.cfm?id=2390665.2390688.

Wang, Xiaoguang, Xuan Liu, Stan Matwin, Nathalie Japkowicz, and Hongyu Guo. 2014. “A multi-view two-level classification method for generalized multi-instance problems”. In BigData Conference, 104–111. IEEE.

Warburton, K. 2003. “Deep learning and education for sustainability”. International Journal of Sustainability in Higher Education: 44–56.

Wen, W., C. Wu, W. Wang, Y. Chen, and H. Li. 2016. “Learning structured sparsity in deep neural networks”. In Advances in Neural Information Processing Systems (NIPS): 2074–2082.

Wu, G., D. Shen, and M. Sabuncu. 2016. “Machine learning and medical imaging”. 1st edition. Academic Press.

Wu, H., and X. Gu. 2015. “Towards dropout training for convolutional neural network”. Neural Networks.

Wu, Y., and K. He. 2018. “Group Normalization”. CoRR. arXiv:1803.08494.

Yu, K., W. Xu, and Y. Gong. 2009. “Deep learning with kernel regularization for visual recognition”. In Advances in Neural Information Processing Systems (NIPS) 21:1889–1896.

Zeiler, M. D. 2012. “ADADELTA: An Adaptive Learning Rate Method”. arXiv:1212.5701.