
Khaula Zeeshan

The Impact of Regularization on Convolutional Neural Networks

Web Intelligence and Service Engineering in Information Technology

July 17, 2018

University of Jyväskylä


Author: Khaula Zeeshan

Contact information: khaula.k.zeeshan@student.jyu.fi

Supervisors: Pekka Neittaanmäki, Sami Äyrämö

Title: The Impact of Regularization on Convolutional Neural Networks

Työn nimi: Syväoppiminen kuva-analyysiin ja regularisoinnin vaikutus konvolutiivisissa neuroverkoissa

Project: Web Intelligence and Service Engineering
Study line: Web Intelligence and Service Engineering
Page count: 63+0

Abstract: Deep learning has become the most popular class of machine learning methods in recent times. Convolutional neural networks are among the most popular deep learning architectures for solving complicated and sophisticated problems such as image classification, image recognition, and image detection. However, deep learning techniques face overfitting, which hinders model performance. Since convolutional neural networks dominate the field of computer vision, new regularization techniques that reduce overfitting in convolutional neural networks are indispensable. This thesis provides an overview of recently developed regularization methods, particularly for convolutional neural networks and more generally for other deep learning techniques. It also compares the most commonly used regularization methods (dropout, batch normalization, and kernel regularization) by training convolutional neural networks for image classification on two image datasets (CIFAR-10 and Kaggle's Cats vs Dogs). Each model is validated with 10-fold cross validation. The empirical results confirm that dropout is a stronger regularization technique than the other two methods (batch normalization and L1/L2 regularization) on both datasets.

Keywords: Artificial intelligence, machine learning, deep learning, convolutional neural network, image classification, regularization, k-fold cross validation, dropout, batch normalization, kernel regularization


Suomenkielinen tiivistelmä: Syvä oppiminen (engl. deep learning) on viime aikoina tullut suosituimmaksi koneoppimisen menetelmäksi. Konvoluutio(hermo)verkko on yksi suosituimmista syvän oppimisen arkkitehtuureista monimutkaisiin ongelmiin kuten kuvien luokitteluun, tunnistukseen ja havaitsemiseen. Syvän oppimisen menetelmien toimivuutta haittaa kuitenkin ylisovittumisongelma. Koska konvoluutioverkot ovat konenäössä tehokkaita, täytyy niiden ylisovittumisen välttämiseksi kehittää uusia menetelmiä. Tämä tutkielma tarjoaa katsauksen lähiaikoina kehitettyihin regularisointimenetelmiin konvoluutioverkkojen ja muiden syvän oppimisen menetelmien tarpeisiin. Tutkielmassa verrataan yleisimmin käytettyjä regularisointimenetelmiä (dropout, batch normalization sekä kernel-regularisointi) kouluttamalla konvoluutioverkko kuvien luokitteluun kahdelle aineistolle (CIFAR-10 ja Kagglen kissa/koira-aineisto). Mallit validoidaan 10-ositetulla ristiinvalidoinnilla. Empiiriset tulokset varmistavat, että dropout-menettely on muihin kokeiltuihin verrattuna vahva tekniikka molempien aineistojen kohdalla.

Avainsanat: Tekoäly, koneoppiminen, syvä oppiminen, konvoluutioneuroverkko, kuvaluokittelu, regularisointi, k-kertainen ristiinvalidointi, dropout, batch-normalisointi, kernel-regularisointi


Glossary

AI Artificial Intelligence

AUC Area under the curve

ANN Artificial neural network

BN Batch Normalization

CNN Convolutional neural network

CAD Computer aided diagnostic

CV Cross Validation

CE Cross Entropy

CPU Central Processing Unit

DL Deep learning

GPU Graphical Processing Unit

GN Group Normalization

ICS Internal Covariate Shift

LN Layer normalization

ML Machine learning

NLP Natural Language Processing

PCA Principal component analysis

ReLU Rectifier linear unit

WN Weight normalization


List of Figures

Figure 1. (a) A neural net without dropout, (b) a neural net with dropout (Srivastava et al. 2014)
Figure 2. Comparison of dropout and DropConnect (Wan et al. 2013)
Figure 3. Graphical representation of early stopping
Figure 4. Deep neural network with three hidden layers, input and output layers (Peng et al. 2016)
Figure 5. Sample images from Cats Vs Dogs (left) and CIFAR-10 (right)
Figure 6. Test environment at the University of Jyväskylä
Figure 7. Train accuracy (top left), train CE (top right), validation accuracy (bottom left), validation CE (bottom right) for 10-fold cross validation for the baseline
Figure 8. Train accuracy (top left), train CE (top right), validation accuracy (bottom left), validation CE (bottom right) for 10-fold cross validation for BN
Figure 9. Train accuracy (top left), train CE (top right), validation accuracy (bottom left), validation CE (bottom right) for 10-fold cross validation for dropout
Figure 10. Train accuracy (top left), train CE (top right), validation accuracy (bottom left), validation CE (bottom right) for 10-fold cross validation for kernel regularization
Figure 11. Comparison of test errors of different regularization methods over 10-fold CV for Dataset-1 (Cats Vs Dogs)
Figure 12. Train accuracy (top left), train CE (top right), validation accuracy (bottom left), validation CE (bottom right) for 10-fold cross validation for the baseline
Figure 13. Train accuracy (top left), train CE (top right), validation accuracy (bottom left), validation CE (bottom right) for 10-fold cross validation for BN
Figure 14. Train accuracy (top left), train CE (top right), validation accuracy (bottom left), validation CE (bottom right) for 10-fold cross validation for dropout
Figure 15. Train accuracy (top left), train CE (top right), validation accuracy (bottom left), validation CE (bottom right) for 10-fold cross validation for kernel regularization
Figure 16. Comparison of test errors for different regularization methods for Dataset-2 (CIFAR-10)

List of Tables

Table 1. Model description for Dataset-1 (Cats Vs Dogs)
Table 2. Model description for Dataset-2 (CIFAR-10)
Table 3. Results of the experiments for Dataset-1 (Cats Vs Dogs): mean values of validation accuracy, validation cross entropy and test error over the 10 models per method created with 10-fold cross validation
Table 4. Comparison of training times for the different regularization methods over 25 epochs with 10-fold CV for Dataset-1 (Cats Vs Dogs)
Table 5. Results of the experiments for Dataset-2 (CIFAR-10): mean values of validation accuracy, validation cross entropy and test error over the 10 models per method created with 10-fold cross validation
Table 6. Comparison of training times for the different regularization methods over 25 epochs with 10-fold CV for Dataset-2 (CIFAR-10)


Contents

1 INTRODUCTION
  1.1 Research Questions
  1.2 Scope of the Thesis
  1.3 Thesis Layout
2 LITERATURE REVIEW
3 THEORY
  3.1 Image Classification
  3.2 General introduction to deep learning
  3.3 Artificial Neural Networks
    3.3.1 Activation Function
    3.3.2 Cross Entropy
  3.4 Model Performance Evaluation
    3.4.1 K-fold Cross Validation
    3.4.2 Accuracy
  3.5 Convolutional Neural Networks (CNNs)
  3.6 Overfitting and Underfitting
  3.7 Regularization
    3.7.1 Batch Normalization
    3.7.2 Dropout
    3.7.3 Kernel Regularization
4 EXPERIMENTAL SETUP: MATERIALS AND METHODS
  4.1 Datasets
    4.1.1 Dataset-1: Cats Vs Dogs
    4.1.2 Dataset-2: CIFAR-10
  4.2 Model: Convolutional Neural Network (CNN)
  4.3 Test Environment
5 EXPERIMENTAL RESULTS
  5.1 Results for Dataset-1 (Cats Vs Dogs)
    5.1.1 Baseline Experiment with No Regularization
    5.1.2 Regularization Method-1: Batch Normalization
    5.1.3 Regularization Method-2: Dropout
    5.1.4 Regularization Method-3: Kernel Regularization
  5.2 Results for Dataset-2 (CIFAR-10)
    5.2.1 Baseline Experiment with No Regularization
    5.2.2 Regularization Method-1: Batch Normalization
    5.2.3 Regularization Method-2: Dropout
    5.2.4 Regularization Method-3: Kernel Regularization
6 DISCUSSION
7 CONCLUSION
BIBLIOGRAPHY


1 Introduction

This chapter presents a short introduction to the topic, the research questions, and the scope and structure of the thesis.

The human eagerness to build autonomous systems is not a new storyline from a science fiction series. For many decades, scientists and researchers have worked day and night to astonish the world by developing artificially intelligent, autonomous, self-driving and self-organizing systems. Ever since John McCarthy coined the term artificial intelligence in 1955, the human thirst to develop systems that can mimic human cognition has grown day by day. As a result, the world has intelligent machines that can play games like chess and Go, win quizzes as Watson did in Jeopardy, chat and assist in health, business, and marketing, perform surgeries, and drive cars. These systems now aim for human-level intelligence; whether that happens in the near or far future nobody knows, but as McCarthy famously said:

"As soon as it works, no one calls it AI anymore.”

-John McCarthy

Artificially intelligent computing techniques (ML/DL) are producing astonishing results in solving many complicated problems in different fields of our daily life. Recent research has shown that deep learning techniques have made pivotal contributions to tasks such as computer vision (Krizhevsky and Hinton 2012), speech recognition (G. E. Hinton et al. 2012), object recognition (Redmon et al. 2015) and natural language processing (Collobert et al. 2011). The key to the success of deep neural networks lies in the availability of large labeled datasets and in advances in fast Graphical Processing Units (GPUs).

The ability to extract features automatically and to model high-level abstractions in a variety of signals such as text, sound and images has given deep learning a unique place in AI research. Sophisticated tasks in computer vision and natural language processing (NLP) are notoriously hard to solve with conventional machine learning techniques. This technological pitfall has been successfully overcome by state-of-the-art deep learning (DL) tools and techniques.


Deep learning techniques and architectures such as convolutional neural networks (CNNs) have emerged as a new paradigm for solving strenuous problems in computer vision, speech recognition and natural language processing. Convolutional neural networks (CNNs or ConvNets) are among the state-of-the-art deep learning techniques and have produced great results in domains such as image classification (Krizhevsky and Hinton 2012; Zeiler and Fergus 2013; Cireşan, Meier, and Schmidhuber 2012), speech recognition (Zhang et al. 2018), face detection (Lawrence et al. 1997), object detection (Du 2018), and bioacoustics (Smirnov, Timoshenko, and Andrianov 2014). Hence, convolutional neural networks have gained a lot of traction in the artificial intelligence (AI) research community in recent years.

For deep neural networks like CNNs, depth is essential for learning internal representations of the input data, but at the same time large neural networks suffer from overfitting. The development of new, effective regularization techniques is therefore an inevitable need. Many regularization techniques have been proposed, such as dropout (Geoffrey Hinton et al. 2012; Srivastava et al. 2014), DropConnect (Wan et al. 2013), Standout (Lei Ba, Kiros, and Hinton 2016), and batch normalization (Ioffe and Szegedy 2015), to name only a few developed in recent years. The study and development of new regularization methods plays a key role in improving existing deep learning systems.

1.1 Research Questions

This thesis seeks answers to the following research questions:

• What are the state-of-the-art regularization techniques to address the problem of overfitting in convolutional neural networks?

• What is the impact of some commonly used regularization techniques (dropout, batch normalization and kernel regularization) on convolutional neural networks, and which method performs best?

• Which method is computationally faster and which one slower on the same dataset?


1.2 Scope of the Thesis

Improving on or comparing against state-of-the-art results on any of the datasets is beyond the scope of this thesis. In this work, we provide a detailed review of the regularization techniques developed in recent years for improving the performance of convolutional neural networks. It has become evident that developing effective regularization methods is an inevitable need for enhancing the performance of convolutional neural networks, and this thesis sheds light on that core issue. We also aim to compare experimentally some of the most frequently used regularization methods for convolutional neural networks. The thesis provides a literature review, followed by an introduction to the theory needed to understand convolutional neural networks and regularization. The experimental part compares three regularization methods (dropout, batch normalization, and kernel regularization). We chose dropout (Geoffrey Hinton et al. 2012) for our CNN model for two reasons: first, dropout is considered the most effective and commonly used regularization technique for deep neural networks, and second, it is easy to implement in convolutional neural networks (CNNs). The second method applied to the CNN model is batch normalization (Ioffe and Szegedy 2015), which is well suited to convolutional neural networks; it has emerged as another strong and effective regularization method and has become a vital part of deep learning frameworks in many computer vision tasks. The third regularization method applied to the CNNs for comparison is kernel regularization (L1 and L2), which plays an important role in regularizing deep neural networks. The convolutional neural networks are trained on two datasets (CIFAR-10 and Kaggle's Cats vs Dogs) for image classification, and the three regularization methods (dropout, batch normalization and kernel regularization) are applied to them.

The training and validation accuracies, training and validation cross entropies, and mean test errors are calculated for each model, and each model is validated with 10-fold cross validation.

Finally, the test errors of the three regularization methods are compared to determine the best-performing method in our experimental setup.


1.3 Thesis Layout

The thesis is structured as follows. This chapter gives a short introduction to the topic, explains the scope of the thesis, presents the research questions and outlines the structure of the thesis. Chapter 2 is dedicated to prior work on the development of different regularization methods for convolutional neural networks (CNNs); this literature review helps in understanding the existing state-of-the-art regularization methods, particularly for CNNs and more generally for other deep learning techniques, and compares newly developed methods with earlier ones. Chapter 3 presents the basic theory and concepts required to understand the thesis work and explains the choice of methods used in the experiments. Chapter 4 describes the datasets, methods, experimental environment, and framework used for building the CNNs. The results of the experiments are presented in Chapter 5, and Chapter 6 discusses them. Finally, the conclusions drawn from the thesis work are presented in Chapter 7.


2 Literature Review

This chapter reviews prior work on regularization in deep learning generally and in convolutional neural networks particularly. On the one hand, the review provides a detailed picture of the state-of-the-art regularization techniques applied to convolutional neural networks; on the other hand, it reports the latest regularization methods developed to address overfitting in deep learning. Newly developed regularization methods are also compared with older ones.

Convolutional neural networks (CNNs) are at the core of most state-of-the-art computer vision solutions for a wide variety of tasks. Since 2014, deep convolutional networks have become mainstream, producing substantial gains in various benchmarks. CNNs are the deep learning technique best suited to vision-based tasks because of their architecture, their automatic feature extraction, and their local connectivity (each neuron in a CNN is connected to a small subset of the input), which reduces the number of parameters in the network and the computational cost of training. Despite these strengths, one main problem of CNNs is overfitting (Tripathy and Jadeja 2015). Overfitting occurs when the model does not generalize well from the training data to unseen data. Regularization is one of the key ingredients of deep learning (Goodfellow, Bengio, and Courville 2016) that allows the model to generalize well to unseen data.

Dropout (Geoffrey Hinton et al. 2012) is one of the most popular regularization methods used in convolutional neural networks. Dropout addresses overfitting by randomly dropping units, together with their connections, from the neural network during training. This dropping out minimizes co-adaptation and forces every neuron to be able to work independently. As a result, the network does not depend on any small set of neurons (Srivastava et al. 2014). Dropout was a key ingredient of the systems that won learning competitions such as ImageNet classification (Krizhevsky, Sutskever, and Hinton 2012) and the Merck molecular activity challenge at www.kaggle.com, where it outperformed other methods with great accuracy. Dropout in (Krizhevsky, Sutskever, and Hinton 2012) and (Simonyan and Zisserman 2015) produced state-of-the-art results. Some variants of dropout have also been proposed that offer improved empirical results and theoretical motivation. Fast dropout (Wang and Manning 2012) was proposed to train a neural network with dropout without actually sampling, thereby using all the data efficiently, effectively dropping out the final hidden layer of the network. Fast dropout yielded improved empirical results for deep neural networks on several datasets. Figure 1 shows a neural net without and with dropout.

Figure 1. (a) A neural net without dropout, (b) a neural net with dropout (Srivastava et al. 2014)

DropConnect (Wan et al. 2013) is a generalization of dropout that helps regularize large neural network models. With DropConnect, a fully connected layer becomes a sparsely connected layer whose connections are chosen randomly during training. DropConnect produced outstanding results on a variety of standard benchmarks. One notable piece of research, which proposed a deep convolutional neural network architecture called Inception, used DropConnect and produced great results (Szegedy et al. 2014). DropPart (Tomczak 2013) is a generalization of DropConnect in which a Beta distribution is used instead of a Bernoulli distribution. The method was proposed in a study on predicting breast cancer recurrence using a classification restricted Boltzmann machine (classRBM). The research used a real-life dataset of 949 breast cancer cases; although the study was carried out with a classRBM, the author claims the method can be applied to deep neural networks. A comparison of dropout and DropConnect presented in (Wan et al. 2013) is shown in Figure 2.


Figure 2. Comparison of dropout and dropconnect (Wan et al. 2013)

Standout (Lei Ba, Kiros, and Hinton 2016), proposed as an adaptive dropout network, yielded better results than other feature learning methods, including denoising auto-encoders and standard dropout, when evaluated on the MNIST and NORB datasets. DropAll (Frazão and Alexandre 2014) is a generalization of Dropout and DropConnect for regularizing the fully connected layers of convolutional neural networks. DropAll has the properties of both Dropout and DropConnect: it can drop a randomly selected subset of activations or a randomly selected subset of weights, so both methods can be performed. DropAll improved the classification errors of networks trained with Dropout and DropConnect on common image classification datasets. Dropout has also been studied from a Bayesian standpoint as Bayesian dropout (Maeda 2014), and that research reveals that the Bayesian interpretation enables the dropout rate to be optimized.

Random dropout (Bouthillier et al. 2015) interprets dropout as prior-knowledge-free data augmentation and describes a procedure for generating samples by back-projecting the dropout noise into the input space. The research was carried out by training feed-forward neural networks on the MNIST and CIFAR-10 datasets and produced improved dropout results without significant additional computational cost. Another work (Gal and Ghahramani 2016) proposed tools to model uncertainty with dropout and developed a framework that casts dropout training in deep neural networks as approximate Bayesian inference in deep Gaussian processes.


Curriculum dropout (Morerio et al. 2017) is a time scheduling of the probability of retaining neurons in the network and offers an adaptive regularization scheme that increases the difficulty of the optimization problem smoothly. The idea originates from curriculum learning, which means starting easy and then adaptively increasing the difficulty of the learning problem. Curriculum dropout has provided smoother initialization and better weight optimization than standard dropout when training convolutional neural networks (CNNs) and multi-layer perceptrons (MLPs) for image classification on different datasets.

Channel-out (Wang and JaJa 2013) is a network architecture that addresses the problem of a high level of inference and poor utilization of network capacity. Dropout encodes all patterns into every network capacity bin, resulting in full use of the network capacity but a high level of inference, while sparse-pathway regularization methods result in the least inference but waste network capacity. Channel-out combines dropout with sparse pathway encoding to utilize the full network capacity and avoid inference at the same time. The study concluded that sparse pathway encoding would be effective for designing robust deep networks.

Stochastic pooling (Zeiler and Fergus 2013) is a novel and effective method for regularizing large convolutional neural networks (CNNs). In this method, conventional pooling is replaced by a stochastic procedure in which the activation within each pooling region is picked randomly according to a multinomial distribution. Stochastic pooling was tested on a variety of benchmark image datasets (MNIST, CIFAR-10, CIFAR-100, Street View House Numbers) and shown to be not only effective but also free of hyperparameter tuning. The technique can be combined with other regularization techniques such as dropout, weight decay, and data augmentation with negligible computational overhead.

Batch normalization (Batch Norm or BN) (Ioffe and Szegedy 2015) has emerged as a very effective regularization method in deep learning. BN addresses the problem of covariate shift by normalizing the features with the mean and variance computed within a mini-batch; it allows high learning rates and is not very sensitive to initialization. This milestone technique achieved the same accuracy with fewer training steps when applied to a state-of-the-art image classification model (Ioffe and Szegedy 2015). A recent study of batch normalization (Santurkar et al. 2018) unveils its real impact on training deep neural networks beyond internal covariate shift (ICS): batch normalization reparametrizes the optimization problem to make it more stable and smooth. By smoothing the optimization landscape, BN makes the gradients more reliable and predictive, makes training significantly faster, and makes it less sensitive to hyperparameter choices (Santurkar et al. 2018). Batch normalization plays a vital role in computer vision tasks and has been adopted by all major deep learning frameworks (He et al. 2015). In "Rethinking the Inception Architecture for Computer Vision" (Szegedy et al. 2014), very deep convolutional neural networks were trained with batch normalization and benchmarked on the ILSVRC 2012 classification challenge.

The significance of batch normalization is highlighted in (Liao and Carneiro 2015), which shows that normalizing layers in deep networks with piecewise linear activation functions such as ReLU, leaky ReLU and parametric ReLU is of great importance. In their experiments on the MNIST, CIFAR-10, CIFAR-100 and SVHN datasets, they found that applying batch normalization before the nonlinear activations is key to accelerating training and achieving higher accuracies. Batch normalization is, however, not well suited to recurrent models such as LSTMs or to reinforcement learning and generative models.

Layer normalization (LN) (Lei Ba, Kiros, and Hinton 2016) is an advanced form of BN. LN addresses the computation time of training state-of-the-art deep neural networks, which are computationally expensive, and is very effective at stabilizing the hidden-state dynamics of recurrent networks. Batch normalization reduces training time in feed-forward neural networks, but because of its dependency on the mini-batch size it is not clear how to apply it to recurrent neural networks. Unlike batch normalization, layer normalization substantially reduces the training time by computing the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case.

Weight normalization (WN) (Salimans and Kingma 2016) is a reparametrization of the weight vectors in a neural network that decouples the length of the weight vectors from their direction. Weight normalization is inspired by batch normalization but does not introduce a dependency on the mini-batch. The method has an advantage over batch normalization because of its lower computational overhead, which permits more optimization steps in the same amount of time and speeds up the convergence of stochastic gradient descent. Unlike batch normalization, weight normalization is well suited to recurrent models such as LSTMs and to reinforcement learning and generative models. A comparison of batch normalization and weight normalization on a large-scale image classification problem (ResNet-50 on ImageNet) shows that batch normalization has a much stronger and more stable regularization effect than weight normalization (WN); weight normalization is thus limited to shallower networks and cannot replace batch normalization for deep neural networks (Gitman and Ginsburg 2017).

Orthogonal weight normalization (Huang et al. 2017) generalizes square orthogonal matrices to orthogonal rectangular matrices. The technique is applicable to both recurrent neural networks (RNNs) and feed-forward neural networks (FFNs) and solves the problem of optimization over multiple dependent Stiefel manifolds (OMDSM). Orthogonal weight normalization has improved the performance of state-of-the-art networks, including residual networks and Inception, on the ImageNet and CIFAR datasets.

Group normalization (GN) (Wu and He 2018) is a simple and effective alternative to batch normalization. With batch normalization, normalizing along the batch dimension introduces problems, and BN's error increases as the batch size decreases. This limits the application of BN to computer vision tasks such as video, detection and segmentation, which require small batches. Group normalization fixes this problem by dividing the channels into groups and computing the mean and variance within each group for normalization. Unlike BN, GN's accuracy is stable over a wide range of batch sizes and its computation is independent of the batch size. Group normalization can be easily transferred from training to fine-tuning. GN gave 10 percent lower error than BN when used for ResNet-50 trained on ImageNet, and according to (Wu and He 2018) it can effectively replace strong BN baselines in a variety of computer vision tasks, such as object detection and segmentation in COCO and video classification in Kinetics.

L1 and L2 regularization (kernel regularization). L1 regularization, or the lasso (Tibshirani 1996), has been used to attain sparse solutions. L1 regularization and its variants, such as group sparsity regularization, appear promising for deep neural networks (Wen et al. 2016) as a way to reduce computation and power consumption. L2 regularization, on the other hand, smooths the parameter distribution and reduces the magnitude of the parameters, making the model less prone to overfitting. L2 regularization plays an important role in training deep neural networks (Krizhevsky and Hinton 2012) and has achieved high performance when combined with dropout (Srivastava et al. 2014). However, it has been shown that L2 regularization has no regularizing effect when combined with batch normalization (Laarhoven 2017).

Shakeout (Kang, Li, and Tao 2016) is a simple and effective regularization method for training deep neural networks. Unlike dropout, which works by imposing an L2 regularizer on the weights (Wager, Wang, and Liang 2013), shakeout imposes a combination of L1 and L2 regularization on the weights, which is equivalent to introducing elastic-net-like regularization. Empirical evaluation on the MNIST and CIFAR-10 datasets shows that shakeout reduces overfitting effectively.

Cutout (DeVries and Taylor 2017) is a simple but effective regularization method for improving the robustness and performance of convolutional neural networks. Cutout randomly masks out square regions of the input during training. It can easily be combined with existing forms of data augmentation and other regularization techniques to improve the performance of convolutional neural networks. Cutout is an extension of dropout, but a comparison of the two methods shows that cutout forces the model to take the full image context into consideration rather than focusing on a few visual features. Another major difference is that the units are dropped at the input stage rather than in the intermediate layers, so visual features removed from the input are correspondingly removed from all subsequent feature maps. With dropout, in contrast, each feature map is considered individually, so features randomly removed from one feature map may still be present in others; these inconsistencies produce noise, which forces the network to become more robust to noisy inputs. Cutout is thus much closer to data augmentation than dropout, as it does not create noise but produces images that are novel to the network. Cutout has produced state-of-the-art results when evaluated on the CIFAR-10, CIFAR-100, and SVHN benchmarks.

Shake-shake regularization (Gastaldi 2017) is an attempt to produce softer augmentation than random flips or crops. The idea of shake-shake is to replace the standard summation of residual branches with a stochastic affine combination in a 3-branch ResNet. When tested empirically on the CIFAR-10 dataset, shake-shake regularization shows great performance. A comparison of shakeout and shake-shake shows that both methods replace Bernoulli variables with scaling coefficients, but the starting point of shakeout is dropout, whereas that of shake-shake is a mix of FractalNet drop-path and stochastic depth. Shakeout keeps the coefficients the same between the forward and backward passes, whereas shake-shake updates them before each pass. Shake-shake works by adding two residual flows and a skip connection, whereas shakeout requires only one flow.

Early stopping is a very effective and simple technique: training is stopped before the model starts to overfit. At the beginning of training, the error falls for both the training and test sets, but over time the training error continues to fall while the test error starts to increase as the model overfits the training data. Early stopping is achieved by training the network until it starts to overfit and then returning to the point that produced the minimum error on the test set (Goodfellow, Bengio, and Courville 2016). However, it is expensive and exhausting to keep an eye on every iteration. A more practical approach is to monitor both the training and test errors and stop training when the test error has not decreased for a number of epochs, for example 10. Figure 3 shows that the training error keeps decreasing while the validation error starts to increase after a certain number of epochs; training should stop at the point after which the validation error starts to increase.

Figure 3. Graphical representation of early stopping
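In Keras, the framework referred to later in this thesis, this more practical approach can be automated with the EarlyStopping callback. The sketch below is purely illustrative: the monitored quantity and the patience of 10 epochs follow the example above rather than any setting used in the thesis's experiments.

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop when the validation loss has not improved for 10 consecutive epochs
# and restore the weights from the epoch with the minimum validation error.
early_stopping = EarlyStopping(monitor="val_loss", patience=10,
                               restore_best_weights=True)

# Hypothetical usage with an already-compiled model and data:
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, callbacks=[early_stopping])
```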

Data augmentation is another very effective regularization technique. Ideally, the best way to train a neural network such as a CNN is to have a large amount of data, but in practice data is always limited. The dataset is therefore artificially enlarged by applying small transformations to the training data that mimic variations which may appear in unseen data and that do not change the class of the example; in this way the network becomes more robust. For specific classification tasks such as speech recognition (Jaitly and Hinton 2013), data augmentation is a very effective method of regularization. For images, data augmentation can be done by shifting the pixels by one position (up, down, left, right), by changing the color saturation, by taking multiple crops of the image, or by rotating the image by a degree or two. Large rotations are not recommended, because they can change the class of the image; a large rotation can turn the digit 9 into a 6 (Goodfellow, Bengio, and Courville 2016). Fancy PCA, introduced with AlexNet in 2012, is another way of doing data augmentation. Noise injection into the input of a neural network can also be seen as a form of data augmentation (Sietsma and Dow 1991).
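As a hedged illustration of such label-preserving image transformations in Keras (the transformation values below are assumptions for illustration only, not settings used in this thesis), an ImageDataGenerator can apply small rotations, slight shifts, flips and mild brightness changes on the fly:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Small, label-preserving transformations only: a rotation of a degree or two,
# slight shifts, horizontal flips and mild brightness changes.
augmenter = ImageDataGenerator(
    rotation_range=2,             # rotate by at most about 2 degrees
    width_shift_range=0.05,       # shift left/right by a few pixels
    height_shift_range=0.05,      # shift up/down by a few pixels
    horizontal_flip=True,         # safe for cats/dogs, not for digit images
    brightness_range=(0.9, 1.1),  # mild brightness variation
)

# Hypothetical usage: flow() yields endlessly augmented batches for training.
# model.fit(augmenter.flow(x_train, y_train, batch_size=32), epochs=25)
```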


3 Theory

Advanced machine learning and deep learning techniques are providing cutting-edge accuracy in computer vision tasks. Convolutional neural networks have earned great fame in solving tasks such as image classification, image segmentation and registration, image retrieval, and object detection. Because of the great utility of CNNs, researchers in the field are concentrating on enhancing their performance, and the development of new regularization methods is key to achieving it. As this thesis focuses on regularization for convolutional neural networks, this chapter presents the concepts related to the thesis work: it first introduces image classification and deep learning on a general level, then explains neural networks, covers convolutional neural networks in more detail and, finally, describes the regularization methods applied to the CNN models for image classification in this work.

3.1 Image Classification

Image classification involves extracting features and classifying objects into different classes. Classification can be supervised or unsupervised. In supervised classification, sample pixels in an image are selected as representatives of specific classes; these representative training sites are then used as references for classifying all other pixels in the image. Supervised classification relies on a labeled training database and human intervention. Support vector machines are supervised machine learning models used for analyzing data and recognizing patterns (Wang et al. 2014). The main advantage of supervised classification is that errors can be detected and corrected by the operator; the disadvantage is that if the training data selected by the analyst does not contain enough information to train the model, the approach is vulnerable to human error (Wang et al. 2014). In unsupervised classification, pixels with common characteristics are selected by software analysis of the image, and different algorithms are used to find related pixels and group them into classes. No prior information or human intervention is needed. The advantage of unsupervised learning is that it is fast and free from human intervention; the main disadvantage of unsupervised classification is that the classes are based purely on maximally separable clusters (Wang et al. 2014).

Non-parametric classifiers like neural networks, knowledge-based classifiers and decision tree classifiers have already become important techniques for multisource data classification (Lu and Weng 2005).

Deep learning techniques are the state-of-the-art tools for image classification (Krizhevsky and Hinton 2012; Zeiler and Fergus 2013). Convolutional neural networks showed remarkable capability in the ImageNet classification challenge (Krizhevsky and Hinton 2012), and ever since, CNNs have gained popularity in solving computer vision problems.

3.2 General introduction to deep learning

Deep learning is one way of creating machines that can think intelligently, using a specific class of algorithms called neural networks. DL is part of a broader family of machine learning methods based on learning data representations, as opposed to task-specific algorithms (Goodfellow, Bengio, and Courville 2016). Deep learning methods can be supervised, partially supervised or unsupervised. The concept of neural nets or deep neural nets is taken from biological nervous systems, where neural coding explains the relation between a certain stimulus and the response of the brain (Schmidhuber 2015). The most common deep learning techniques are:

• Recurrent neural networks (RNNs)

• Deep neural networks (DNNs)

• Deep belief networks (DBNs)

• Convolutional Neural Networks (CNNs)

• Feed Forward Neural Networks (FFNs)

• Autoencoders

Over the past years, research has shown that deep learning has a wide scope of applications in almost every field of life. Deep learning algorithms have been applied in fields such as bioinformatics (Min, Lee, and Yoon 2017), medicine and healthcare (Danaee, Ghaeini, and Hendrix 2017), space informatics, weather forecasting (Jiang, Xu, and Wei 2018), education (Warburton 2003), traffic and transportation (Polson and Sokolov 2017; Ma et al. 2017), agriculture (Kamilaris, Francesc, and Boldu 2018), robotics (Pierson and Gashler 2017), and gaming (Schuurmans and Zinkevich 2016).

3.3 Artificial Neural Networks

An ANN is built from connected nodes called neurons or artificial neurons. A connection between neurons is called a synapse, over which a signal is transmitted from one neuron to another. The postsynaptic (receiving) neuron processes the signal and passes it on to the next neuron. The strength of the signal depends on the weights of the neurons and synapses; the weights change during learning and thereby increase or decrease the strength of the signal. The state of a neuron is represented by a real number between 0 and 1. A neural net consists of neurons acting as processors, each producing a sequence of activations. Input neurons are activated by environmental stimuli, while neurons in the hidden or middle layers are activated through weighted connections from the previous neurons (Schmidhuber 2015).

An ANN can be understood through the perceptron (Rosenblatt 1958), a simple neural network consisting of an input layer, an output layer, and one or more intermediate layers of neurons. When the input layer neurons are clamped to their values, the signal passes layer by layer and the neurons compute their outputs; this is called the feed-forward configuration. The output values depend on the input values together with all the synaptic weights and thresholds. In the training phase, known patterns are fed to the network and the weights are adjusted to produce the desired output; in the testing phase, test patterns are introduced to the network and it produces the corresponding output or target value (Marsolek and Burgund 1997). The batch size is the number of patterns shown to the network before the weights are updated; it is also a training-time optimization, defining how many patterns to read at a time and keep in memory. The number of epochs is the number of times the entire training dataset is shown to the network during training.
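As a minimal, purely illustrative sketch of the feed-forward configuration and of the batch size and epoch parameters described above (the layer sizes, toy data, and the values 32 and 25 are assumptions, not settings from this thesis), a small fully connected Keras network could look as follows:

```python
import numpy as np
from tensorflow.keras import layers, models

# Input layer -> one hidden layer -> output layer (feed-forward configuration).
model = models.Sequential([
    layers.Dense(16, activation="relu", input_shape=(8,)),  # hidden layer
    layers.Dense(1, activation="sigmoid"),                  # output neuron
])
model.compile(optimizer="sgd", loss="binary_crossentropy", metrics=["accuracy"])

# Toy patterns just to make the example runnable.
x = np.random.rand(256, 8).astype("float32")
y = (x.sum(axis=1) > 4.0).astype("float32")

# batch_size: how many patterns are shown before the weights are updated.
# epochs: how many times the whole training set is shown to the network.
model.fit(x, y, batch_size=32, epochs=25, verbose=0)
```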

3.3.1 Activation Function

In an ANN, the activation function is an abstraction of the rate at which a neuron fires. In the convolutional neural network context the activation function is sometimes called the nonlinearity; it takes a single number and performs a mathematical operation on it. In its simplest form it is a binary function, meaning that the neuron either fires or does not. The activation function can also be described as taking the weighted sum of all the inputs, adding a bias, and deciding whether the neuron fires. Activation functions should be nonlinear and continuously differentiable. Nonlinearity gives the neural network its universal approximation capability, and continuous differentiability is important for gradient-based optimization methods, allowing the efficient backpropagation of errors through the network (Kriesel 2017).

The following are the most commonly used activation functions for neural nets.

Rectifier linear unit (ReLU): Mathematically it is given in Equation 3.1:

f(x) = max(x, 0)    (3.1)

which simply thresholds at 0, where x is the input to the neuron. This function provides the nonlinearity (Kriesel 2017).

Hyperbolic tangent (tanh): This function introduces nonlinearity and maps its input to the range [-1, 1]. The tanh nonlinearity is often preferred over the sigmoid because its output is zero-centered. Mathematically it is written as in Equation 3.2:

tanh(x) = (1 − e^(−2x)) / (1 + e^(−2x))    (3.2)

Sigmoid: The sigmoid activation function maps a real-valued input into the range between 0 and 1. Equation 3.3 represents the sigmoid function:

f(x) = 1 / (1 + e^(−βx))    (3.3)
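The three activation functions above can be written directly in NumPy; the following sketch is only illustrative and mirrors Equations 3.1-3.3 (with β = 1 as the default for the sigmoid).

```python
import numpy as np

def relu(x):
    """Rectifier linear unit, Equation 3.1: max(x, 0), thresholds negative inputs at zero."""
    return np.maximum(x, 0.0)

def tanh(x):
    """Hyperbolic tangent, Equation 3.2: squashes inputs into the range [-1, 1], zero-centered."""
    return (1.0 - np.exp(-2.0 * x)) / (1.0 + np.exp(-2.0 * x))

def sigmoid(x, beta=1.0):
    """Sigmoid, Equation 3.3: squashes inputs into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-beta * x))

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z), tanh(z), sigmoid(z), sep="\n")
```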


3.3.2 Cross Entropy

A cost function is an estimate of the difference between the predicted value and the actual value. In an ANN, when the input signal is passed to a neuron, the activation function is applied and the weights are adjusted to produce some target output value. Model accuracy depends on the value of the cross-entropy loss: the smaller the value, the more accurate the model. The cost function can be minimized using gradient descent or stochastic gradient descent together with forward and backward propagation (Goodfellow, Bengio, and Courville 2016).

The loss function used in our experimental setup is binary/categorical cross entropy. Cross entropy is the most popular loss function for classifying images with neural networks; it is generalized to multiple classes via the softmax function and the negative log-likelihood.

The cross-entropy loss is mathematically represented as in Equation 3.4:

L(x, x̂) = − Σ_{n=1}^{N} [ x(n) log(x̂(n)) + (1 − x(n)) log(1 − x̂(n)) ]    (3.4)

When the outputs of a network can be treated as independent hypotheses, each output node represents the probability that its hypothesis is true. The cross-entropy function is most useful in problems where the target is 1 or 0.
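To make Equation 3.4 concrete, the following NumPy sketch computes the summed binary cross entropy for a handful of hypothetical labels and predicted probabilities; the numbers are illustrative only, and deep learning frameworks such as Keras typically report the mean rather than the sum.

```python
import numpy as np

def binary_cross_entropy(x, x_hat, eps=1e-12):
    """Equation 3.4: summed binary cross entropy between labels x and predictions x_hat."""
    x_hat = np.clip(x_hat, eps, 1.0 - eps)  # guard against log(0)
    return -np.sum(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat))

labels = np.array([1.0, 0.0, 1.0, 1.0])
predictions = np.array([0.9, 0.2, 0.7, 0.4])
print(binary_cross_entropy(labels, predictions))  # smaller values mean a more accurate model
```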

3.4 Model Performance Evaluation

The most commonly used methods for evaluating the performance of a classifier are holdout, bootstrap, random sampling and cross validation. Explaining all of these methods is out of the scope of this thesis. We applied K = 10-fold cross validation to validate our results, as explained below.

3.4.1 K-fold Cross validation

K-fold cross validation is a method for estimating a tuning parameter λ. The data is divided into K equal parts; for each k = 1, 2, ..., K, the model is fitted with the parameter to the other K−1 parts, giving the estimate β̂_{−k}(λ), and its error in predicting the kth part is computed. In this thesis we applied K = 10-fold cross validation, which means the whole dataset was divided into 10 equal parts, one part was used for validation and the remaining nine parts for training. K-fold cross validation is generally not common practice in deep learning because of its computational expense and time, but from a research point of view it is worthwhile for two reasons: first, to estimate the generalization of the test errors, and second, to see how sensitive the different methods are to variations in the data. Equation 3.5 gives the error in predicting the kth part.

E_k(λ) = Σ_{i ∈ kth part} ( y_i − x_i β̂_{−k}(λ) )²    (3.5)

Equation 3.6 gives the mathematical expression for the cross-validation error.

CV(λ) = (1/K) Σ_{k=1}^{K} E_k(λ)    (3.6)
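As an illustration of the 10-fold procedure (not the thesis's actual experiment code), the sketch below uses scikit-learn's KFold to produce the ten train/validation splits; the toy data and the placeholder fold error stand in for a fitted model's prediction error E_k in Equation 3.5.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 50 samples with 2 features and binary labels, for illustration only.
X = np.arange(100, dtype=float).reshape(50, 2)
y = np.arange(50) % 2

fold_errors = []
for train_idx, val_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    X_train, X_val = X[train_idx], X[val_idx]   # nine folds for training
    y_train, y_val = y[train_idx], y[val_idx]   # one fold held out for validation
    # ... fit the model on (X_train, y_train) and evaluate it on (X_val, y_val) ...
    fold_errors.append(0.0)                     # placeholder for E_k

cv_error = np.mean(fold_errors)                 # Equation 3.6
```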

The performance of each model can be analyzed, for instance, by measuring the sensitivity (proportion of true positives), specificity (proportion of true negatives), accuracy and area under the curve (AUC).

3.4.2 Accuracy

To assess model performance, we measured accuracy. Accuracy is a measure of the total number of correct predictions, i.e., how often the classifier is correct; it is given mathematically in Equation 3.7.

Accuracy = (TP + TN) / Total    (3.7)

where TP stands for true positive predictions, TN for true negative predictions, and Total is the total number of samples classified.


3.5 Convolutional Neural Networks (CNNs)

A CNN normally takes an order-3 tensor as its input, e.g., an image with H rows, W columns, and 3 channels (R, G, B color channels); higher-order tensor inputs are handled in a similar fashion. The input then goes sequentially through a series of processing steps. Each processing step is a layer, which can be a convolution layer, a pooling layer, a normalization layer, a fully connected layer, and so on. A simple CNN architecture presented in (Wu, D., and Sabuncu 2016) applies two convolutional layers, each followed by a pooling layer, with the ReLU activation function applied after each convolutional layer. Another simple CNN architecture is shown in Figure 4.

Convolutional layer: The convolutional layer performs the main operations of training the network. In each learning iteration, the weights in the convolutional layers are updated and adjusted using the backpropagation algorithm. The set of weights, termed a filter or kernel, is convolved with the input.

The ReLU layer (rectifier linear unit): ReLU is the layer after the convolutional layer and the most commonly used activation function for the outputs of the neurons. Its purpose is to apply nonlinearity to the network after the convolutional layer, which computes linear functions. ReLU changes all negative activation values to zero.

The pooling layer / downsampling: The pooling layer is also called the downsampling layer. The most commonly used form is max pooling; other options are average pooling and the L2 norm. The layer works by sliding a filter, typically of size 2×2, over the input volume and extracting the maximum value from each subregion. Max pooling is useful for two basic reasons: first, it reduces the computation for the upper layers, which means fewer parameters and weights; second, it provides translational invariance and helps reduce or control overfitting (LeCun et al. 1998).

Flattening: The pooled feature maps are then flattened into a single column vector (a vector of inputs for the next layer).

Fully connected layers: The purpose of fully connected layers in CNNs is to combine features from all the previous layers, as the neurons in this layer are connected to all the activations in the previous volume. In fully connected layers, units are sometimes dropped out, and this dropping out is called the dropout method (Srivastava et al. 2014). The final fully connected layer produces an output equal to the number of classes required; convolution layers produce 3D activation maps, while only a 1D vector is required as output. The output layer has a loss function, such as categorical cross-entropy, which computes the error in the prediction. After the forward pass is complete, backpropagation begins updating the weights and biases to reduce the error and loss.

Figure 4. Deep neural network with three hidden layers, input and output layers (Peng et al. 2016)

The performance of a model depends highly on the selection of its parameters. In building CNN models, the optimization and selection of hyperparameters is very important. Once the model is built, it is fine-tuned by changing parameters to find a good combination of hyperparameters that yields higher performance. These hyperparameters include the learning rate, batch size, training rate, image processing parameters, number of layers, convolutional filter sizes, and dropout fractions (if dropout is applied) (LeCun et al. 1998).
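To tie the layer descriptions above together, the following Keras sketch builds a small CNN in the order convolution, ReLU, pooling, flattening, fully connected layers. The input shape, filter counts and other hyperparameters are illustrative assumptions and not the models described in Chapter 4.

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(64, 64, 3)),  # convolution + ReLU
    layers.MaxPooling2D((2, 2)),                                            # 2x2 max pooling
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                        # 3D activation maps -> 1D vector
    layers.Dense(128, activation="relu"),    # fully connected layer
    layers.Dense(1, activation="sigmoid"),   # output layer (binary classification)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```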

3.6 Overfitting and Underfitting

The problem of overfitting in ML/DL techniques arises when the model learns the training data too well, to the point of including its noise and fluctuations, which degrades the model's performance on new data. The problem of underfitting arises when the model does not perform well even on the training data; in that case it is better to try an alternative ML/DL algorithm to fit the training data and obtain satisfactory results. Training a model to produce accurate outputs becomes challenging in both cases. If the model fits the training data very closely but gives high loss values on unseen or test data, the model is overfitting; reliable performance on unseen data is what makes a model perform well. High model performance can be achieved by feeding the network more and better data, by giving the neural network more information about the real function, by decreasing the complexity of the network, and by regularizing the network so that it approximates the real function. In the case of convolutional neural networks, it has been observed that big CNNs perform best, so much of the research in deep neural networks focuses on developing new methods to overcome overfitting, as bigger networks are more prone to it (Geoffrey Hinton et al. 2012; Wan et al. 2013; Tomczak 2013; Wang and JaJa 2013; Szegedy et al. 2014; Lei Ba, Kiros, and Hinton 2016; Bouthillier et al. 2015; Ioffe and Szegedy 2015; Morerio et al. 2017; Wu and He 2018).

3.7 Regularization

Tuning a neural network or model by selecting suitable parameters so that the model performs well and gives higher accuracy is termed regularization; it is one of the most effective ways of achieving good generalization in deep neural networks (Goodfellow, Bengio, and Courville 2016). Regularization is also defined as "any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error" (Goodfellow, Bengio, and Courville 2016).

Neural networks are excellent models for approximating highly complex functions, but they are also more vulnerable to overfitting. In supervised learning tasks, when training data is used to train a neural network, it is very common for the loss to be very low on the training data but very high on the test data; this is the overfitting problem that affects most complex neural networks. The quality of a model depends on how well it performs on unseen data. There are three main ways to achieve good model performance, listed below.

• Feeding the network with more and better data and giving the neural network more information about the real function.

• Increasing the complexity of the network by adding more layers

• Regularizing the network so that the network approximates the real function.

All of the above are related to each other, and considering only one of them will not solve the problem. To reduce overfitting in neural networks, the better approach is to combine a high-capacity network, large and good-quality data, and regularization methods.

Chapter 2 discussed in detail the different regularization techniques applied to convolutional neural networks and other deep learning techniques. Here, only those regularization methods applied to our CNN models for comparison in this thesis are explained. From the literature it is evident that dropout (Geoffrey Hinton et al. 2012) is the most effective and commonly used method for regularizing convolutional neural networks (Smirnov, Timoshenko, and Andrianov 2014); methods like DropConnect (Wan et al. 2013), Standout (Lei Ba, Kiros, and Hinton 2016), DropAll (Frazão and Alexandre 2014), random dropout (Bouthillier et al. 2015), and curriculum dropout (Morerio et al. 2017) are all generalizations of dropout. Batch normalization has emerged as another strong and effective regularization method and has become a vital part of deep learning frameworks in many computer vision tasks (He et al. 2015). The third regularization method applied to the CNNs for comparison in this thesis is L1 and L2 regularization (named kernel regularization in the Keras documentation), which plays an important role in regularizing deep neural networks. It has been reported that L2 regularization combined with dropout reduces overfitting effectively (Krizhevsky and Hinton 2012), so it is interesting to see how L2 behaves when applied to CNNs without other regularization methods. An introductory explanation of the three regularization methods is given below.


3.7.1 Batch Normalization

In deep neural networks, the input of each layer changes during training as the parameters of the previous layers change; as a result, training slows down. This phenomenon is called internal covariate shift (ICS). The problem is solved by normalizing the layer inputs, and the method is called batch normalization. During training, normalization is performed over each mini-batch, which allows much higher learning rates to be used (Ioffe and Szegedy 2015). Batch normalization not only reduces overfitting but also speeds up training by allowing higher learning rates and by reducing the sensitivity to the initial weights. Learned mean and standard deviation parameters maintain the normalized outputs by keeping the mean activation close to zero and the activation standard deviation close to one. For convolutional layers, normalization should also respect the convolution property, i.e., different elements of the same feature map at various locations are normalized in the same way; all the activations in a mini-batch are therefore jointly normalized over all locations, and the parameters are learnt per feature map instead of per activation. In traditional deep networks, a very high learning rate may result in gradients that vanish or explode; batch normalization helps minimize such problems. Normalizing activations throughout the network prevents slight changes in layer parameters from being amplified as the data propagates through a deep network. Batch normalization also makes training more robust to the parameter scale: large learning rates may increase the scale of the layer parameters, which can amplify the gradients during backpropagation and lead to model explosion (Ioffe and Szegedy 2015). A recent study shows that BN adds smoothness to the internal optimization problem of the network (Santurkar et al. 2018).

We applied BN as one of the regularization methods to regularize our CNN model because, according to the literature study, it is well established that BN is well suited to convolutional neural networks and has been a vital part of deep learning frameworks in many computer vision tasks (He et al. 2015). Apart from reducing internal covariate shift, it also smooths the optimization landscape (Santurkar et al. 2018), thus providing more stability and reliability to the gradients and increasing the training speed significantly. Another good reason for using BN is that batch normalization is less sensitive to hyperparameter choices.
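As a concrete illustration, a minimal sketch of how a batch normalization layer can be placed between convolutional layers in Keras is given below; the filter counts, input shape, and layer ordering are illustrative assumptions, not the exact configuration trained in this thesis.

    from keras.models import Sequential
    from keras.layers import Conv2D, BatchNormalization, Activation, MaxPooling2D

    # Sketch only: batch normalization inserted between convolutional layers.
    # Filter counts and input shape are illustrative, not the thesis configuration.
    model = Sequential()
    model.add(Conv2D(32, (3, 3), padding='same', input_shape=(64, 64, 3)))
    model.add(BatchNormalization())   # normalizes the activations of the previous layer
    model.add(Activation('relu'))
    model.add(Conv2D(32, (3, 3), padding='same'))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))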


3.7.2 Dropout

Dropout (Geoffrey Hinton et al. 2012) is a simple and effective regularization technique.

Convolutional neural networks with a massive number of parameters are a powerful deep learning technique, but the problem of overfitting remains. Dropout addresses the issue of overfitting by randomly dropping units, together with their connections, from the neural network during training.

This dropping out minimizes co-adaptation. Dropout forces every neuron to be able to work independently, so that the network does not rely on any small set of neurons.

During training of the neural network, some neurons with their connections are randomly dropped to prevent too much co-adaptation. A unit removed temporarily from the network together with all its incoming and outgoing connections is termed a dropout. Dropout samples a thinned network within the full neural network and, based on the input data, updates the parameters of the sampled network. The exponentially many sampled networks are therefore not independent, as they share parameters. Dropout reduces overfitting significantly and improves the performance of neural networks on many supervised learning tasks. The technique is not applied during testing and is always applied to input/hidden layer nodes instead of output nodes (Srivastava et al. 2014). The main drawback of dropout is the increased time consumption: a network trained with dropout takes 2-3 times longer to train than a standard neural network (Srivastava et al. 2014). One main reason for this increase is that the parameter updates are very noisy; the gradients being computed are not the gradients of the architecture that will be used at test time, so training takes a long time. If the noise is reduced, the training time can be reduced as well. Therefore, with high dropout we can reduce overfitting at the cost of a longer training time. It is practical to start with a low dropout value like 0.2 and then fine-tune. In practice, the dropout ratio is 0.5 by default, but it can be tuned, typically in the range 0.1-0.5.

We opted to apply dropout to our CNN model because dropout is the most effective and commonly used regularization method for regularizing convolutional neural networks (Smirnov, Timoshenko, and Andrianov 2014). Another good reason is that dropout is easily applicable to CNNs: it is computationally very cheap and does not limit the type of the model.

Dropout works well with any type of model that uses distributed representations and can be trained with stochastic gradient descent.
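A minimal Keras sketch of this placement is given below, with dropout added after each pooling layer and before the output; the rates shown (0.2-0.5) are illustrative starting points within the range discussed above, not the tuned values of the experiments.

    from keras.models import Sequential
    from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

    # Sketch only: dropout applied after pooling layers and on the dense layer.
    # Rates are illustrative starting points, not the tuned experiment values.
    model = Sequential()
    model.add(Conv2D(32, (3, 3), activation='relu', padding='same',
                     input_shape=(64, 64, 3)))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.2))            # drop 20% of activations during training only
    model.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.3))
    model.add(Flatten())
    model.add(Dense(128, activation='relu'))
    model.add(Dropout(0.5))            # heavier dropout on the fully connected layer
    model.add(Dense(2, activation='softmax'))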


3.7.3 Kernel regularization

L1 regularization penalizes the absolute value of the weights and tends to drive some weights exactly to zero. L2 regularization penalizes the squared value of the weights and tends to drive all weights to smaller values. L1 and L2 regularization can be combined, and this combination is called elastic net regularization. L1 regularization selects the most important inputs and becomes invariant to the noisy ones. L2 is often preferred over L1 because it yields final weight vectors with small, diffuse values, and it is the most common type of regularization. It is implemented by penalizing the squared magnitude of all parameters directly in the objective. Kernel regularization has produced excellent results in terms of accuracy when applied to convolutional neural networks for visual recognition tasks, including handwritten digit recognition, gender classification, ethnic origin recognition, and object recognition (Yu, Xu, and Gong 2009). It has been observed that kernel regularization smooths the parameter distribution and reduces the magnitude of the parameters, which makes the solution less prone to overfitting.

We applied kernel regularization as the third regularization method for comparison purposes in this thesis work because it plays an important role in regularizing deep neural networks. It has been reported in research that L2 regularization, when combined with dropout, reduces overfitting effectively (Krizhevsky and Hinton 2012).

So it is interesting to see how L1 and L2 behave when applied to CNNs without other regularization methods.
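In Keras this corresponds to passing a kernel_regularizer to the layers; the sketch below uses the l1_l2 helper with penalty weights of 0.001 (the values used later in the experiments), while the choice of which layers to regularize is an illustrative assumption.

    from keras.models import Sequential
    from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
    from keras.regularizers import l1_l2

    # Sketch only: L1/L2 (kernel) regularization attached to layer weights.
    # Penalty weights follow the experiment values (0.001); the regularized
    # layers are chosen here for illustration.
    reg = l1_l2(l1=0.001, l2=0.001)

    model = Sequential()
    model.add(Conv2D(32, (3, 3), activation='relu', padding='same',
                     kernel_regularizer=reg, input_shape=(64, 64, 3)))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Flatten())
    model.add(Dense(128, activation='relu', kernel_regularizer=reg))
    model.add(Dense(2, activation='softmax'))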


4 Experimental Setup: Materials and Methods

The following sections of the chapter provide an overview of the datasets, model, and test environment used for the experimental setup.

4.1 Datasets

For the experimental work, we used open-source image datasets that are widely used in image classification competitions.

4.1.1 Dataset-1: Cats Vs Dogs

Dataset-1 is taken from the well-known Kaggle competition platform (KMLC-Challenge-1).

The KMLC challenge dataset of Cats Vs Dogs is the Asirra dataset provided by Microsoft Research, comprising 25,000 color images with a spatial size of 64x64 pixels. One of the best accuracies achieved on this dataset with a convolutional neural network is 94 percent (Liu, Liu, and Zhou 2014). A sample image from the Cats Vs Dogs dataset is shown in figure 5 (left). The main reasons for using the Cats Vs Dogs data are, firstly, the easy and free availability of a sufficient amount of imaging data to train a convolutional neural network and, secondly, that it addresses a binary classification problem.

4.1.2 Dataset-2: CIFAR-10

Dataset-2 consists of the CIFAR-10 imaging data, which is commonly used in classification research. CIFAR-10 is a subset of the 80 million tiny images dataset (Krizhevsky and Hinton 2012). It comprises 60,000 color images with a spatial size of 32x32 pixels, divided into 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck) with 6,000 images per class. The classes are mutually exclusive, meaning there is no overlap between the truck and automobile classes. The frequency of the class labels is exactly equal for all classes, so CIFAR-10 is a balanced dataset. The highest accuracy achieved on the CIFAR-10 dataset with convolutional neural networks is 96.53 percent (Graham 2014).


Figure 5. Sample images from Cats Vs Dogs (left) and CIFAR-10 (right)

We find this dataset well suited for our experimental work, firstly because of the easy availability of a large amount of training data and, secondly, because it poses a categorical (multi-class) classification task. Figure 5 (right) shows a visual representation of the CIFAR-10 dataset.
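For reference, CIFAR-10 can be loaded directly through the Keras datasets module; the sketch below assumes this loading route and a simple scale-to-[0,1] preprocessing, which may differ from the exact pipeline used in the experiments.

    from keras.datasets import cifar10
    from keras.utils import to_categorical

    # Sketch only: load CIFAR-10 and prepare it for a CNN classifier.
    (x_train, y_train), (x_test, y_test) = cifar10.load_data()

    # Scale pixel values to [0, 1] and one-hot encode the 10 class labels.
    x_train = x_train.astype('float32') / 255.0
    x_test = x_test.astype('float32') / 255.0
    y_train = to_categorical(y_train, num_classes=10)
    y_test = to_categorical(y_test, num_classes=10)

    print(x_train.shape)  # (50000, 32, 32, 3)
    print(x_test.shape)   # (10000, 32, 32, 3)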

4.2 Model: Convolutional Neural Network (CNN)

We classify the images using the deep learning technique called the convolutional neural network (CNN), explained in chapter 3. A CNN model is built for both the Cats Vs Dogs and the CIFAR-10 dataset for the image classification problem.

The CNN model for the Cats Vs Dogs dataset features eight convolutional layers with 3x3 kernels and stride set to 2x2. The convolutional layers appear in groups of two with the same filter size, each group followed by a pooling layer with pool size 2x2 and stride size 2x2. A ReLU activation function is added to each convolutional layer. Finally, a flatten layer followed by a dense layer of size 128(3x3) is added. If batch normalization is to be added, it is added between the convolutional layers. If dropout is to be added, it is added with each pooling layer. The final output of two classes (binary classification, 0 or 1) is computed with softmax. The model description for the Cats Vs Dogs dataset is shown in table 1.


Table 1. Model description for Dataset-1 (Cats Vs Dogs)

Layers            Description
Input image       64x64 color image
2xConvolution     32(3x3) each
Maxpool           Pool size, stride size = 2x2
2xConvolution     64(3x3) each
Maxpool           Pool size, stride size = 2x2
2xConvolution     128(3x3) each
Maxpool           Pool size, stride size = 2x2
Reshape           Flatten
2xDense layer     128(3x3) each
Softmax output    2 classes
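A minimal Keras sketch following the layer pattern of table 1 is shown below; the padding, activation placement, and 128-unit width of the dense layers are assumptions made for illustration rather than the exact trained configuration.

    from keras.models import Sequential
    from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

    # Sketch only: layer pattern of table 1 (Cats Vs Dogs, 2-class softmax output).
    # Padding and the 128-unit dense layers are illustrative assumptions.
    model = Sequential()
    model.add(Conv2D(32, (3, 3), activation='relu', padding='same',
                     input_shape=(64, 64, 3)))
    model.add(Conv2D(32, (3, 3), activation='relu', padding='same'))
    model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
    model.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
    model.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
    model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
    model.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
    model.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
    model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
    model.add(Flatten())
    model.add(Dense(128, activation='relu'))
    model.add(Dense(128, activation='relu'))
    model.add(Dense(2, activation='softmax'))
    model.summary()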

The CNN model for the CIFAR-10 dataset features six convolutional layers with 3x3 kernels and stride set to 2x2. The convolutional layers appear in groups of two with the same filter size, each group followed by a pooling layer with pool size 2x2 and stride size 2x2. A ReLU activation function is added to each convolutional layer. Finally, a flatten layer followed by a dense layer of size 128(3x3) is added. If batch normalization is to be added, it is added between the convolutional layers. If dropout is to be added, it is added with each pooling layer. The final output of ten classes (categorical classification) is computed with softmax. The model description for dataset-2 is shown in table 2.

Train and validation accuracies and cross-entropies (explained in chapter 3) are calculated. The loss function used is binary/categorical cross-entropy. Cross-entropy is the most popular loss function for classifying images with neural networks; it is generalized to multiple classes via the softmax function and the negative log-likelihood (explained in chapter 3).
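For reference, with softmax outputs ŷ and one-hot targets y over C classes and N samples, the categorical cross-entropy takes the standard form below (binary cross-entropy is the special case C = 2):

    L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log \hat{y}_{i,c}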

The CNN model is validated by 10-fold cross-validation (explained in chapter 3), and accuracies and cross-entropies are calculated for each fold. The mean test error over the 10 folds for each regularization method (dropout, batch normalization, kernel regularization) is calculated and compared in a box plot, as shown in figure 11 and figure 16 in chapter 5. Computation times for each experiment are shown in table 4 and table 6 in chapter 5.
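A minimal sketch of such a cross-validation loop is given below, assuming scikit-learn's KFold is used for the splits; build_model is a hypothetical helper that returns a freshly compiled CNN, and the epoch and batch-size values are illustrative.

    import numpy as np
    from sklearn.model_selection import KFold

    # Sketch only: 10-fold cross-validation around a Keras model.
    # build_model() is a hypothetical helper returning a freshly compiled CNN.
    def cross_validate(build_model, x, y, n_splits=10, epochs=25, batch_size=32):
        kfold = KFold(n_splits=n_splits, shuffle=True, random_state=42)
        test_errors = []
        for train_idx, val_idx in kfold.split(x):
            model = build_model()
            model.fit(x[train_idx], y[train_idx],
                      validation_data=(x[val_idx], y[val_idx]),
                      epochs=epochs, batch_size=batch_size, verbose=0)
            loss, acc = model.evaluate(x[val_idx], y[val_idx], verbose=0)
            test_errors.append(1.0 - acc)  # fold test error = 1 - accuracy
        return np.mean(test_errors), test_errors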


Table 2. Model description for dataset-2 (CIFAR-10)

Layers            Description
Input image       32x32 color image
2xConvolution     32(3x3) each
Maxpool           Pool size, stride size = 2x2
2xConvolution     64(3x3) each
Maxpool           Pool size, stride size = 2x2
2xConvolution     128(3x3) each
Maxpool           Pool size, stride size = 2x2
Reshape           Flatten
2xDense layer     128(3x3) each
Softmax output    10 classes

4.3 Test Environment

Software: The Keras framework is used for building the convolutional neural networks, with Theano as the backend. Keras (Keras: The Python deep learning library) is a high-level neural network API written in Python and capable of running on top of TensorFlow, CNTK, or Theano.

Keras enables fast experimentation and is compatible with Python 2.7-3.6. The framework is user friendly and provides a high level of modularity, easy extensibility, and compactness.

Python 3.5 is used as the programming language in this thesis work. To support the experimental work, several open-source libraries, such as pandas, matplotlib, numpy, tensorflow, and scipy, together with the Keras regularizers module, are used. Anaconda is used as the virtual environment, as shown in figure 6.
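Selecting Theano as the Keras backend can be done, for example, through the KERAS_BACKEND environment variable or the "backend" field of the keras.json configuration file; the snippet below is a sketch of the environment-variable route.

    import os

    # Select Theano as the Keras backend before Keras is imported for the first time.
    # Alternatively, set "backend": "theano" in ~/.keras/keras.json.
    os.environ['KERAS_BACKEND'] = 'theano'

    import keras                     # reports the backend in use on import
    print(keras.backend.backend())   # expected to print 'theano'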


Hardware: For Cats Vs Dogs, we ran the experiments on a CPU (cpu family: 21, cpu cores: 4, cpuid level: 13) on a system running Ubuntu GNU/Linux 16.04 LTS. The GPU machine used for CIFAR-10 features an Nvidia Tesla K40c with 2880 CUDA cores. Our experiments ran on the test environment provided by the University of Jyväskylä, as shown in figure 6.


Figure 6. Test Environment at University of Jyväskylä


5 Experimental Results

This chapter is fully dedicated to the presentation of the results obtained from the experimental work. A CNN model is built for two different datasets (Cats Vs Dogs and CIFAR-10, explained in chapter 4) to study and compare the effect of different regularization methods (dropout, batch normalization, and kernel regularization). For each dataset, four CNN models are built: one model without any regularization and the remaining three models with the three different regularization methods. The performance of each model is measured by train and validation accuracies and cross-entropies. Test errors are calculated and presented in a box plot. The ten differently colored lines/curves in the plots show the results for the ten folds of the 10-fold cross-validation explained in chapter 3.

Finally, the comparison of test errors with the different regularization methods is shown in the box plots in figure 11 and figure 16, which indicate the best performing regularization method in our experimental setup.

We made use of both a CPU and a GPU for our experiments: the Cats Vs Dogs experiments ran on the CPU and the CIFAR-10 experiments on the GPU. We compared the running times of the experiments with the different regularization methods on the CPU and GPU separately. The purpose of the running-time comparison shown in table 4 and table 6 is to see which method is fastest in computation on the same dataset. We are not aiming to compare CPU and GPU times, as we used two different datasets on the CPU and GPU; we are only interested in the computation times of the experiments with the different regularization methods, to observe which method is fastest.

5.1 Results for Dataset-1 (Cats Vs Dogs)

The CNN classifier is used to classify the images of dogs and cats. The results are reported in terms of train and validation accuracies, binary cross-entropy loss, and test errors. We performed four experiments with 10-fold cross-validation to study the performance of the CNN model.


Figure 7. Train accuracy (top left), train CE (top right), validation accuracy (bottom left), validation CE (bottom right) for 10-fold cross validation for Baseline

The first experiment was run without any regularization method and used as the baseline; the second applied batch normalization; the third applied dropout (0.2, 0.3, 0.4); and the fourth applied kernel regularization (L1=0.001, L2=0.001).

5.1.1 Baseline Experiment with No Regularization

The experiment ran for 25 epochs with 10-fold cross-validation. The train accuracies, train cross-entropies, validation accuracies, and validation cross-entropies for the baseline are shown in figure 7. The test error for the baseline model is 0.4, as shown in table 3.

The curves for the training accuracies and cross-entropies are smooth and stable compared to the validation curves for accuracy and cross-entropy, which shows that during validation there is more noise and randomness in the data.

5.1.2 Regularization Method-1: Batch Normalization

We added batch normalization as the regularizer between the convolutional layers of our baseline model and obtained somewhat different results. Since BN reduces the internal covariate shift, explained earlier in chapter 3, it takes a longer time to normalize each batch. With BN we obtained increased train and validation accuracies, as shown in figure 8. It is to be noted that if we
