
The thesis is structured as follows: this chapter gives a short introduction to the topic, explains the scope of the thesis, presents the research questions, and explains the structure of the thesis. Chapter 2 is dedicated to prior work on the development of different regularization methods for regularizing convolutional neural networks (CNNs). This literature review helps in understanding the existing state-of-the-art regularization methods, particularly for CNNs and more generally for other deep learning techniques, and also presents a comparison of newly developed methods with previous ones. Chapter 3 presents the basic theory and concepts required to understand the thesis work. This chapter also explains the choice of methods used for the experimental work. Chapter 4 explains the datasets, methods, experimental environment, and framework used for building the CNNs.

Results obtained from the experiments are presented in Chapter 5. Chapter 6 presents a discussion of the results obtained. Finally, the conclusions drawn from the thesis work are presented in Chapter 7.

2 Literature Review

This section reviews prior work in the field of regularization in deep learning generally and in convolutional neural networks in particular. On the one hand, the review provides a detailed picture of the state-of-the-art regularization techniques applied to convolutional neural networks; on the other hand, it reports the latest regularization methods that have recently been developed to address overfitting in deep learning. A comparison of newly developed regularization methods with older ones is also presented in this chapter.

Convolutional neural networks (CNNs) are at the core of most state-of-the-art computer vision solutions for a wide variety of tasks. Since 2014, deep convolutional networks have become mainstream, resulting in substantial gains in various benchmarks. CNNs are the deep learning technique best suited to vision-based tasks because of their architecture, their property of automatic feature extraction, and their local connectivity (each neuron in a CNN is connected to a small subset of the input), which reduces the number of parameters in the network and thus the computational cost of training. Despite these strengths, one main problem with CNNs is overfitting (Tripathy and Jadeja 2015). Overfitting occurs when the model does not generalize well from the training data to unseen data. Regularization is one of the key ingredients of deep learning (Goodfellow, Bengio, and Courville 2016), as it allows the model to generalize well to unseen data.

Dropout (Geoffrey Hinton et al. 2012) is one of the most popular regularization methods used in convolutional neural networks. Dropout addresses the issue of overfitting by randomly dropping units, along with their connections, from the neural network during training. This dropping out reduces the co-adaptation of units and forces every neuron to be able to work independently.

As a result, the network does not depend on any small set of neurons (Srivastava et al. 2014). Dropout was a key ingredient of the systems that won learning competitions such as ImageNet classification (Krizhevsky, Sutskever, and Hinton 2012) and the Merck molecular activity challenge at www.kaggle.com, where it outperformed other methods with great accuracy. Dropout in (Krizhevsky, Sutskever, and Hinton 2012) and (Simonyan and Zisserman 2015) produced state-of-the-art results. Some variants of the dropout method have also been proposed that offer improved empirical results and theoretical motivation.

Figure 1. (a) A neural net without dropout, (b) A neural net with dropout (Srivastava et al. 2014)

Fast dropout (Wang and Manning 2012) was proposed to train a neural network with dropout without actually sampling, thereby using all the data efficiently, which amounts to dropping out the final hidden layer of the neural network. Fast dropout gave improved empirical results for deep neural networks when tested on different datasets. Figure 1 shows a neural net without and with dropout.
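To illustrate the mechanism, the following minimal NumPy sketch (illustrative only, not code from the cited papers; the function name and dropout rate are chosen arbitrarily) applies inverted dropout to a layer of hidden activations: each unit is zeroed with probability p_drop during training and the survivors are rescaled so that the expected activation stays the same, while at test time the layer is left unchanged.

```python
import numpy as np

def dropout_forward(x, p_drop=0.5, training=True):
    """Inverted dropout: zero each activation with probability p_drop
    and rescale the survivors so the expected activation is unchanged."""
    if not training or p_drop == 0.0:
        return x
    mask = (np.random.rand(*x.shape) >= p_drop) / (1.0 - p_drop)
    return x * mask

# Example: hidden activations of one mini-batch (4 examples, 8 units)
h = np.random.randn(4, 8)
h_train = dropout_forward(h, p_drop=0.5, training=True)   # roughly half the units are zeroed
h_test = dropout_forward(h, p_drop=0.5, training=False)   # identity at test time
```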

DropConnect (Wan et al. 2013) is a generalization of dropout, which helps regularize large neural network models. With DropConnect, a fully connected layer becomes a sparsely connected layer in which the connections are chosen randomly during training. DropConnect produced outstanding results on a variety of standard benchmarks. A notable research work, which proposed a deep convolutional neural network architecture called Inception, used DropConnect and produced great results (Szegedy et al. 2014). DropPart (Tomczak 2013) is a generalization of DropConnect in which a Beta distribution is used instead of a Bernoulli distribution. This method was proposed in a study on predicting breast cancer recurrence using a Classification Restricted Boltzmann Machine (classRBM).

The research was done with a real-life dataset consisting of 949 breast cancer cases. Although the study used a classRBM, the author claims that the method can also be applied to deep neural networks. A comparison of dropout and DropConnect presented in (Wan et al. 2013) is shown in Figure 2.
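A minimal sketch of the difference can be written directly in NumPy (illustrative code, not taken from Wan et al. 2013): instead of masking the outputs of a fully connected layer as dropout does, DropConnect samples a Bernoulli mask over the weight matrix itself, so individual connections are dropped.

```python
import numpy as np

def dropconnect_forward(x, W, b, p_drop=0.5):
    """DropConnect: sample a Bernoulli mask over the *weights* of a fully
    connected layer, so individual connections (not whole units) are dropped."""
    mask = (np.random.rand(*W.shape) >= p_drop).astype(W.dtype)
    return x @ (W * mask) + b

# Illustrative shapes: 4 examples, 16 inputs, 8 outputs
x = np.random.randn(4, 16)
W = np.random.randn(16, 8) * 0.1
b = np.zeros(8)
out = dropconnect_forward(x, W, b, p_drop=0.5)
```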

Figure 2. Comparison of dropout and dropconnect (Wan et al. 2013)

Standout (Lei Ba, Kiros, and Hinton 2016), proposed as an adaptive dropout network, yielded better results than other feature learning methods, including denoising auto-encoders and standard dropout, when evaluated on the MNIST and NORB datasets. DropAll (Frazão and Alexandre 2014) is a generalization of Dropout and DropConnect for the regularization of the fully connected layers of convolutional neural networks. DropAll combines the properties of Dropout and DropConnect, that is, one can randomly drop a subset of activations or a subset of weights, so it is possible to apply both methods. DropAll improved the classification errors of networks trained with Dropout and DropConnect on common image classification datasets. Dropout has also been studied from a Bayesian standpoint as Bayesian Dropout (Maeda 2014), and the research reveals that the Bayesian interpretation makes it possible to optimize the dropout rate.

Random Dropout (Bouthillier et al. 2015) is an interpretation of dropout as prior-knowledge-free data augmentation. Random dropout is described as a procedure to generate samples by back-projecting the dropout noise into the input space. The research was carried out by training feed-forward neural networks on the MNIST and CIFAR-10 datasets and produced improved dropout results without adding significant computational cost. Another research work (Gal and Ghahramani 2016) proposed tools to model uncertainty with dropout and developed a framework that casts dropout training in deep neural networks as approximate Bayesian inference in deep Gaussian processes.

Curriculum Dropout (Morerio et al. 2017) is a time scheduling of the probability of retaining neurons in the network and offers an adaptive regularization scheme that smoothly increases the difficulty of the optimization problem. The idea originates from curriculum learning, which means starting easy and then adaptively increasing the difficulty of the learning problem. Curriculum dropout has provided smoother initialization and better weight optimization than standard dropout for training convolutional neural networks (CNNs) and multi-layer perceptrons (MLPs) for image classification on different datasets.

Channel-Out (Wang and JaJa 2013) is a network architecture that addresses the trade-off between a high level of interference and poor utilization of network capacity. Dropout encodes all patterns into each network capacity bin, resulting in full use of network capacity but a high level of interference, while sparse pathway regularization methods result in the least interference but lead to a waste of network capacity. Channel-out combines dropout and sparse pathway encoding in order to utilize the full network capacity and avoid interference at the same time.

The study concluded that the sparse pathway encoding would be effective for designing robust deep networks.

Stochastic Pooling (Zeiler and Fergus 2013) is a novel and effective method for regularizing large convolutional neural networks (CNNs). In this method, conventional deterministic pooling is replaced by a stochastic procedure in which the activation within each pooling region is picked randomly according to a multinomial distribution. Stochastic pooling was tested on a variety of benchmark image datasets (MNIST, CIFAR-10, CIFAR-100, Street View House Numbers) and shown to be not only effective but also free of hyper-parameters to tune. The technique can be combined with other regularization techniques such as dropout, weight decay, and data augmentation with negligible computational overhead.
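The following NumPy sketch (an illustration under the assumption of non-negative activations, e.g. after a ReLU, and non-overlapping 2x2 regions; not the authors' implementation) shows how stochastic pooling samples one activation per pooling region with probability proportional to its value.

```python
import numpy as np

def stochastic_pool_2x2(a):
    """Stochastic pooling over non-overlapping 2x2 regions of one feature map:
    within each region, one (non-negative) activation is sampled with
    probability proportional to its value, instead of taking the max or mean."""
    H, W = a.shape
    out = np.zeros((H // 2, W // 2))
    for i in range(0, H, 2):
        for j in range(0, W, 2):
            region = a[i:i + 2, j:j + 2].ravel()
            total = region.sum()
            probs = region / total if total > 0 else np.full(4, 0.25)
            out[i // 2, j // 2] = np.random.choice(region, p=probs)
    return out

# Example: a 4x4 feature map after a ReLU (activations are non-negative)
fmap = np.maximum(np.random.randn(4, 4), 0)
pooled = stochastic_pool_2x2(fmap)   # shape (2, 2)
```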

Batch Normalization (Batch Norm or BN) (Ioffe and Szegedy 2015) has emerged as a very effective regularization method in deep learning. BN addresses the problem of internal covariate shift by normalizing the features by the mean and variance computed within a mini-batch; it allows high learning rates and makes training less sensitive to initialization. This milestone technique achieved the same accuracy with fewer training steps when applied to a state-of-the-art image classification model (Ioffe and Szegedy 2015). A recent study of batch normalization (Santurkar et al. 2018) unveils the real impact of batch normalization on training deep neural networks beyond internal covariate shift (ICS). The study shows that batch normalization reparametrizes the optimization problem to make it more stable and smooth, that is, BN smooths the optimization landscape. Thus BN makes the gradients more reliable and predictive and makes training significantly faster and less sensitive to hyperparameter choices (Santurkar et al. 2018). Batch normalization plays a vital role in computer vision tasks and has been adopted by all major deep learning frameworks (He et al. 2015). In "Rethinking the Inception Architecture for Computer Vision" (Szegedy et al. 2014), very deep convolutional neural networks are trained with batch normalization and benchmarked on the ILSVRC 2012 classification challenge.
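As an illustration, the batch normalization transform for a fully connected layer can be sketched in a few lines of NumPy (training-time computation only; the running statistics used at inference time are omitted, and gamma and beta denote the learned scale and shift).

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Batch normalization for a fully connected layer during training:
    normalize each feature by the mean and variance computed over the
    mini-batch, then apply the learned scale (gamma) and shift (beta)."""
    mu = x.mean(axis=0)                  # per-feature mean over the batch
    var = x.var(axis=0)                  # per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(32, 10)              # mini-batch of 32 examples, 10 features
gamma, beta = np.ones(10), np.zeros(10)
y = batch_norm_train(x, gamma, beta)
```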

The significance of batch normalization has been highlighted in (Liao and Carneiro 2015), which shows that normalization layers in deep networks with piecewise linear activation functions such as ReLU, leaky ReLU, and parametric ReLU are of great importance. In their experiments on the MNIST, CIFAR-10, CIFAR-100, and SVHN datasets, the authors found that applying batch normalization before the nonlinear activations is key to accelerating training and achieving higher accuracies. The batch normalization method is, however, not well suited to recurrent models such as LSTMs or to reinforcement learning and generative models.

Layer Normalization (LN) (Lei Ba, Kiros, and Hinton 2016) is an advanced form of BN. LN addresses the problem of computation time in training state-of-the-art deep neural networks, which are computationally expensive, and it is very effective for stabilizing the hidden state dynamics of recurrent networks. Batch normalization reduces the training time in feed-forward neural networks, but because of its dependency on the mini-batch size, it is not clear how to apply it to recurrent neural networks. Unlike batch normalization, layer normalization substantially reduces the training time by computing the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case.
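The contrast with batch normalization is easiest to see in code: the sketch below (illustrative NumPy, not the authors' implementation) computes the normalization statistics over the features of each individual example instead of over the mini-batch, so it works even with a batch size of one.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer normalization: statistics are computed over the features of each
    individual example (axis=1), so no mini-batch statistics are needed and
    the same computation applies at every time step of a recurrent network."""
    mu = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(32, 10)
y = layer_norm(x, np.ones(10), np.zeros(10))   # works even with batch size 1
```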

Weight Normalization (WN) (Salimans and Kingma 2016) is a reparametrization of the weight vectors in a neural network that decouples the length of the weight vectors from their direction. Weight normalization is inspired by batch normalization but does not introduce dependencies on the mini-batch. The method has an advantage over batch normalization because of its lower computational overhead, which permits more optimization steps in the same amount of time and speeds up the convergence of stochastic gradient descent.

Unlike batch normalization, weight normalization is well suited to recurrent models such as LSTMs and to reinforcement learning and generative models. A comparison study between batch normalization and weight normalization on a large-scale image classification problem (i.e., ResNet-50 on ImageNet) shows that batch normalization has a much stronger and more stable regularization effect than weight normalization (WN). Weight normalization is thus limited to shallower networks and cannot replace batch normalization in deep neural networks (Gitman and Ginsburg 2017).
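The reparametrization itself is only one line; the sketch below (illustrative NumPy for a single neuron's weight vector) writes the weights as w = g * v / ||v||, so the scalar g carries the length and v only the direction.

```python
import numpy as np

def weight_norm(v, g):
    """Weight normalization: reparametrize a weight vector w as
    w = g * v / ||v||, so its length (g) is learned separately from its
    direction (v / ||v||). No mini-batch statistics are involved."""
    return g * v / np.linalg.norm(v)

v = np.random.randn(16)      # direction parameters of one neuron
g = 2.0                      # learned scalar length
w = weight_norm(v, g)
print(np.linalg.norm(w))     # equals g (up to floating-point error)
```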

Orthogonal Weight Normalization (Huang et al. 2017) generalizes orthogonal square weight matrices to orthogonal rectangular ones. This normalization technique is applicable to both recurrent neural networks (RNNs) and feed-forward neural networks (FNNs). Orthogonal weight normalization solves the problem of optimization over multiple dependent Stiefel manifolds (OMDSM). The methodology of orthogonal weight normalization has improved the performance of state-of-the-art networks, including residual networks and Inception, on the ImageNet and CIFAR datasets.

Group Normalization (GN) (Wu and He 2018) is a simple and effective alternative to batch normalization. In the case of batch normalization, normalizing along the batch dimension introduces problems, and BN's error increases as the batch size decreases. This problem limits the application of BN to computer vision tasks such as video, detection, and segmentation, which require small batches. Group normalization fixes this problem by dividing the channels into groups and computing the mean and variance within each group for normalization. Unlike BN, GN's accuracy is stable over a wide range of batch sizes and its computation is independent of the batch size. Group normalization can also be easily transferred from pre-training to fine-tuning. GN gave 10 percent lower error than BN when used for a ResNet-50 trained on ImageNet with small batch sizes, and it can effectively replace the strong BN baseline in a variety of computer vision tasks such as object detection and segmentation on COCO and video classification on Kinetics, according to the study presented in (Wu and He 2018).
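A minimal NumPy sketch of the computation (illustrative only; assuming a feature map of shape (N, C, H, W) and a number of groups that divides the channel count) is given below: the channels are reshaped into groups and the statistics are taken per example and per group, with no dependence on the batch dimension.

```python
import numpy as np

def group_norm(x, num_groups, gamma, beta, eps=1e-5):
    """Group normalization for a conv feature map of shape (N, C, H, W):
    channels are split into groups and the mean/variance are computed per
    example and per group, independently of the batch size."""
    N, C, H, W = x.shape
    xg = x.reshape(N, num_groups, C // num_groups, H, W)
    mu = xg.mean(axis=(2, 3, 4), keepdims=True)
    var = xg.var(axis=(2, 3, 4), keepdims=True)
    xg = (xg - mu) / np.sqrt(var + eps)
    x_hat = xg.reshape(N, C, H, W)
    return gamma.reshape(1, C, 1, 1) * x_hat + beta.reshape(1, C, 1, 1)

x = np.random.randn(2, 32, 8, 8)                 # works even with a batch of 2
y = group_norm(x, num_groups=8, gamma=np.ones(32), beta=np.zeros(32))
```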

L1 and L2 regularization (kernel regularization). L1 regularization, or the Lasso (Tibshirani 1996), has been used to attain sparse solutions. L1 regularization and its variants, such as group sparsity regularization, appear promising for deep neural networks (Wen et al. 2016) as a way to reduce computation and power consumption. L2 regularization, on the other hand, smooths the parameter distribution and thus reduces the magnitude of the parameters, resulting in a model less prone to overfitting. L2 regularization plays an important role in training deep neural networks (Krizhevsky and Hinton 2012) and has achieved high performance when combined with dropout regularization (Srivastava et al. 2014). However, it has been shown that L2 regularization has no regularizing effect when combined with batch normalization (Laarhoven 2017).
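In practice both penalties are simply added to the data loss, as in the following illustrative sketch (the helper function and the coefficient values are hypothetical, chosen only to show the form of the penalty terms).

```python
import numpy as np

def regularized_loss(data_loss, weights, l1=0.0, l2=1e-4):
    """Total training loss with L1 (sum of absolute values) and L2
    (sum of squares, i.e. weight decay) penalties on the weights.
    The coefficients l1 and l2 are illustrative hyperparameter values."""
    l1_penalty = sum(np.abs(W).sum() for W in weights)
    l2_penalty = sum((W ** 2).sum() for W in weights)
    return data_loss + l1 * l1_penalty + l2 * l2_penalty

weights = [np.random.randn(16, 8), np.random.randn(8, 4)]
total = regularized_loss(data_loss=0.42, weights=weights, l1=1e-5, l2=1e-4)
```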

Shakeout (Kang, Li, and Tao 2016) is a simple and effective regularization method for training deep neural networks. Unlike dropout, which implicitly imposes an L2 regularizer on the weights (Wager, Wang, and Liang 2013), shakeout imposes a combination of L1 and L2 regularization on the weights, which is equivalent to introducing an elastic-net-like regularizer. Empirical evaluation of shakeout regularization on the MNIST and CIFAR-10 datasets shows that shakeout reduces overfitting effectively.

Cutout (DeVries and Taylor 2017) is a simple but effective regularization method for improving the robustness and performance of convolutional neural networks. Cutout is a technique of randomly masking out square regions of the input during training. The method can easily be combined with existing forms of data augmentation and other regularization techniques to improve the performance of convolutional neural networks. Cutout is an extension of dropout, but a comparison of the two methods shows that cutout forces the model to take the full image context into consideration rather than focusing on a few visual features. Another major difference between cutout and dropout is that the units are dropped at the input stage rather than in the intermediate layers, so the visual features removed from the input are correspondingly removed from all subsequent feature maps. In the case of dropout, each feature map is considered individually, so features randomly removed from one feature map may still be present in others. These inconsistencies produce noise, which forces the network to become more robust to noisy inputs. Hence cutout is much closer to data augmentation than to dropout, as it does not create noise but instead produces images that are novel to the network. Cutout regularization produced state-of-the-art results when evaluated on the CIFAR-10, CIFAR-100, and SVHN benchmarks.
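A minimal sketch of the augmentation (illustrative NumPy, assuming an image of shape (H, W, C); the patch size is an arbitrary example value) is shown below: a square patch centered at a random location is zeroed out, and the patch is clipped at the image border.

```python
import numpy as np

def cutout(image, size=8):
    """Cutout: zero out one randomly positioned square patch of the input
    image (H, W, C). The patch is clipped at the image border, so part of
    it may fall outside the image."""
    h, w = image.shape[:2]
    cy, cx = np.random.randint(h), np.random.randint(w)
    y1, y2 = max(cy - size // 2, 0), min(cy + size // 2, h)
    x1, x2 = max(cx - size // 2, 0), min(cx + size // 2, w)
    out = image.copy()
    out[y1:y2, x1:x2, :] = 0
    return out

img = np.random.rand(32, 32, 3)   # e.g. one CIFAR-10 image
aug = cutout(img, size=16)
```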

Shake-Shake Regularization (Gastaldi 2017) is an attempt to produce a softer augmentation than random flips or crops. The idea of shake-shake is to replace the standard summation of residual branches by a stochastic affine combination in a 3-branch ResNet. When empirically tested on the CIFAR-10 dataset, shake-shake regularization showed great performance. A comparison of shakeout and shake-shake shows that both methods use the idea of replacing Bernoulli variables by scaling coefficients, but the starting point of shakeout is dropout whereas that of shake-shake is a mix of FractalNet drop-path and Stochastic Depth. Shakeout keeps the coefficients the same between the forward and backward passes, whereas shake-shake updates them before each pass. Shake-shake works by adding up two residual flows and a skip connection, whereas shakeout requires only one flow.
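The forward pass of such a block can be sketched as follows (illustrative Python; the two branch functions stand in for the residual convolutional paths, and the full method additionally draws a new coefficient for the backward pass, which is not shown here).

```python
import numpy as np

def shake_shake_forward(x, branch1, branch2, training=True):
    """Shake-shake forward pass for a 3-branch residual block: the two
    residual branches are combined with a random convex combination
    x + alpha*b1 + (1-alpha)*b2 during training and averaged at test time."""
    b1, b2 = branch1(x), branch2(x)
    alpha = np.random.rand() if training else 0.5
    return x + alpha * b1 + (1.0 - alpha) * b2

# Illustrative branches (stand-ins for the two residual convolutional paths)
branch1 = lambda x: 0.1 * x
branch2 = lambda x: -0.05 * x
y = shake_shake_forward(np.random.randn(4, 16), branch1, branch2)
```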

Early stopping is a very effective and simple technique. It is done by stopping the training process before the model starts to overfit. At the beginning of training, the error falls for both the training and test sets, but over time the training set error continues to fall while the test set error starts to increase as the model overfits the training data. Early stopping is achieved by training a network until it starts to overfit and then returning to the point that produced the minimum error on the test set (Goodfellow, Bengio, and Courville 2016).

However, it is an expensive and exhausting task to keep an eye on every iteration. A more practical approach is to monitor both the training and test errors and stop training when the test error has not decreased for a few epochs, for example 10 epochs. Figure 3 shows that the training error keeps decreasing while the validation error starts increasing after a certain number of epochs; training should stop at the point after which the validation error starts to increase.
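The procedure can be sketched as a simple training loop (illustrative Python; train_one_epoch and evaluate are placeholders for the actual training and validation routines, and the patience of 10 epochs mirrors the example given above).

```python
def train_with_early_stopping(train_one_epoch, evaluate, max_epochs=200, patience=10):
    """Early stopping: stop when the validation error has not improved for
    `patience` consecutive epochs and keep the best epoch seen so far.
    `train_one_epoch` and `evaluate` are placeholders for the real loops."""
    best_error, best_epoch, epochs_without_improvement = float("inf"), 0, 0
    for epoch in range(max_epochs):
        train_one_epoch()
        val_error = evaluate()
        if val_error < best_error:
            best_error, best_epoch = val_error, epoch
            epochs_without_improvement = 0
            # in practice the model weights would be checkpointed here
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch, best_error
```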

Data augmentation is another very effective regularization technique. Ideally, the best way to train a neural network such as a CNN is to have a large amount of data, but in practice data is always limited. Additional data is simulated or artificially created by applying small transformations to the training data that mimic variations that can appear in unseen data and that do not change the class of the example. In this way the network can be made more robust. For specific classification tasks such as speech recognition (Jaitly and Hinton 2013), data augmentation is a very effective method of regularization. For images, data augmentation can be done by shifting the pixels by one position (up, down, left, right), by changing the color saturation, by taking multiple crops of the image, or by rotating the image by a degree or two.

Figure 3. Graphical representation of early stopping

In the case of image rotations, large rotations are not recommended because they can change the class of the image; for example, a large rotation can turn the digit 9 into a 6 (Goodfellow, Bengio, and Courville 2016). Fancy PCA is another way of doing data augmentation, introduced with AlexNet in 2012. Noise injection at the input of a neural network can also be seen as a form of data augmentation (Sietsma and Dow 1991).
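A minimal sketch of such label-preserving transformations (illustrative NumPy on an image of shape (H, W, C); real pipelines would typically also include crops, rotations, and color jitter) is given below.

```python
import numpy as np

def augment(image):
    """Label-preserving augmentation of one image (H, W, C): a random
    horizontal flip and a random shift of up to one pixel. Small, cheap
    transformations that do not change the class of the example."""
    if np.random.rand() < 0.5:
        image = image[:, ::-1, :]                 # horizontal flip
    dy, dx = np.random.randint(-1, 2, size=2)     # shift by -1, 0 or +1 pixel
    image = np.roll(image, shift=(dy, dx), axis=(0, 1))
    return image

img = np.random.rand(32, 32, 3)
aug = augment(img)
```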

3 Theory

Advanced machine learning and deep learning techniques are providing cutting-edge accuracy in computer vision tasks. Convolutional neural networks have earned great fame in solving tasks such as image classification, image segmentation and registration, image retrieval, and object detection. Because of the great utility of CNNs, researchers in the field are concentrating on further enhancing their performance, and the development of new regularization methods is a key to the high performance of CNNs. As this thesis work focuses on regularization for convolutional neural networks, the theory presents the concepts related to the thesis work. This section first introduces image classification and deep learning on a general level. Then neural networks are explained. Convolutional neural networks are explained in more detail, and finally the regularization methods applied to the CNN models for image classification in this thesis work are explained.