
Tuning a neural network or model by selecting preferred parameters, so that the model performs well and achieves higher accuracy, is termed regularization. Regularization is the most effective way of achieving hyperparameter optimization in deep neural networks (Goodfellow, Bengio, and Courville 2016).

Regularization is also defined as any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error (Goodfellow, Bengio, and Courville 2016).

Neural networks are excellent models for approximating highly complex functions, but they are also vulnerable to overfitting. In supervised learning tasks, when the training data is used to train a neural network, it is very common that the loss is very low on the training data but very high on the test data. This leads to overfitting in most complex neural networks. The performance of a model depends on how well it performs on unseen data. There are three main ways to improve this performance:

• Feeding the network with more and better data, thereby giving it more information about the real function.

• Increasing the complexity of the network by adding more layers.

• Regularizing the network so that the network approximates the real function.

All the above-mentioned ways are related to each other, and considering only one of them will not solve the problem. To reduce overfitting in neural networks, the better approach is to combine a high-capacity network, large and better data, and regularization methods.

We have discussed different regularization techniques applied to convolutional neural networks and other deep learning techniques in detail in chapter 2. Here, only those regularization methods are explained which are applied to our CNN model for comparison purposes in this thesis work. From the literature study, it is evident that dropout (Geoffrey Hinton et al. 2012) is the most effective and commonly used regularization method for regularizing convolutional neural networks (Smirnov, Timoshenko, and Andrianov 2014).

Methods like DropConnect (Wan et al. 2013), standout (Lei Ba, Kiros, and Hinton 2016), Dropall (Frazão and Alexandre 2014), random dropout (Bouthillier et al. 2015), and curriculum dropout (Morerio et al. 2017) are all generalizations of dropout. Batch normalization has emerged as another strong and effective regularization method, and it has been a vital part of deep learning frameworks in many computer vision tasks (He et al. 2015). The third regularization method applied to CNNs for comparison purposes in this thesis work is L1 and L2 regularization (named kernel regularization in the Keras documentation), which plays an important role in regularizing deep neural networks. It has been reported in the research that L2 regularization, when combined with dropout, reduces overfitting effectively (Krizhevsky and Hinton 2012). So, it would be interesting to see how L2 behaves when applied to CNNs without other regularization methods. An introductory explanation of the above-mentioned three regularization methods is given below.

3.7.1 Batch Normalization

In deep neural networks, the input of each layer changes during training as the parameters of the previous layers change. As a result, training slows down. This phenomenon is called internal covariate shift (ICS). The problem is addressed by normalizing the layer inputs, a method called batch normalization. During training, each mini-batch is normalized, which allows much higher learning rates (Ioffe and Szegedy 2015). Batch normalization not only reduces overfitting, but also speeds up training by allowing higher learning rates and reducing the sensitivity to the initial weights. Adding a mean parameter and a standard deviation parameter maintains the normalized outputs by keeping the mean activation close to zero and the activation standard deviation close to one. For convolutional layers, normalization should also follow the convolution property, i.e., different elements of the same feature map at various locations are normalized in the same way. So, all the activations in a mini-batch are jointly normalized over all locations, and parameters are learnt per feature map instead of per activation. In traditional deep networks, a very high learning rate may result in gradients that vanish. Batch normalization helps minimize such problems. Normalizing activations throughout the network prevents slight changes in layer parameters from amplifying as the data propagates through a deep network. Batch normalization also makes training more robust to the parameter scale. Large learning rates may increase the scale of layer parameters, which can amplify the gradient during backpropagation and lead to model explosion (Ioffe and Szegedy 2015). A recent study shows that BN adds smoothness to the internal optimization problem of the network (Santurkar et al. 2018).
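As an illustration, the following minimal Keras sketch shows where a BatchNormalization layer is typically inserted in a convolutional block, namely between the convolution and its activation, so that each feature map is normalized over the mini-batch. The architecture, layer sizes, and input shape are illustrative assumptions and not the exact model used in this thesis.

```python
# Illustrative sketch only: layer sizes and input shape are assumed,
# not the exact CNN configuration used in this thesis.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),            # assumed input shape
    layers.Conv2D(32, (3, 3), padding="same"),
    layers.BatchNormalization(),                # per-feature-map normalization over the mini-batch
    layers.Activation("relu"),
    layers.Conv2D(64, (3, 3), padding="same"),
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),     # assumed 10-class output
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```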

We applied BN as one of the regularization methods to regularize our CNN model because, according to the literature study, it is well established that BN is well suited to convolutional neural networks and has been a vital part of deep learning frameworks in many computer vision tasks (He et al. 2015). Apart from reducing internal covariate shift, it also smooths the optimization landscape (Santurkar et al. 2018), thus providing more stability and reliability to the gradients and increasing the training speed significantly. Another good reason for using BN is that batch norm is less sensitive to hyperparameter choices.

3.7.2 Dropout

Dropout (Geoffrey Hinton et al. 2012) is a simple and effective regularization technique. Convolutional neural networks with a massive number of parameters are a strong DL technique, but the problem of overfitting remains to be solved. Dropout addresses the issue of overfitting by randomly dropping units along with their connections from the neural network during training. This dropping out minimizes co-adaptation. Dropout makes every neuron able to work independently; as a result, the network does not depend on a small number of neurons.

During training of the neural network, some neurons together with their connections are randomly dropped to prevent too much co-adaptation. A unit that is temporarily removed from the network with all its incoming and outgoing connections is termed a dropout. Dropout samples a thinned network from within the full neural network and, on the basis of the input data, updates the parameters of the sampled network. The exponential number of sampled networks are therefore dependent, as they share parameters. Dropout reduces overfitting significantly and improves the performance of neural networks on many supervised learning tasks. This technique is not applied during testing and is always applied to input/hidden layer nodes instead of output nodes (Srivastava et al. 2014). The main drawback of dropout is the increased time consumption: a network trained with dropout takes 2-3 times longer to train than a standard neural network (Srivastava et al. 2014). One main reason for this increase is that the parameter updates are very noisy; the gradients being computed are not the gradients of the architecture that will be used at test time, so training takes a long time. If the noise is reduced, training time can be reduced. Therefore, with high dropout we can reduce overfitting at the cost of longer training time. It is practical to start with a low dropout value like 0.2 and then fine-tune. In practice, the dropout ratio is 0.5 (a default value), but it can be tuned in the range 0.1-0.5.
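As a brief illustration, the sketch below shows how dropout layers are typically inserted in a Keras network; the rates of 0.25 and 0.5 follow the common practice described above, and the architecture itself is an assumption rather than the model used in this thesis. Keras applies dropout only during training and disables it automatically at test time.

```python
# Illustrative sketch only: architecture and dropout rates are assumed,
# not the exact configuration used in this thesis.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),                               # assumed input shape
    layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),        # randomly drops 25% of the activations during training
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),         # heavier dropout before the output layer (common default)
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```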

We opted to apply dropout to our CNN model because dropout is the most effective and commonly used regularization method for regularizing convolutional neural networks (Smirnov, Timoshenko, and Andrianov 2014). Another good reason is that dropout is easily applicable to CNNs. Dropout is computationally very cheap, and it does not limit the type of the model. Dropout works well with any type of model that uses distributed representations and can be trained with stochastic gradient descent.

3.7.3 Kernel Regularization

L1 regularization penalizes the absolute value of the weights and tends to drive some weights exactly to zero. L2 regularization penalizes the squared value of the weights and tends to drive all weights to smaller values. L1 and L2 regularization can be combined; this combination is called elastic net regularization. L1 regularization uses the most important inputs and behaves invariantly to the noisy ones. L2 regularization is often preferred over L1, because L2 gives final weight vectors with small, diffuse values. L2 is the most common type of regularization. It is implemented by penalizing the squared magnitude of all parameters directly in the objective. Kernel regularization has produced excellent results in terms of accuracy when applied to convolutional neural networks for visual recognition tasks including handwritten digit recognition, gender classification, ethnic origin recognition and object recognition (Yu, Xu, and Gong 2009). It has been observed that kernel regularization smooths the parameter distribution and reduces the magnitude of the parameters, resulting in a solution that is effective and less prone to overfitting.
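In standard notation (introduced here only for illustration), these penalties are added to the unregularized loss L(w), where w denotes the weight vector and the regularization strength is a tunable hyperparameter:

```latex
% L1 and L2 penalized objectives; L(w) is the unregularized loss,
% w the weight vector, and \lambda the regularization strength.
\begin{align}
  \tilde{L}_{\text{L1}}(w) &= L(w) + \lambda \sum_i |w_i|, \\  % drives some weights exactly to zero
  \tilde{L}_{\text{L2}}(w) &= L(w) + \lambda \sum_i w_i^{2}    % shrinks all weights toward zero
\end{align}
```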

We applied kernel regularization as the third regularization method for comparison purposes in this thesis work because it plays an important role in regularizing deep neural networks. It has been reported in research that L2 regularization, when combined with dropout, reduces overfitting effectively (Krizhevsky and Hinton 2012). So it would be interesting to see how L1 and L2 behave when applied to CNNs without other regularization methods.
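In Keras, these penalties are attached per layer through the kernel_regularizer argument, and the penalty term is then added to the training loss automatically. The sketch below is illustrative only, with assumed layer sizes and penalty strengths rather than the values used in this thesis; the combined elastic net penalty is likewise available as regularizers.l1_l2.

```python
# Illustrative sketch only: layer sizes and penalty strengths are assumed.
from tensorflow.keras import layers, models, regularizers

model = models.Sequential([
    layers.Input(shape=(32, 32, 3)),                              # assumed input shape
    layers.Conv2D(32, (3, 3), activation="relu", padding="same",
                  kernel_regularizer=regularizers.l2(1e-4)),      # L2 penalty on the conv kernel
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu",
                 kernel_regularizer=regularizers.l1(1e-5)),       # L1 penalty on the dense weights
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```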

4 Experimental Setup: Materials and Methods

The following sections of the chapter provide an overview of the datasets, model, and test environment used for the experimental setup.