2. Convolutional Neural Networks

2.2 Training and Backpropagation

The original advancement of training neural networks with backpropagation [9] and the development of processing units were crucial for neural networks to work outside of theory.

The idea behind machine learning is to minimize the error of a chosen quality function $Q$ with respect to the weights $w$ for the input from $n$ datapoints:

$$Q(w) = \frac{1}{n} \sum_{i=1}^{n} Q_i(w),$$

This is a sum-minimization problem, where the quality function must be evaluated in each neuron when minimizing the error of the whole network by propagating the error backwards through the network. Furthermore, an increasing number of neurons or weights $w$ and datapoints $n$ quickly makes the minimization problem extremely difficult to calculate efficiently in practice. Therefore, stochastic methods are used to evaluate an intermediate loss for a smaller number of datapoints $i$, which is also called a batch. This is done in Stochastic Gradient Descent (SGD) [10] by iteratively evaluating the gradient of the loss in $i$ samples:

$$w \leftarrow w - \eta \nabla Q_i(w),$$

where $\eta$ is the step size, which is also called the learning rate. There have also been multiple updates to the SGD algorithm, such as the Adaptive Gradient Algorithm (AdaGrad) and Adaptive Moment Estimation (Adam) [11]. AdaGrad introduces individualized learning rates for the parameters, and Adam combines the variable learning rates with second-moment gradients of the loss function.
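As an illustrative sketch (not part of the referenced works), the SGD update above can be written directly with NumPy; the linear model, the gradient function mse_grad and all parameter values below are assumed toy examples:

```python
import numpy as np

def sgd_step(w, x_batch, y_batch, grad_fn, lr):
    """One SGD update: w <- w - lr * gradient of the batch loss."""
    return w - lr * grad_fn(w, x_batch, y_batch)

def mse_grad(w, x, y):
    """Gradient of the l2 loss mean((x @ w - y)^2) for a linear model (toy example)."""
    return 2.0 / len(y) * x.T @ (x @ w - y)

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = x @ true_w

w = np.zeros(3)
for epoch in range(100):
    idx = rng.permutation(len(y))
    for start in range(0, len(y), 10):        # batches of 10 datapoints
        batch = idx[start:start + 10]
        w = sgd_step(w, x[batch], y[batch], mse_grad, lr=0.1)

print(w)  # converges towards [1.0, -2.0, 0.5]
```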

For image-based problems, common quality functions to minimize are for example the Mean Absolute Error (MAE):

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| Y_i - \hat{Y}_i \right|,$$

and the Mean Squared Error (MSE):

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2,$$

where $Y_i$ are the observed values and $\hat{Y}_i$ the predictions. In machine learning MAE is also often called the l1 loss and MSE the l2 loss.
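Both losses can be written in a few lines of NumPy; the toy vectors are assumptions for the example:

```python
import numpy as np

def mae(y, y_hat):
    """l1 loss: Mean Absolute Error."""
    return np.mean(np.abs(y - y_hat))

def mse(y, y_hat):
    """l2 loss: Mean Squared Error."""
    return np.mean((y - y_hat) ** 2)

y = np.array([1.0, 2.0, 3.0])        # observed values (toy example)
y_hat = np.array([1.5, 1.0, 3.0])    # predictions (toy example)
print(mae(y, y_hat))  # (0.5 + 1.0 + 0.0) / 3 = 0.5
print(mse(y, y_hat))  # (0.25 + 1.0 + 0.0) / 3 ≈ 0.417
```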

However, the choice of loss function is not simple and is an important part of optimization.

Furthermore, optimizing a neural network for a given loss function may not yield the best results even when the model is evaluated with that specific loss. For example, in the work by Bako et al. [12] on denoising Monte Carlo rendered images, five different loss functions were tested: l1, relative l1, l2, relative l2 and the Structural Similarity Index Measure (SSIM) [13].

It was found that using the l1 loss in training resulted in the best training convergence and validation performance for every metric, even when the model was optimized for that specific loss. This is an interesting topic, and many innovative new loss functions are being found and used for a variety of tasks. One interesting loss function is the so-called Generative Adversarial Network (GAN) loss, employed e.g. in the work by Ledig et al. [14], in which a perceptual loss function was derived from the high-level feature maps of another image-based neural network.

The parameters such as the learning rate, the loss function and the different topological designs of the model are also called hyperparameters.

2.3 Convolutional Neural Networks

A specialized case of neural networks is the Convolutional Neural Network (CNN) [15].

Unlike conventional fully connected layers, convolutional layers are inspired by human vision [16]. The main difference from fully connected layers is that convolutional layers are connected only to a few inputs within a small window. An example CNN is illustrated in figure 2.3.
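To illustrate the small-window connectivity, below is a deliberately naive sketch of a 2D convolution (technically cross-correlation, as commonly implemented in machine learning frameworks); the averaging kernel is an arbitrary example:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 2D convolution: each output is connected only to a k x k window."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Only the local window contributes, unlike a fully connected layer.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0        # simple 3x3 averaging kernel
print(conv2d_valid(image, kernel))    # 3x3 output feature map
```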

Convolutional layers first became popular for image recognition and classification tasks such as ImageNet [17]. The typical CNN architecture used the convolutional layers first to extract interesting features from the input images, after which the feature set was concatenated into a dense fully connected layer. Later, architectures for image generation and reconstruction have also appeared, for example the UNet [18].

Figure 2.3. Convolutional neural network example for hand-written digit classification. First there are several convolutional layers at different resolutions achieved by pooling, and the result is later flattened into a dense fully connected network with ten outputs giving the class probabilities for the classes 0-9 [15].

Traditionally the image size has been relatively small for image recognition and classification problems, and the inputs to the networks have usually been downscaled versions of the original images. For example, in ImageNet classification tasks the images are usually cropped to 256x256 or 224x224. Also, for classification and recognition tasks, the convolutional features are computed at multiple resolutions, usually by using pooling methods to downscale the original resolution in later convolutional layers. This downscaling is also illustrated in figure 2.3, where after the first layer of convolutions the resolution of the features is halved in two dimensions for each subsequent convolutional layer.

However, in image denoising or super-resolution problems, downscaling the original image would hinder the network's performance on the task. Moreover, the input size affects the computational cost, and the receptive field must be designed for the task. The receptive field of a single standard convolutional layer is the size of the convolutional kernel, and subsequent layers increase it only by the new kernel size. For classification tasks the pooling layers work to increase the receptive field, but for reconstruction problems where the output resolution is the same as the input image, pooling leads to loss of information in the low-resolution pooled layers. There are different solutions to this problem. UNet [18] addresses it with pooling layers and skip connections.

Dilated convolutions [19] address it by spreading the convolutions in a sparse pattern, in the same way as the À Trous algorithm described in section 3.5.2. Some methods simply increase the convolutional kernel size to enlarge the receptive field, as done by Shi et al. in their work on image super-resolution [20].

Figure 2.4. Illustration of the standard CNN receptive field: a standard 5x5 convolution followed by two standard 3x3 convolutions.

As the receptive field describes the area the CNN layer 'sees' [16], the receptive field $r_i$ of a single-layer CNN is just the size of the kernel $k$ itself. After the first layer, applying new convolutions always increases the receptive field by the kernel radius, as the receptive fields of the samples on the edge of the kernel are also caught. Furthermore, the receptive field of convolutional layer $i$ can be written as:

$$r_i = r_{i-1} + (k - 1). \tag{2.8}$$

The receptive field is also illustrated in figure 2.4, where the first layer has 5x5 convolutions and the next two layers 3x3 convolutions, so the receptive field can be calculated with equation 2.8 as $r_1 = 5$, $r_2 = 5 + (3-1) = 7$ and $r_3 = 7 + (3-1) = 9$. From this example it can be seen that just stacking standard convolutions may pose problems when trying to increase the receptive field efficiently.
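As a sanity check of equation 2.8, the recurrence can be evaluated directly; the helper below is a sketch written for this example:

```python
def receptive_field(kernel_sizes):
    """Iterate equation 2.8: r_i = r_{i-1} + (k - 1), starting from r_1 = k_1."""
    fields = [kernel_sizes[0]]
    for k in kernel_sizes[1:]:
        fields.append(fields[-1] + (k - 1))
    return fields

# The stack of figure 2.4: one 5x5 layer followed by two 3x3 layers.
print(receptive_field([5, 3, 3]))  # [5, 7, 9]
```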

As convolutional neural networks can be thought of as only partially connected neural networks, they greatly reduce the complexity of the network compared to a fully connected network. The computational cost of standard convolutions for a single layer can be calculated by [21]:

$$C_K \cdot C_K \cdot I_W \cdot I_H \cdot I_D \cdot O_D, \tag{2.9}$$

where $C_K$ is the dimension of the convolutional kernel, $I_W$, $I_H$, $I_D$ are the input feature dimensions width, height and depth respectively (equivalent to $I$ in the fully connected equation 2.3) and $O_D$ is the number of output channels. The number of parameters in the convolutional neural network can be calculated with $C_K \cdot C_K \cdot I_D \cdot O_D$, as the convolutional kernels are used as a sliding window over the two spatial dimensions $I_W \cdot I_H$. Now consider the example where a 100x100 RGB image is used as the input features, the output features are the same size, and a convolutional kernel of size $C_K = 3$ is used. The computational complexity of the CNN is $3 \cdot 3 \cdot (100 \cdot 100 \cdot 3) \cdot 3 = 810\,000$, compared to $900\,000\,000$ in the case of a fully connected network. The number of parameters is reduced to $3 \cdot 3 \cdot 3 \cdot 3 = 81$, compared to the $900\,000\,000$ of the fully connected network.
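The arithmetic of this example can be reproduced directly from equation 2.9; the variable names below mirror the symbols above:

```python
# Cost of one convolutional layer per equation 2.9.
C_K, I_W, I_H, I_D, O_D = 3, 100, 100, 3, 3

conv_cost = C_K * C_K * I_W * I_H * I_D * O_D   # 810 000 multiply-adds
conv_params = C_K * C_K * I_D * O_D             # 81 shared weights

# A fully connected layer mapping the 100x100x3 input to an equally sized output.
fc_units = I_W * I_H * I_D                      # 30 000 inputs and outputs
fc_cost = fc_units * fc_units                   # 900 000 000
fc_params = fc_cost                             # one weight per connection

print(conv_cost, conv_params)  # 810000 81
print(fc_cost, fc_params)      # 900000000 900000000
```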

This reduction in complexity comes with the consideration that the receptive field of a single feature is a window of size 3x3 over the N input feature maps. In addition, because this window is slid across the feature maps, the M output feature maps cannot have spatially discriminative features, which means that the network cannot, for example, locally select different features in different parts of the image.

Equation 2.9 is valid for one- and two-dimensional convolutions. Furthermore, convolutions which extend the dimensionality to three dimensions are called 3D Convolutional Neural Networks and can be used to consider features in three dimensions, such as temporal data [22]. However, for this thesis the temporal aspect is not considered in the experiments and therefore 3D CNNs are omitted from them. For future work 3D CNNs are an interesting direction, especially when considering adding temporal data to the input set.
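For reference, in a framework such as PyTorch a 3D convolution only adds one spatial (or temporal) dimension to the input tensor; the sizes below are arbitrary examples:

```python
import torch
import torch.nn as nn

# A 3D convolution over (time, height, width); channel counts are arbitrary.
conv3d = nn.Conv3d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
frames = torch.randn(1, 3, 16, 64, 64)    # batch, channels, time, H, W
print(conv3d(frames).shape)               # torch.Size([1, 8, 16, 64, 64])
```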

2.3.1 UNet

UNet was first introduced by Ronneberger et al. for biomedical image segmentation [18].

UNet is a fully convolutional neural network which has encoder and decoder parts, illustrated in figure 2.5. In the encoding phase the network increases the receptive field of the convolutions by using pooling to reduce the resolution of each intermediate layer, as done in multiple recognition networks. After the encoding phase, when the receptive field has been increased appropriately, the decoder phase reconstructs the final image. In each decoding phase the CNN takes the lower-resolution input and upsamples it either analytically, for example with bilinear interpolation, or by using so-called deconvolutional layers to upscale the input. Deconvolutional layers can be thought of as the backwards operation of convolution [18].
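A minimal PyTorch sketch of the encoder/decoder idea with one pooling level, one deconvolutional upsampling and one skip connection is given below; the channel counts and layer choices are illustrative assumptions, not the architecture of [18]:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """One-level UNet sketch: encode with pooling, decode with a learned
    upsampling, and concatenate same-resolution encoder features (skip)."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)                        # halve the resolution
        self.bottleneck = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)  # "deconvolution"
        # The decoder sees the upsampled features concatenated with the skip.
        self.dec = nn.Conv2d(32, 3, 3, padding=1)

    def forward(self, x):
        skip = self.enc(x)                   # full-resolution features
        z = self.bottleneck(self.pool(skip))
        z = self.up(z)                       # back to full resolution
        return self.dec(torch.cat([z, skip], dim=1))

x = torch.randn(1, 3, 64, 64)
print(TinyUNet()(x).shape)  # torch.Size([1, 3, 64, 64])
```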

This kind of network architecture with encoding and decoding can also be used by itself, for example for compression, as done by Rippel and Bourdev for real-time adaptive compression in [23]. But for good-quality image reconstruction such networks lose a lot of spatial information in each consecutive pooling layer. UNet solves this problem by using skip connections. Skip connections propagate the information from the same-resolution layers of the encoding phase to the decoding phase. Moreover, for training purposes the skip connections also propagate the error better to the first layers, which helps with the vanishing gradient problem. Furthermore, the need for skip connections and deconvolutions forces the use of multiple convolutional layers for each separate resolution so as not to lose spatial information when advancing in the network, and for real-time applications this requires more calculations and extra work.

The receptive field with pooling can be calculated as per [19]:

$$r_i = r_{i-1} + (k - 1) \cdot p_i, \tag{2.10}$$

Figure 2.5. UNet with skip connections. UNet increases the receptive field of the network by reducing the resolution of the input image utilizing pooling, and later reuses the intermediate encoder outputs in the decoding phase by using skip connections [18].

Figure 2.6. Illustration of the UNet receptive field over three 3x3 layers. The reduction of the pixels, or the increasing grid size, illustrates the reduction of resolution after pooling layers.

where $p_i$ is the pooling factor. This is also illustrated in figure 2.6, where pooling of size 2 is used in two dimensions.

2.3.2 Dilated Convolutions

Dilated convolutions follow the same idea as the À Trous algorithm introduced in section 3.5.2. The convolutions are 'sparse' in the sense that the distance between the kernel units is controlled with a dilation factor. The effect of the dilation factor can be seen in figure 2.7. This is an interesting way to increase the receptive field inside a CNN without losing spatial information.
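In a framework such as PyTorch, the dilation factor is a plain convolution argument; the sketch below stacks the dilation rates 1, 2 and 4 of figure 2.7, with channel counts chosen arbitrarily and the padding set so that no resolution is lost between the layers:

```python
import torch
import torch.nn as nn

# Three 3x3 convolutions with dilation rates 1, 2 and 4, as in figure 2.7.
# For a 3x3 kernel, padding equal to the dilation keeps the resolution constant.
layers = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, dilation=1, padding=1),
    nn.Conv2d(8, 8, kernel_size=3, dilation=2, padding=2),
    nn.Conv2d(8, 1, kernel_size=3, dilation=4, padding=4),
)

x = torch.randn(1, 1, 32, 32)
print(layers(x).shape)  # torch.Size([1, 1, 32, 32]) — no resolution lost
```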

The receptive field with dilated convolutions can be calculated with the formula derived from [19]:

$$r_i = r_{i-1} + (k - 1) \cdot d_i \prod_{j=1}^{i-1} s_j, \tag{2.11}$$

Figure 2.7. Dilated convolutions with dilation rates 1, 2 and 4, where each dilated convolution is applied to the output of the previous step, thus increasing the effective spatial field size, shown in opaque red [19].

where $r_i$ is the receptive field of the layer, $d_i$ is the dilation factor of the layer, $k$ is the size of the convolutional kernel and $s_j$ is the stride of the convolution. So, for example, in the case shown in figure 2.7, where the convolution kernel $k = 3$, $d = 1, 2, 4$ and there is no stride, the receptive field equation can be simplified to:

$$r_i = r_{i-1} + (k - 1) \cdot d_i, \tag{2.12}$$

and the receptive field would be $r_1 = 3$, $r_2 = 7$ and $r_3 = 15$, as can be seen in figure 2.7 in opaque red. It can be noted that the receptive fields from equations 2.10 and 2.12 look the same, but the significant difference is that with dilated convolutions the resolution of the input is not decreased. Dilated convolutions are thus computationally more expensive, but in exchange they do not lose spatial information between the layers.
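Since equations 2.10 and 2.12 share the recurrence $r_i = r_{i-1} + (k-1) \cdot f_i$, the numbers of figure 2.7 can be checked with a single helper; starting the recurrence from $r_0 = 1$ is an assumption matching the worked example above:

```python
def receptive_field(kernel_sizes, factors):
    """Iterate r_i = r_{i-1} + (k - 1) * f_i (equations 2.10 / 2.12),
    where f_i is the pooling factor p_i or the dilation rate d_i."""
    r = 1
    fields = []
    for k, f in zip(kernel_sizes, factors):
        r += (k - 1) * f
        fields.append(r)
    return fields

# Three 3x3 layers with dilation rates 1, 2 and 4 (figure 2.7):
print(receptive_field([3, 3, 3], [1, 2, 4]))  # [3, 7, 15]
# The same growth as with 2x pooling (figure 2.6), but at full resolution.
```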