

Figure 3.4. The calculation of the first two values of a 1D convolution: y[1] = 1·0 + 2·0 + 1·1 = 1 and y[2] = 1·0 + 2·1 + 1·2 = 4.

3.3 Convolutional neural network

Convolutional neural networks (CNNs) are used especially in image recognition tasks [7]. They utilize convolutional layers to learn the characteristics of images. Convolutional networks generally attempt to find different features in images based on small neighborhoods of pixels. The neighborhoods are processed with 2D convolutions.

The convolution operation is a mathematical operation that combines two signals to produce a third signal, similarly to how addition combines two numbers to produce a third number [9].

The convolution operation has an input signal and an impulse response, or a convolutional kernel, that are combined to produce an output signal.

The convolution operation is denoted by the asterisk symbol. The convolution of the input signal x[n] with the convolutional kernel k[n] to produce an output signal y[n] is written as

y[n] = x[n] ∗ k[n]. (3.11)

The operation can be thought of as sliding the convolutional kernel along the input signal. At each point, the kernel's values are multiplied with the respective values of the input signal and the products are added together. This is best understood with an example of discrete signals. With an input signal x[n] = [1, 2, 3, 2] and kernel k[n] = [1, 2, 1], the output is y[n] = [1, 4, 8, 10, 7, 2]. Figure 3.4 shows the first few calculations of the convolution. One output value is calculated at each point of the signal.
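The 1D example above can be reproduced with NumPy's `convolve`, which performs exactly this slide-multiply-sum over the zero-padded input (a minimal sketch using the signal and kernel from the text):

```python
import numpy as np

x = np.array([1, 2, 3, 2])   # input signal x[n]
k = np.array([1, 2, 1])      # convolutional kernel k[n]

# Full convolution: the kernel slides over the zero-padded input,
# producing len(x) + len(k) - 1 = 6 output values.
y = np.convolve(x, k)
print(y)                     # [ 1  4  8 10  7  2]
```

Because the example kernel is symmetric, it makes no difference here that `np.convolve` flips the kernel before sliding it.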

A convolution with 2D information, such as images, works with the same idea as the previous 1D version. Instead of sliding a 1D kernel through the 1D input, the 2D kernel moves through the 2D image one row of pixels at a time [7]. At each pixel location, the resulting pixel value in the output image is calculated in the same way by multiplying the respective kernel values with the small neighborhood of values in the input image. Figure 3.5 presents in practice how a convolution works with an image array.


Figure 3.5. 2D convolution. The kernel slides through each part of the input.
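A minimal sketch of such a 2D convolution is shown below; the input values are hypothetical. Note that, like most CNN frameworks, this slides the kernel without flipping it, i.e. it computes a cross-correlation:

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2D convolution as used in CNNs: slide the kernel over the
    image and sum the elementwise products at each position."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.array([[1., 2., 3., 0.],
                  [0., 1., 2., 3.],
                  [3., 0., 1., 2.],
                  [2., 3., 0., 1.]])   # hypothetical 4x4 input
kernel = np.ones((2, 2)) / 4          # a simple 2x2 averaging kernel

print(conv2d(image, kernel))          # 3x3 output feature map
```

Each 2×2 neighborhood of the input contributes one value to the output, so a 4×4 input and a 2×2 kernel yield a 3×3 output.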

The weight values of a convolutional kernel in a neural network are combined with the activation function of the convolutional layer and a bias value. Thus, the output value of a convolution operation with the input neighborhood X and kernel K is

a(K ∗ X + b), (3.12)

where a is the activation function and b the activation bias. Similar activation functions could be used for calculating the values of a convolutional kernel's output and an MLP network's neuron. However, the ReLU (Rectified Linear Unit) activation function has been widely used as the activation function in convolutional neural networks.
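Equation (3.12) can be illustrated for a single neighborhood. The 3×3 input values, kernel and bias below are hypothetical, and ReLU is used as the activation function a:

```python
import numpy as np

def relu(z):
    """ReLU activation: max(0, z)."""
    return np.maximum(0.0, z)

X = np.array([[0., 1., 2.],
              [1., 2., 1.],
              [2., 1., 0.]])       # hypothetical 3x3 input neighborhood
K = np.array([[ 1., 0., -1.],
              [ 1., 0., -1.],
              [ 1., 0., -1.]])     # hypothetical kernel (vertical edge detector)
b = 0.5                            # hypothetical bias

# One output value of the layer: a(K * X + b) as in Equation (3.12).
z = np.sum(K * X) + b              # elementwise products summed, plus bias
print(relu(z))                     # 0.5
```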

The output of a layer's convolution operation is a so-called feature map. Feature maps are also called activation maps because they contain the activations of the layer.

A convolutional layer in a neural network consists of multiple feature maps, each with a different learned kernel.

The number of feature maps is also referred to as the number of channels in the layer.

The feature maps learn to distinguish different features in the image, such as edges or shapes. However, the kernels are quite small relative to the input image, and they can only learn to distinguish very small features, a few pixels in size. For the kernels to learn more general features of an image, the input needs to be made smaller.

Pooling layers are used to simplify, or downsample, the outputs of the convolutional layers.

The output feature map of the convolutional layer is given to the pooling layer, which summarizes small regions of the feature map. With 2×2 pooling, the pooling layer takes areas of 2×2 pixels and summarizes each of them as one pixel value in the pooling layer's output feature map. After a pooling layer, the feature map has halved dimensions and contains a quarter of the input's values. The pooled information can then be given to the next convolutional layer.
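A 2×2 pooling step can be sketched as follows, using the maximum as the summarizing operation; the feature-map values are hypothetical:

```python
import numpy as np

def max_pool_2x2(fm):
    """2x2 max-pooling: each non-overlapping 2x2 region of the feature
    map is summarized by its maximum value, halving both dimensions."""
    h, w = fm.shape
    # Reshape into (rows of blocks, 2, cols of blocks, 2) and take the
    # maximum inside each 2x2 block.
    return fm[:h//2*2, :w//2*2].reshape(h//2, 2, w//2, 2).max(axis=(1, 3))

fm = np.array([[1., 3., 2., 0.],
               [4., 2., 1., 1.],
               [0., 1., 5., 2.],
               [2., 2., 3., 4.]])   # hypothetical 4x4 feature map

print(max_pool_2x2(fm))            # [[4. 2.]
                                   #  [2. 5.]]
```

The 4×4 feature map becomes 2×2: both dimensions are halved and only a quarter of the values remain.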

One common pooling method is max-pooling. It outputs the maximum value of each input area. Max-pooling, like other pooling methods, aims to find the searched-for features in regions of the image. Searched-for features are features such as edges or basic shapes, which the kernel's values have been updated to recognize. If the region contains


Figure 3.6. An example deep convolutional neural network architecture for classifying images between 10 classes. The numbers above the layers represent the layer dimensions.

the searched-for feature, it produces a large activation. The large activation is picked up by the max-pooling layer, as it selects the highest activation values from the feature map.

A functional convolutional neural network is formed by alternating convolutional layers and pooling layers. The first layers, which produce large feature maps, learn very rough features of the image, such as edges, corners and small shapes. The layers at the end of the network have small feature maps that cover large parts of the image, and they can learn to distinguish complex shapes.

After the input dimensions have been cut by a pooling layer, convolutional layers often increase the number of channels. The size of the inputs keeps getting smaller, but the number of learned feature maps grows. Fewer feature maps are needed for the basic low-level features, whereas the high-level features are more varied and benefit from a larger number of channels.

The output of the last convolutional layer is difficult to interpret as an image class. Because of this, one or several fully connected hidden layers can be added to the end of the convolutional network. The fully connected layers are connected to all the values in the last convolutional layer. The final classification probabilities can then be produced by a softmax layer at the end of the network.
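The softmax layer mentioned above turns the raw scores of the last fully connected layer into class probabilities; the score values below are hypothetical:

```python
import numpy as np

def softmax(logits):
    """Convert raw class scores into probabilities that sum to one."""
    e = np.exp(logits - np.max(logits))   # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])        # hypothetical scores for 3 classes
probs = softmax(logits)
print(probs, probs.sum())                 # the probabilities sum to 1
```

The largest score always maps to the largest probability, so the predicted class is unchanged; softmax only rescales the scores into a probability distribution.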

In Figure 3.6, a simple example of a convolutional neural network architecture is presented. The network can be trained to classify small 32×32 input images into 10 classes. Changing the kernel size, the number of layers or other parts of the architecture affects the performance of the network.
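The way the spatial size shrinks while the channel count grows can be traced with simple arithmetic. The layer sizes below are assumptions for illustration, not the exact dimensions of Figure 3.6:

```python
# Trace the feature-map shape through a small hypothetical CNN for a
# 32x32 RGB input (channel counts chosen for illustration only).
size, channels = 32, 3
for out_channels in (16, 32, 64):
    channels = out_channels   # a padded 3x3 convolution keeps the spatial size
    size //= 2                # a 2x2 max-pooling layer halves each dimension
    print(f"conv + pool -> {size}x{size}x{channels}")
# The final 4x4x64 feature maps are flattened and fed to the fully
# connected layers at the end of the network.
```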

As presented earlier, the kernels in different parts of a convolutional neural network learn to find different features. Figure 3.7 presents an example set of features from the second and third layers of a 5-layer convolutional neural network. In the first image, one of the kernels in the second layer seems to recognize corner-like shapes. In the second image, a kernel in the third layer can recognize roughly the shape of a person’s upper body.

There is a max pooling layer between the two convolutional layers which results in the

Figure 3.7. Example features of feature maps in different layers of a convolutional neural network and their respective locations in the input image. Adapted from [10].

different input sizes. Layer 2 has an input size of 26x26 pixels and layer 3 has an input size of 13x13 pixels.

The presented types of layers and the ReLU activation are just examples of what can be used in a deep convolutional neural network. The best approach and architecture depend on the data and the application. There are many different architectures with different approaches to obtaining powerful classification models. In the next chapter, some of these network architectures are presented.

4 DEEP CONVOLUTIONAL NETWORK ARCHITECTURES

There is a huge number of different convolutional neural network architectures. Some are powerful and require a large amount of computing power, while others have worse accuracy but are much faster. Four common deep convolutional neural network architectures that emphasize efficiency in their design are examined. Table 4.1 presents the main features developed in each of the architectures.