
Neural networks are a family of machine learning methods that operate by propagating a set of input features through a network of neurons to the output. Neural networks, and especially deep neural networks, are a very fast developing area of research and can be considered the state-of-the-art method for machine learning. According to Google Scholar, neural networks acquired more citations than any other area of research in 2020 [14].

Figure 3.5. Biological neuron and corresponding mathematical model. [15]

The principle behind the artificial neural network is the neuron model, which is somewhat inspired by the biological neural cell. Connections between the neural network nodes correspond to the synapses. Dendrites, the locations where signals enter the cell, are modeled with input weights. Input signals are transferred through the node to the output, which represents the axon of a biological neuron. The biological neuron and the corresponding mathematical neuron model are shown in figure 3.5. It is worth noting that even though all artificial neural network architectures share a somewhat similar concept of neurons, the area of research has branched into multiple drastically different approaches. [11]

The key difference and probably the main advantage of neural networks compared to traditional methods, such as the SVM, is the learned feature extraction. Whereas traditional methods often build on top of handcrafted features that are used as input for the model, neural networks are often fed with the raw input data. Handcrafting functional and robust features for image or speech is an especially difficult task if the model has to operate in an uncontrolled environment. However, in areas such as factory automation, where the model can operate in a highly controlled environment, handcrafted features are still widely used and relevant.

3.4.1 Fully connected neural network

The original neural network structure containing only fully connected layers can be seen as a layered network of nodes, where all the nodes of the previous layer are connected to all the nodes of the next layer. The input is connected to the first layer of nodes and the output is on the last layer. An example of a fully connected neural network with two hidden layers and two output nodes can be seen in figure 3.6. Each output node could, for example, represent the probability of a processed input image belonging to a certain class.

Figure 3.6. Example of the nodes and connections of a fully connected neural network.

At its simplest, the output y of each neuron is calculated as a vector multiplication with the equation

y = σ(Wx + b),    (3.3)

where x is the vector of input values from the previous layer, W is the vector of input weights for the output values of the nodes of the previous layer, and σ is the activation function. In other words, the output of the neuron is formed as a non-linearly transformed weighted sum of the inputs. Additionally, a bias term b is added for each node. The network is trained by slightly altering the weight and bias terms of the network using gradient based optimization. The error of the output is calculated against the ground truth after each pass or in small batches, after which the gradient for each weight is calculated from the output to the input one layer at a time using the chain rule. This is critical in order to know the direction in which the error decreases for each of the parameters being adjusted.
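As a concrete illustration of equation 3.3, the following minimal NumPy sketch computes the output of a single fully connected layer. The dimensions, the random initialization and the use of the logistic sigmoid as the activation are illustrative assumptions, not taken from the text.

import numpy as np

def sigmoid(z):
    # Logistic sigmoid activation, sigma in equation 3.3.
    return 1.0 / (1.0 + np.exp(-z))

def dense_layer(x, W, b):
    # Forward pass of one fully connected layer: y = sigma(Wx + b).
    return sigmoid(W @ x + b)

# Toy dimensions: 4 input features, 3 neurons on the layer.
rng = np.random.default_rng(0)
x = rng.normal(size=4)         # input vector from the previous layer
W = rng.normal(size=(3, 4))    # one weight row per neuron of the layer
b = np.zeros(3)                # one bias term per neuron
y = dense_layer(x, W, b)
print(y.shape)                 # (3,)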

Figure 3.7. Logistic sigmoid activation function.

In practice, there are a lot of problems with vanishing and exploding gradients as the network gets deep. In order to avoid these problems, many different activation functions and modifications to the structure have been developed. The most popular activation functions are the logistic sigmoid shown in figure 3.7 and the Rectified Linear Unit (ReLU) shown in figure 3.8.
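The two activation functions can be written in a few lines; the following NumPy sketch is only meant to make their behaviour explicit and is not part of the referenced figures.

import numpy as np

def logistic_sigmoid(z):
    # Squashes any real input into the range (0, 1); it saturates for large |z|,
    # which is one source of vanishing gradients in deep networks.
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Passes positive values through unchanged and clips negative values to zero,
    # so its gradient does not saturate for positive inputs.
    return np.maximum(0.0, z)

z = np.linspace(-5.0, 5.0, 11)
print(logistic_sigmoid(z))
print(relu(z))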

Figure 3.8. Rectified linear unit activation.

The capacity of a neural network to represent functions depends on the number of neurons in the hidden layers and on the depth of the network. A higher number of neurons allows the network to learn more complex relationships but makes the network harder to train.

The tendency of a neural network to learn features that are specific only to the training examples but are not present in other samples is called overfitting. Generally, the more complex the model is, the higher the risk of overfitting. An example with different numbers of hidden units is presented in figure 3.9.

Figure 3.9. Simple classification problem example using a fully connected neural network with different numbers of hidden units. A higher number of neurons allows the model to learn a more complex decision boundary. [15]

3.4.2 Convolutional layers

Even though neural networks with only fully connected layers are able to outperform traditional machine learning methods, such as the SVM, in many applications, the massive number of learnable parameters is an issue. Convolutional layers drastically reduce the number of parameters to be learned, which allows the use of deeper and more complex models. Additionally, the multilayer convolutional structure naturally directs the perceptive area in a way that the perceptive area increases during the inference through the network towards the output classification/regression layer. The famous AlexNet neural network, which crushed the previous methods in the ImageNet Large Scale Visual Recognition Challenge in 2012 and started the current era of deep learning, was a convolutional neural network. [16]

Figure 3.10. Convolutional kernels of the AlexNet. [17]

Instead of the perceptrons used in the fully connected layers, the convolutional layers build on convolutional kernels. Each kernel, for example of size 3×3, consists of filter coefficients that are used to calculate a value for the center position of the kernel as the filter slides over the input. The filter operates in the same way as, for example, a typical averaging kernel used in image smoothing or the Laplacian kernel used for sharpening.

As the neural network is trained, the coefficients of the convolutional kernels are altered.

An example of the AlexNet convolution kernels learned during training can be seen in figure 3.10. If the AlexNet filter kernels are compared with, for example, the edge detecting Sobel kernel, it can be seen that some of the learned kernels in fact resemble edge detecting filters, which seems intuitive for object detection.
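To illustrate how a kernel slides over its input, the following NumPy sketch implements a plain 2-D convolution with stride 1 and no padding and applies a hand-crafted Sobel kernel; in a convolutional layer the coefficients would instead be learned during training. The image size and random content are illustrative assumptions.

import numpy as np

def conv2d(image, kernel):
    # Slide a k x k kernel over a 2-D image and compute one output value
    # per centre position (stride 1, no padding).
    k = kernel.shape[0]
    out_h = image.shape[0] - k + 1
    out_w = image.shape[1] - k + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

# Hand-crafted edge detecting Sobel kernel (horizontal gradient).
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

image = np.random.default_rng(0).random((8, 8))
print(conv2d(image, sobel_x).shape)   # (6, 6)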

Convolutional layers are usually paired with maxpooling, which is basically an operation for downsampling the spatial resolution of the feature maps. For example, if the first convolutional layer of the neural network has 20 convolutional kernels, the output from the first layer is 20 images filtered with different convolutional kernels. The spatial size of the image is first reduced if the stride of the kernels is greater than 1, after which the maxpooling further squeezes the spatial size of the feature maps. The ratio of how much the spatial resolution is reduced by maxpooling and by the stride of the convolution varies between architectures. Figure 3.11 shows how the 227×227×3 input RGB image is squeezed into a 6×6×256 feature map as it passes through the convolutional feature extraction layers of the AlexNet.
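The spatial squeezing can be followed with the standard output size formula for convolution and pooling. The sketch below traces the first AlexNet stage using commonly cited hyperparameters (an 11×11 convolution with stride 4 followed by 3×3 maxpooling with stride 2); these values are an assumption and not taken from the text.

def conv_out_size(in_size, kernel, stride, padding=0):
    # Standard spatial output size of a convolution or pooling window
    # applied to a square input.
    return (in_size + 2 * padding - kernel) // stride + 1

size = 227                                        # input width/height
size = conv_out_size(size, kernel=11, stride=4)   # 55 after the convolution
size = conv_out_size(size, kernel=3, stride=2)    # 27 after the maxpooling
print(size)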

Figure 3.11. AlexNet convolutional and fully connected layers. Each convolutional layer is followed by maxpooling. The output (layer 8) has 1000 nodes, one for each class of objects in the ImageNet dataset. [18]

3.4.3 Recurrent neural networks

The difference between the RNN and feedforward networks such as the AlexNet [16] is that the RNN has feedback connections, effectively allowing it to have memory between input timesteps. The nodes are connected with one-way connections just like in the feedforward networks; however, some of the connections actually loop back from the output. The input for the next timestep consists of the loop back, which is usually called the hidden state, and the new input. The idea of the loop back and how the information propagates over multiple time steps is illustrated in figure 3.12. A problem with the RNN is that the "memory mechanism" requires the information to be passed through the network on every time step, making the gradient vanish relatively quickly and rendering the memory mechanism effective for just a few time steps.

Since the output is available for each timestep, as seen in figure 3.12, RNN networks can be used for two general tasks. The simplest use case is sequence-to-one, in which a sequence of inputs is fed in and only the output of the last RNN cell is used as the output. The second general use case is sequence-to-sequence transformation, in which the outputs from each of the time steps are used.
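The recurrence and the sequence-to-one use case can be sketched as follows; the tanh non-linearity, the dimensions and the random inputs are illustrative assumptions, while U, V and W correspond to the weights named in figure 3.12.

import numpy as np

def rnn_step(x_t, h_prev, U, W, V, b_h, b_o):
    # One time step of a simple RNN: the new hidden state depends on the
    # current input and on the looped-back hidden state of the previous step.
    h_t = np.tanh(U @ x_t + W @ h_prev + b_h)   # hidden state (the loop back)
    o_t = V @ h_t + b_o                         # output of this time step
    return h_t, o_t

# Sequence-to-one usage: feed a sequence in, keep only the last output.
rng = np.random.default_rng(0)
n_in, n_hidden, n_out, T = 3, 5, 2, 10
U = rng.normal(size=(n_hidden, n_in))
W = rng.normal(size=(n_hidden, n_hidden))
V = rng.normal(size=(n_out, n_hidden))
b_h, b_o = np.zeros(n_hidden), np.zeros(n_out)

h = np.zeros(n_hidden)
for t in range(T):
    x_t = rng.normal(size=n_in)
    h, o = rnn_step(x_t, h, U, W, V, b_h, b_o)
print(o)   # only the output of the last time step is used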

Figure 3.12. RNN idea visualized. Input x, hidden state h and output o for each time step. U, V and W are the weights of the simple RNN network. [19]

3.4.4 Long short-term memory

Long Short-Term Memory (LSTM) is a type of RNN that is well suited for processing time series data such as speech. The idea of the LSTM is to improve the RNN performance with the addition of a cell state mechanism that is controlled with a forget gate. While the loop back still provides the short term memory just as in the traditional RNN, the cell state is an additional structure that allows the node to store information without the loop back, providing more long term memory. Effectively this means that the input for the next time step now has two different types of "loopbacks", the hidden state and the cell state, whereas the traditional RNN only loops back the hidden state. The difference in the structure of the nodes can be seen in figure 3.13.

Figure 3.13. Structure of an LSTM cell. Two outputs for the next time step: hidden state h and cell state c. The cell consists of the forget gate F, input gate I and output gate O. [20]

The input cell state C_{t−1} of each LSTM cell is modified in two different stages, of which the first is element-wise multiplication with the forget signal. The forget signal F_t is formed from the previous hidden state h_{t−1} and the current input x_t passed through the sigmoid activation σ. It can be mathematically formed as

F_t = σ(w_f · [h_{t−1}, x_t] + b_f),    (3.4)

where w_f and b_f are the learnable weights of the forget gate.

After the previous cell state C_{t−1} has been multiplied with the forget signal, the next cell state is formed by adding information with the signal C̃_t, which is formed as the element-wise product of two transformed versions of the input signal, I_sig and I_tanh. Mathematically it is defined as

C̃_t = σ(w_Isig · [h_{t−1}, x_t] + b_Isig) ∗ tanh(w_Itanh · [h_{t−1}, x_t] + b_tanh),    (3.5)

where w_Isig, b_Isig, w_Itanh and b_tanh are the weights. Note that the intermediate signal C̃ and the input signals are named slightly differently in figure 3.13. Finally, the next cell state C_t is formed with the intermediate signals as

C_t = C_{t−1} ∗ F_t + C̃_t.    (3.6)

The output from the current cell o_t is essentially the same as the hidden state h_t, which is calculated as

h_t = σ(w_o · [h_{t−1}, x_t] + b_o) ∗ tanh(C_t),    (3.7)

where w_o and b_o are the weights.
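Equations 3.4-3.7 can be collected into a single cell update. The following NumPy sketch is a minimal illustration of that update, assuming that all gates operate on the concatenation [h_{t−1}, x_t] and that ∗ denotes element-wise multiplication; the dimensions and random parameters are illustrative and not taken from the text.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, c_prev,
              w_f, b_f, w_isig, b_isig, w_itanh, b_itanh, w_o, b_o):
    # One LSTM time step following equations 3.4-3.7.
    z = np.concatenate([h_prev, x_t])                 # [h_{t-1}, x_t]
    f_t = sigmoid(w_f @ z + b_f)                      # forget signal, eq. (3.4)
    c_tilde = sigmoid(w_isig @ z + b_isig) * np.tanh(w_itanh @ z + b_itanh)  # eq. (3.5)
    c_t = c_prev * f_t + c_tilde                      # new cell state, eq. (3.6)
    h_t = sigmoid(w_o @ z + b_o) * np.tanh(c_t)       # hidden state/output, eq. (3.7)
    return h_t, c_t

# Toy dimensions: 4 input features, 3 hidden units.
rng = np.random.default_rng(0)
n_in, n_h = 4, 3
weights = [rng.normal(size=(n_h, n_h + n_in)) for _ in range(4)]
biases = [np.zeros(n_h) for _ in range(4)]
params = [p for pair in zip(weights, biases) for p in pair]
h, c = np.zeros(n_h), np.zeros(n_h)
x = rng.normal(size=n_in)
h, c = lstm_cell(x, h, c, *params)
print(h.shape, c.shape)   # (3,) (3,)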