
Training process

The concepts and building blocks of a neural network are quite simple. However, the large number of neurons and the layer-like structure make the network difficult to train. The same concepts introduced in Section 2.2 for training machine learning models apply to neural networks as well, but some additional methods are needed when training them.

Initially, the weights and biases of the neurons in the network are randomly initialized, and the aim of the training process is to find the optimal values for them. The process is similar to that of other machine learning algorithms, but the number of tunable parameters is very large. With the optimal values, the network can perform the required task correctly.

The loss function of a neural network can be the same as in basic machine learning models. For example, MSE can be used to determine the accuracy of the neural network model. Due to the layered architecture of neural networks, however, the optimization of the neuron weights is done in a slightly more complicated fashion than with simpler machine learning models.
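As a small, hedged illustration of such a loss, the snippet below computes the mean squared error between a network's predictions and the target values; the function name, array shapes, and example values are illustrative choices, not taken from the source.

```python
import numpy as np

def mse_loss(predictions: np.ndarray, targets: np.ndarray) -> float:
    """Mean squared error between network outputs and target values."""
    return float(np.mean((predictions - targets) ** 2))

# Example: four scalar predictions compared against their targets.
print(mse_loss(np.array([0.9, 0.1, 0.4, 0.7]), np.array([1.0, 0.0, 0.0, 1.0])))
```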

The minimization of the loss function is done with gradient descent. When the variable values are changed, the value of the loss function changes as well, either for the better or for the worse.

The "movement" can be seen from the gradient of the loss function. In a simple situation with only two variables, v1 and v2 affecting the loss function L, we can visualize the minimization of the loss function as in Figure 3.3.

The two variables of the model were updated, and the resulting change is shown by the arrow in the figure. As the direction of the arrow shows, the updates were in the correct direction and the loss function became smaller. Continuing approximately in the same direction could help in finding the optimal variable values.

The gradient of a loss function $L$ with respect to $n$ variables is calculated as

$$\nabla L = \left[ \frac{\partial L}{\partial v_1}, \frac{\partial L}{\partial v_2}, \ldots, \frac{\partial L}{\partial v_n} \right]^{T}, \qquad (3.6)$$

where $T$ denotes the transpose operation. The update for each variable $v_i$ can then be calculated using the gradient as

$$v_i = v_i - \mu \nabla L, \qquad (3.7)$$

where $\mu$ is the learning rate. If the variables are changed as drastically as in Figure 3.3, the minimal point might not be found: the large changes would cause the result to overshoot the minimal point. Therefore, the learning rate is chosen to make the changes smaller.

Typical learning rate values are small, between $10^{-1}$ and $10^{-5}$, depending on the complexity and architecture of the network. The learning rate is typically adaptive and might start with a higher value and get smaller during training to fine-tune towards a minimal loss.
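As a minimal sketch of Equations 3.6 and 3.7, the loop below applies gradient descent to a simple two-variable loss $L(v_1, v_2) = v_1^2 + v_2^2$; the loss, starting point, and learning rate are illustrative choices, not taken from the source.

```python
import numpy as np

def loss(v: np.ndarray) -> float:
    # Simple convex loss with its minimum at the origin.
    return float(np.sum(v ** 2))

def gradient(v: np.ndarray) -> np.ndarray:
    # Analytical gradient [dL/dv1, dL/dv2]^T of the loss above.
    return 2 * v

v = np.array([1.5, -2.0])   # initial variable values
mu = 0.1                    # learning rate

for step in range(20):
    v = v - mu * gradient(v)    # update rule of Equation 3.7
    print(step, loss(v))
```

With a much larger learning rate (for this particular loss, any value above 1.0) the same loop overshoots the minimum and diverges, which is why small values are used in practice.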

Stochastic gradient descent is used to make learning by gradient descent faster. In stochastic gradient descent, the gradient $\nabla L$ is calculated with a small batch of training inputs instead of only one training input. Using the mean of the gradients within the batch of training data makes the training process quicker and more accurate. A proper batch size is dependent on the complexity and size of the training dataset and can be adjusted to improve the fitting of the network.
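The batch-averaged gradient described above can be sketched as follows for a simple linear model with an MSE loss; the model, data, batch size, and function name are hypothetical, chosen only to show the averaging step.

```python
import numpy as np

def batch_gradient(w: np.ndarray, x_batch: np.ndarray, y_batch: np.ndarray) -> np.ndarray:
    """Mean gradient of the MSE loss of a linear model y = x @ w over one mini-batch."""
    errors = x_batch @ w - y_batch                      # per-example errors, shape (batch_size,)
    per_example_grads = 2 * errors[:, None] * x_batch   # shape (batch_size, n_features)
    return per_example_grads.mean(axis=0)               # average over the batch

rng = np.random.default_rng(0)
x_batch = rng.normal(size=(32, 3))                 # mini-batch of 32 inputs
y_batch = x_batch @ np.array([1.0, -2.0, 0.5])     # targets from a known weight vector
w = np.zeros(3)
w -= 0.05 * batch_gradient(w, x_batch, y_batch)    # one stochastic gradient descent step
```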

The backpropagation algorithm is used to calculate the gradient of the loss function. If a small change is made to a neuron in the first hidden layer of an MLP network, its output will affect all the neurons in the next layer. Furthermore, the change will continue to slightly change the inputs to all the following layers as well, because of its effect on the next layer. Backpropagation aims to backtrack the small cumulative changes that the updated weights cause. It calculates the errors of the neurons starting from the last layer and going backwards through the network's architecture.

After the error of the output layer is calculated, the hidden layers' errors can be backpropagated. For each hidden layer $l$, the backpropagated error $\delta^l$ is calculated from the error already computed for the previous layer, where $w^{l-1}$ contains the weights of the previous layer and $f(x^l)$ contains the outputs of the layer based on its inputs and activation function. The $\odot$ operator denotes the Hadamard product of the two vectors, which means multiplying the values in the vectors element-wise.

After the hidden layer errors are backpropagated, the gradient of the loss function can be calculated. The gradient of a weight variable from the $k$th neuron in the $(l-1)$th layer to the $j$th neuron in the $l$th layer is calculated as

$$\frac{\partial L}{\partial w^{l}_{jk}} = y^{l-1}_{k} \delta^{l}_{j}, \qquad (3.9)$$

where $y^{l-1}_{k}$ is the output of the neuron in question. The gradient of the neuron's bias variable is calculated as

$$\frac{\partial L}{\partial b^{l}_{j}} = \delta^{l}_{j}. \qquad (3.10)$$
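As a hedged illustration of how these gradients are obtained in practice, the sketch below runs one forward and one backward pass through a tiny fully connected network with a sigmoid activation and a squared-error loss. It uses the standard textbook backpropagation rule, in which a layer's error is obtained from the next layer's weights and error, multiplied element-wise by the activation derivative; the layer sizes, activation, and variable names are illustrative choices, not taken from the source.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# Illustrative 2-3-1 network: weights[l] maps the outputs of layer l to the inputs of layer l+1.
rng = np.random.default_rng(1)
weights = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
biases = [np.zeros(3), np.zeros(1)]

x = np.array([0.5, -1.0])    # one training input
target = np.array([1.0])     # its desired output

# Forward pass: store the pre-activations (zs) and outputs (ys) of every layer.
ys, zs = [x], []
for w, b in zip(weights, biases):
    z = w @ ys[-1] + b
    zs.append(z)
    ys.append(sigmoid(z))

# Output-layer error for a squared-error loss, then backpropagate it layer by layer.
delta = (ys[-1] - target) * sigmoid_prime(zs[-1])
grads_w, grads_b = [], []
for l in range(len(weights) - 1, -1, -1):
    grads_w.insert(0, np.outer(delta, ys[l]))   # dL/dw: error times previous layer's output (cf. Eq. 3.9)
    grads_b.insert(0, delta)                    # dL/db: the error itself (cf. Eq. 3.10)
    if l > 0:
        delta = (weights[l].T @ delta) * sigmoid_prime(zs[l - 1])

# One gradient descent update (Equation 3.7) with learning rate 0.1.
weights = [w - 0.1 * gw for w, gw in zip(weights, grads_w)]
biases = [b - 0.1 * gb for b, gb in zip(biases, grads_b)]
```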

Equation 3.7 can then be used to update the weight and bias values in the network based on the backpropagated gradients. This whole process is done for one batch of data at a time. When all the inputs in the training dataset have been fed through the network, one epoch is said to be completed. The network is trained for a suitable number of epochs to prevent either underfitting or overfitting.
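Putting the pieces together, one epoch is a full pass over the training data, processed one mini-batch at a time. The sketch below shows that bookkeeping for the same illustrative linear model as above; the data, batch size, and number of epochs are arbitrary choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
x_train = rng.normal(size=(200, 3))               # small illustrative training set
y_train = x_train @ np.array([1.0, -2.0, 0.5])    # targets from a known weight vector
w, mu, batch_size, n_epochs = np.zeros(3), 0.05, 20, 5

for epoch in range(n_epochs):
    order = rng.permutation(len(x_train))         # shuffle the training data each epoch
    for start in range(0, len(x_train), batch_size):
        idx = order[start:start + batch_size]     # indices of one mini-batch
        errors = x_train[idx] @ w - y_train[idx]
        grad = (2 * errors[:, None] * x_train[idx]).mean(axis=0)  # batch-averaged gradient
        w = w - mu * grad                          # update rule of Equation 3.7
    loss = np.mean((x_train @ w - y_train) ** 2)   # loss after one full pass, i.e. one epoch
    print(f"epoch {epoch + 1}: MSE {loss:.4f}")
```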

The small and shallow network architecture introduced above is sufficient for simple tasks. To increase the performance, more than just a few hidden layers should be added to form a deep neural network. However, just adding layers does not improve the performance much, as making the network deeper introduces problems for the gradient descent algorithm [7].

The gradients in the first hidden layers of a network tend to have a different scale of values than those in the last hidden layers, and the deeper the network is, the larger the difference. This makes updating the weights with gradient descent difficult. Sometimes the gradient is very small in the first layers of the network and large at the end of the network. On the other hand, the gradient might sometimes be very large in the first layers and very small in the last layers. These two problems are called the vanishing gradient and the exploding gradient problem, respectively.

Both vanishing and exploding gradients are the result of gradients becoming generally unstable the deeper a neural network is. Both stem from the fact that the gradient in the first layers is computed as a product of terms from the later layers; with more layers, the product becomes more unstable.
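This instability can be illustrated numerically: because an early layer's gradient is a product of factors contributed by all later layers, its magnitude shrinks or grows roughly geometrically with depth. The toy sketch below multiplies identical per-layer factors to show the effect; the factor values and the function name are arbitrary illustrations, not taken from the source.

```python
def first_layer_gradient_scale(per_layer_factor: float, depth: int) -> float:
    """Rough magnitude of a first-layer gradient when every later layer
    contributes the same multiplicative factor to the chain of products."""
    return per_layer_factor ** depth

for depth in (5, 20, 50):
    print(depth,
          first_layer_gradient_scale(0.5, depth),   # factors < 1: vanishing gradient
          first_layer_gradient_scale(1.5, depth))   # factors > 1: exploding gradient
```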

Changes in the activation function or in the gradient descent algorithm can help in avoiding unstable gradients. However, layers of fully connected neurons alone should not be used to build the more complex, deep networks. There are other approaches that can be used to form a neural network. The most common deep neural network, especially for working with image data, is the deep convolutional neural network.
