
The training process of a neural network can be summarized as a navigation problem in the loss space, using the gradient of the loss function as the guiding signal. The loss space is shaped by the tunable parameters: each combination of parameter values corresponds to a location in a multidimensional space. The loss function measures the current error, and the task of training is to minimize that error.

The choice, or even custom definition, of the loss function is one of the most important decisions in the training process. Probably the most common loss function is the simple L2-loss, which is defined as

L2Loss = \sum_i (Y_{True,i} - Y_{Predicted,i})^2,   (3.8)

where Y_{True} is the ground truth and Y_{Predicted} is the output of the neural network. This definition works relatively well for simple regression problems. However, in many image enhancement or audio processing tasks it does not work well, since the loss is much harder to define. For example, when comparing the similarity of two images, a simple pixel-wise difference does not correlate well with a visual evaluation of the quality.
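As a concrete illustration of equation (3.8), a minimal NumPy sketch of the L2-loss could look like the following; the array contents are arbitrary example values.

```python
import numpy as np

def l2_loss(y_true, y_pred):
    """Sum of squared differences between targets and predictions, eq. (3.8)."""
    return np.sum((y_true - y_pred) ** 2)

# Small regression example with arbitrary values
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.3])
print(l2_loss(y_true, y_pred))  # approximately 0.11
```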

3.5.1 Stochastic gradient descent

As previously mentioned, the principle of gradient descent is to minimize the loss function by moving in the direction in which the value of the loss function decreases the most.

Mathematically the state transition of the weights is formulated as

w_{n+1} = w_n - \lambda \cdot \nabla Loss(w_n),   (3.9)

where w_n is the current state of the weights and w_{n+1} is the next state after the update. The parameter \lambda is the so-called learning rate, which essentially defines the size of the weight updates. A too small \lambda makes the training process very slow, and a too high value makes the algorithm diverge. An example of the descent process is shown in figure 3.14.

The problem with "plain" gradient descent is that computing the gradients for all of the samples in the training set before each step quickly becomes computationally too expensive. Stochastic gradient descent uses the statistical concept of a sample: it picks only a part of the training set that statistically represents the training data and calculates the gradient based on that sample instead of the whole training set. The most used variation of stochastic gradient descent calculates the gradient for a randomly selected mini-batch.
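A minimal sketch of the mini-batch variant of the update in equation (3.9) is shown below; the function and variable names, as well as the toy linear-regression gradient, are illustrative assumptions only.

```python
import numpy as np

def sgd(w, X, y, grad_loss, lr=0.01, batch_size=32, steps=1000, rng=None):
    """Mini-batch stochastic gradient descent, eq. (3.9): w <- w - lr * grad."""
    rng = rng if rng is not None else np.random.default_rng(0)
    for _ in range(steps):
        # Pick a random batch that statistically represents the training set
        idx = rng.choice(len(X), size=batch_size, replace=False)
        w = w - lr * grad_loss(w, X[idx], y[idx])
    return w

# Illustrative gradient of the L2 loss for a linear model y = X @ w
def grad_l2(w, Xb, yb):
    return 2.0 * Xb.T @ (Xb @ w - yb)
```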

A very common modification that makes the method more robust is the addition of momentum. This is achieved by adding a term that carries over the previous weight update.

With Stochastic Gradient Descent with Momentum (SGDM), the state transition of the weights w_n is formulated as

w_{n+1} = w_n - \lambda \cdot \nabla Loss(w_n) + \alpha \Delta w_n,   (3.10)

where \Delta w_n is the weight update applied on the previous step and \alpha is the momentum parameter.

Figure 3.14. Example of the gradient descent process.
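For illustration, the momentum term of equation (3.10) could be added to the update step as in the following sketch; the default \alpha value is an arbitrary example.

```python
def sgdm_step(w, grad, delta_w_prev, lr=0.01, alpha=0.9):
    """One SGDM update, eq. (3.10): the previous update is carried over scaled by alpha."""
    delta_w = -lr * grad + alpha * delta_w_prev
    return w + delta_w, delta_w

# Usage inside a training loop (delta_w starts as zeros with the shape of w):
# w, delta_w = sgdm_step(w, grad_loss(w, X[idx], y[idx]), delta_w)
```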

3.5.2 L2-regularization

Figure 3.15. L2-regularization effect with different λ values. [15]

A challenge of the training process is that the loss space is extremely complicated and contains a large number of local minima in which the training process easily gets stuck. Another common problem is overfitting, which regularization methods aim to prevent.

Probably the most common regularization method is L2-regularization. The principle of the method is to modify the L2-loss function by adding a penalty for extreme weight values. The regularized L2-loss is defined as

L2Loss = \frac{1}{2} \sum_i (Y_{True,i} - Y_{Predicted,i})^2 + \frac{\lambda}{2} \sum_i w_i^2,   (3.11)

where \lambda is the regularization factor and w_i is the value of each weight. [11] The effect of the regularization can be seen in figure 3.15: a higher level of regularization results in a smoother decision boundary.
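As an illustration of equation (3.11), a regularized loss could be computed as in the following sketch; the regularization factor shown is an arbitrary example value.

```python
import numpy as np

def l2_regularized_loss(y_true, y_pred, weights, lam=0.01):
    """Regularized L2 loss, eq. (3.11): data term plus a penalty on large weights."""
    data_term = 0.5 * np.sum((y_true - y_pred) ** 2)
    penalty = 0.5 * lam * np.sum(weights ** 2)
    return data_term + penalty
```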

3.5.3 Dropout

Another classical method for avoiding overfitting is dropout [21]. Regularization of the weights is achieved by randomly zeroing out some of the nodes on each iteration. As each node has a certain probability of being ignored during the training phase, the network is effectively prevented from learning to make decisions based on just a couple of neurons.

The principle of the method is shown in figure 3.16. The effects of dropout regularization on various datasets, such as ImageNet, CIFAR-10 and CIFAR-100, can be found in [21].
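A minimal sketch of the dropout mechanism during training is given below. It uses a common implementation variant (inverted dropout) in which the kept activations are scaled during training; the drop probability is an arbitrary example value.

```python
import numpy as np

def dropout(activations, p_drop=0.5, training=True, rng=None):
    """Randomly zero out units with probability p_drop during training."""
    if not training or p_drop == 0.0:
        return activations
    rng = rng if rng is not None else np.random.default_rng()
    mask = rng.random(activations.shape) >= p_drop
    # Scale the kept activations so their expected value matches test-time behaviour
    return activations * mask / (1.0 - p_drop)
```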

Figure 3.16. Principle of dropout. Image from the original paper: Srivastava, Nitish, et al. "Dropout: a simple way to prevent neural networks from overfitting", JMLR 2014. [21]

3.5.4 Batch normalization

One of the challenges with deep neural networks is that even though the inputs of the network are normalized in preprocessing, the magnitude of the outputs between layers might vary greatly during training. The phenomenon is explained by the fact that each mini-batch used to calculate the gradient in stochastic gradient descent does not have an even distribution, even though the dataset as a whole has. Experiments published in [22] show that normalizing the outputs between each layer of a Convolutional Neural Network (CNN) can have a great effect on the convergence of the network.

Batch normalization BN is formally defined as

BN(x) = \gamma \odot \frac{x - \hat{\mu}_B}{\hat{\sigma}_B} + \beta,   (3.12)

for each sample x in the batch, where \hat{\mu}_B is the mean and \hat{\sigma}_B is the standard deviation of the batch. Parameters \gamma and \beta are learnable scale and shift parameters. [11]

When batch normalization is used with convolutional layers, the normalization is applied after the convolution and before the non-linear activation function. Normalization is performed separately for each of the output channels. In other words, normalization for a convolutional layer with n channels requires n different mean and variance values that are calculated for each channel over all pixel locations and over all the images in the batch. [11]
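As a sketch of equation (3.12) applied channel-wise to convolutional feature maps, the following assumes activations in (batch, channels, height, width) layout and adds a small ε for numerical stability, which the formula above omits.

```python
import numpy as np

def batch_norm_conv(x, gamma, beta, eps=1e-5):
    """Normalize each channel over the batch and all pixel locations, eq. (3.12).

    x: (N, C, H, W) activations; gamma, beta: (C,) learnable scale and shift.
    """
    mu = x.mean(axis=(0, 2, 3), keepdims=True)     # per-channel mean over batch and pixels
    sigma = x.std(axis=(0, 2, 3), keepdims=True)   # per-channel standard deviation
    x_hat = (x - mu) / (sigma + eps)
    return gamma[None, :, None, None] * x_hat + beta[None, :, None, None]
```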

4 PROPOSED METHODS

In this work, SVM and CNN-LSTM models were developed for error level estimation and error correction in a multiradar tracker system. For estimating the current error level, both SVM and CNN-LSTM models were used; for predicting the signed error correction term, a CNN-LSTM model was developed. The CNN-LSTM method predicts the signed altitude correction term and the general altitude error level for the latest track update, whereas the SVM method estimates the average error level over a short window of track updates. The CNN-LSTM model was inspired by recent methods proposed for anomaly detection and household electrical load prediction in publications [23] and [24], and was selected as the basis for the deep learning method due to its suitability for processing time series data.
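Purely as an illustration of the general CNN-LSTM pattern for windowed time-series regression, a minimal Keras-style sketch is shown below. The layer sizes, window length, input features and the choice of framework are assumptions for the example only and do not describe the model proposed in this work.

```python
# Illustrative only: a generic CNN-LSTM regressor for windowed time-series input.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv1D, LSTM, Dense

window_length, n_features = 32, 8   # assumed input: 32 track updates, 8 features each

model = Sequential([
    # 1-D convolutions extract local patterns from the measurement sequence
    Conv1D(32, kernel_size=3, activation="relu",
           input_shape=(window_length, n_features)),
    Conv1D(32, kernel_size=3, activation="relu"),
    # LSTM summarizes the sequence of convolutional features
    LSTM(64),
    # Single linear output, e.g. a signed correction term
    Dense(1),
])
model.compile(optimizer="adam", loss="mse")
```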

The motivation behind the methods is not to create a more accurate radar measurement model or object kinematic model for the tracker, but instead to operate on a higher level and learn dependencies between the models and the tracking algorithm. The orientation of the aircraft in relation to the radars, i.e. the aircraft aspect angle, has a high impact on the SNR values of the measurements, and it would be beneficial if the model could use this information to determine object flight states.

It is also worth noting that the method was developed for a static experiment setup specific to a certain geographical location and radar measurement system. It is very likely that the method would not generalize to other imaging geometries as such. However, the method could be used as a way to adapt to a specific measurement setup by improving altitude accuracy with a relatively small amount of data collected over just a few days.