
Artificial neural networks consist of artificial neurons and connections between them.

These are typically stacked in neural layer (or layer for short) structures – structures with possibly thousands of artificial neurons in parallel. An artificial neuron is a computational model, loosely based on neurons in the human brain.

Neurons in human brains are cells that communicate with other neurons by transmitting electrochemical signals. A typical neuron consists of a cell body, dendrites and an axon.

Neurons receive signals with their dendrites and cell body, and transmit signals through the axon. Most neurons are inactive most of the time – typically only 1–4% of them are active at any given moment [52]. Rather than transmitting continuously, neurons have electrical threshold potentials: if the voltage changes by a large enough amount over a short interval, the neuron fires an all-or-nothing (binary) electrochemical pulse, the so-called action potential.

Artificial neurons (from now on referred to simply as neurons) have a few similarities to biological neurons, but are purely mathematical constructs. A model of a neuron is depicted in Figure 2.8. A neuron can have one or more inputs. These are multiplied, or weighted, with learnable parameters (or terms) called weights. A neuron also contains a bias term, which is a special weight that has no input assigned to it – it is transmitted as a constant. All weighted inputs are then summed together, and the output of the summation is passed to an activation function (transfer function in some literature). The choice of activation function is a design decision in an ANN system – these are explained in more detail at the end of this section. Finally, the output of the activation function is the output of the neuron. The complete mathematical model of a neuron is thus computed as

y(X) = φ(∑_{i=0}^{n} w_i x_i), (2.5)

where x_i and w_i are the ith input parameter and weight, respectively, φ(·) is the activation function and y(X) is the output for the input vector X. Although the output y is a single value, it may be used as an input to multiple neurons.

Figure 2.8. Artificial neuron model. Data are written in italics; operations are written in bold.
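As a concrete illustration of Equation 2.5, a minimal sketch of a single neuron's forward pass is given below. The function name neuron_forward and the use of NumPy are illustrative choices only; the bias is treated as the weight w_0 whose input x_0 is fixed to one, matching the description of the bias being transmitted as a constant.

```python
import numpy as np

def neuron_forward(x, w, activation=np.tanh):
    """Single artificial neuron, Equation (2.5): y = phi(sum_i w_i * x_i).

    The bias is handled as weight w_0 by prepending a constant input
    x_0 = 1, so it is transmitted as a constant alongside the inputs.
    """
    x = np.concatenate(([1.0], x))   # x_0 = 1 carries the bias weight w_0
    z = np.dot(w, x)                 # weighted sum of all inputs
    return activation(z)             # activation (transfer) function phi

# Example: a neuron with three inputs and four weights (w_0 is the bias).
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.2, 0.3])
y = neuron_forward(x, w)             # single scalar output
```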

Considering a practical application, it is worth noting that per neuron, only the weight terms are saved into computer memory – other information is stored for larger units, e.g., activation function information is saved per layer. Since the number of inputs to a neuron maps one-to-one to the number of weights, the connectedness of the neurons is a major factor in the size of a typical ANN model. Weight terms are trainable, meaning that when the ANN model is being trained, these parameters are updated via backpropagation, which is covered in more detail in Subsection 2.2.5.
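To make the effect of connectedness on model size concrete, the following back-of-the-envelope sketch counts the stored weights of a fully connected layer, an illustrative assumption in which every neuron receives every input:

```python
def dense_layer_params(n_inputs: int, n_neurons: int) -> int:
    """Stored parameters of a fully connected layer: one weight per
    input connection plus one bias weight per neuron."""
    return n_neurons * (n_inputs + 1)

# 1000 inputs feeding 1000 fully connected neurons already require
# about a million weights, so connectedness dominates model size.
print(dense_layer_params(1000, 1000))   # 1001000
```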

Activation functions

The choice of activation function depends on the desired properties of the neuron and the ANN. The identity activation function, i.e. φ(x) = x, effectively does nothing, so the output is simply a linear combination of the weighted input parameters. However, the applied activation functions usually introduce a non-linearity, meaning that the contribution of a single input to the output can no longer be determined independently of the other inputs. This property allows ANNs built from such neurons to form non-linear outputs, which in turn allows computing nontrivial functions, such as XOR logic, that are impossible to solve with linear models [26, 34]. Next, three activation functions are introduced: tanh, ReLU and LReLU. They are visualized in Figure 2.9.
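As a small, self-contained illustration of this point, the sketch below computes XOR with two rectified linear units and hand-picked weights. The weights are chosen here purely for demonstration (ReLU itself is defined later in this section); no single linear neuron can reproduce this mapping.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def xor(x1, x2):
    """XOR via two ReLU neurons: relu(x1 + x2) - 2 * relu(x1 + x2 - 1)."""
    h1 = relu(x1 + x2)          # hidden neuron 1
    h2 = relu(x1 + x2 - 1.0)    # hidden neuron 2
    return h1 - 2.0 * h2        # linear output neuron

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor(a, b))  # prints 0, 1, 1, 0
```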

Figure 2.9. Tanh, ReLU and LReLU. Left: Hyperbolic tangent (tanh). Middle: Rectified linear unit (ReLU). Right: Leaky rectified linear unit (LReLU) with negative slope c = 0.1.

A hyperbolic tangent (tanh) function was initially the go-to choice for an activation function in deep learning. It is defined as

φ(x) = tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x}). (2.6)
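The derivative of tanh is 1 − tanh²(x); the short standalone sketch below, a purely numerical illustration, evaluates this slope at a few points and shows how quickly it decays away from the origin:

```python
import numpy as np

def tanh_grad(x):
    """Slope of tanh: d/dx tanh(x) = 1 - tanh(x)**2."""
    return 1.0 - np.tanh(x) ** 2

for x in (0.0, 1.0, 2.5, 5.0):
    print(x, tanh_grad(x))
# Slope is 1.0 at the origin, about 0.42 at x = 1, about 0.03 at x = 2.5
# and roughly 2e-4 at x = 5, i.e. practically zero away from the origin.
```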

The tanh function achieves non-linearity. However, ANNs that use tanh as the activation function suffer from the vanishing gradient problem when trained with gradient-based learning methods and backpropagation, since the slope (∆y/∆x) is close to zero outside the proximity of the origin [34]. Hence, the rectified linear unit (ReLU) is nowadays considered the default choice in ANNs [34]. This transition occurred after the authors in [32] published results demonstrating that models trained with ReLU showed superior performance. ReLU is a simple rectifier activation function, defined as

φ(x) = max(0, x). (2.7)

Since ReLU is very similar to a linear function, it maintains the properties that make linear models easy to optimize with gradient-based methods. Nonetheless, the ReLU activation function has some downsides – most notably the "dying ReLU" problem, which refers to scenarios where a large number of neurons using the ReLU activation function only output values of y = 0. When most of the neurons output zero, gradients can no longer flow through backpropagation and the weights do not get updated. Eventually, a large portion of the ANN stops learning. Additionally, as the slope of ReLU in the negative input range is also zero, such a neuron will not recover. To solve this problem, a leaky rectified linear unit (LReLU) [59] can be used instead. It is defined as

φ(x) = { cx, x < 0;  x, x ≥ 0 }, (2.8)

where c is the slope term for the negative part of the function. A common choice is c = 0.01, but other values may be used as well. LReLU maintains the upsides of the ReLU activation function, but since its slope is not zero in the negative input range, neurons that are stuck in the negative output range can recover via subsequent weight updates. Hence, in many cases it is chosen as the activation function in ANN systems.
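A minimal sketch of the two rectifiers and their slopes, using c = 0.01 as in the text and written purely for illustration, shows why a LReLU neuron can still receive weight updates where a ReLU neuron cannot:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return np.where(x > 0, 1.0, 0.0)     # zero slope for negative inputs

def lrelu(x, c=0.01):
    return np.where(x >= 0, x, c * x)    # Equation (2.8)

def lrelu_grad(x, c=0.01):
    return np.where(x >= 0, 1.0, c)      # small but non-zero slope below zero

x = np.array([-3.0, -0.5, 0.5, 3.0])
print(relu_grad(x))   # [0.   0.   1. 1.]  -> negative-range neurons get no update
print(lrelu_grad(x))  # [0.01 0.01 1. 1.]  -> gradients still flow, neuron can recover
```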