

2.3.5 Neural networks

Artificial neural networks, or neural networks (NNs), are a family of machine learning models that mimic the biology of the human brain and the way it operates. These models can learn to perform specific tasks from provided examples. For instance, we can feed an NN thousands of labeled images of cars, houses, and trees and expect it to find visual patterns that identify cars, houses, and trees in other images.

Neural networks are essentially a set of connected artificial neurons, where the connection strengths between pairs of neurons, called model parameters, are learned by model training. We train a model through a procedure called backpropagation, adjusting the model parameters to minimize a cost function. Typically, the neurons of one layer are connected to neurons in the previous and subsequent layers. An input signal enters the first layer and, after a sequence of non-linear transformations in the intermediate layers, called hidden layers, the result is produced at the output layer. An NN has a minimum of three layers: one input layer, one hidden layer, and one output layer; however, we can add more hidden layers. An NN with an architecture containing more than three layers is called a deep neural network (DNN). Theoretically, a shallow network with enough neurons in the hidden layer can represent any function, but in practice, DNNs perform better because each layer adds its own level of non-linearity, increasing the abstraction capacity. This section introduces the basics of the multilayer perceptron (MLP) and of convolutional neural networks (CNNs), a specialized class of NNs used in computer vision tasks.

2.3.5.1 Multilayer perceptron

An MLP is a model that applies a sequence of non-linear transformations to the input data (Figure 3a). Formally, we describe an MLP with 𝐿 layers as follows (we use notation similar to that of (Farabet, 2013)):

𝑦 = 𝑓(π‘₯, πœƒ) = β„ŽπΏ (8)

β„Žπ‘™= πœŽπ‘™(π‘Šπ‘™β„Žπ‘™βˆ’1+ 𝑏𝑙), βˆ€π‘™ ∈ {1, … , 𝐿} (9)

β„Ž0= π‘₯ (10)

where πœƒ = {π‘Šπ‘™, 𝑏𝑙}, βˆ€π‘™ ∈ {1, … , 𝐿} is the set of trainable parameters consisting of bias parameters 𝑏𝑙 and trainable weight parameters π‘Šπ‘™ for each layer 𝑙, π‘₯ ∈ ℝ𝑑𝑖𝑛 is the input vector, 𝑦 ∈ β„π‘‘π‘œπ‘’π‘‘ is the output vector, and πœŽπ‘™ is a point-wise non-linearity at layer 𝑙. The most common non-linear activation functions for πœŽπ‘™, βˆ€π‘™ ∈ {1, … , 𝐿 βˆ’ 1}

are the hyperbolic tangent and the rectified linear unit. Without activation functions, the whole system is a set of linear operations and could be written as single matrix multiplication.
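As an illustration of the forward pass in Equations (8)–(10), the following Python (NumPy) sketch evaluates a small MLP; the layer sizes, the ReLU non-linearity, and the random initialization are illustrative assumptions rather than part of the formulation above.

```python
import numpy as np

def relu(z):
    # Rectified linear unit, one common choice for the non-linearity sigma_l.
    return np.maximum(0.0, z)

def mlp_forward(x, params):
    # Implements h_l = sigma_l(W_l h_{l-1} + b_l) for l = 1..L, with h_0 = x.
    h = x
    for l, (W, b) in enumerate(params, start=1):
        z = W @ h + b
        # No non-linearity on the last layer here; a task-specific output
        # function (e.g. a softmax) would usually follow.
        h = relu(z) if l < len(params) else z
    return h

# Illustrative sizes: d_in = 4, one hidden layer of 8 units, d_out = 3.
rng = np.random.default_rng(0)
params = [
    (rng.normal(size=(8, 4)), np.zeros(8)),   # (W_1, b_1)
    (rng.normal(size=(3, 8)), np.zeros(3)),   # (W_2, b_2)
]
y = mlp_forward(rng.normal(size=4), params)   # y is a vector in R^3
```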

2.3.5.2 Convolutional neural networks

CNNs are a subset of DNNs that take images as the input; the basic linear layers are replaced with convolutional layers, and a spatial pooling function follows the non-linear activations (Figure 3b). CNNs were developed because solving image-related tasks with a fully connected DNN is impractical even for a modest-sized image: a 200Γ—200Γ—3 image with three color channels contains 120,000 units of information, so a single fully connected neuron in the first layer of an NN would already require 120,000 parameters. With multiple layers in the network, the number of parameters in such a model would quickly increase, resulting in overfitting and computational problems. The architecture of CNNs, on the other hand, possesses two fundamental properties useful for image applications: spatially shared weights and spatial pooling.

Figure 3. (a) A multilayer perceptron model with two hidden layers. (b) The architecture of a convolutional neural network. This figure is generated by adapting the code from Gavin Weiguang Ding’s GitHub.

Another advantage of CNNs is that, by sliding kernels over images (a kernel is a matrix that performs a dot product with the input), they can handle arbitrarily sized images, in contrast to the fixed-size inputs of MLPs. For a CNN in its general form, we can write:

π‘Œ = 𝑓(𝑋, πœƒ) = 𝐻𝐿 (11)

𝐻𝑙= π‘π‘œπ‘œπ‘™π‘™(πœŽπ‘™(π‘Šπ‘™π»π‘™βˆ’1+ 𝑏𝑙)), βˆ€π‘™ ∈ {1, … , 𝐿} (12)

𝐻0= 𝐼 (13)

where πœƒ = {π‘Šπ‘™, 𝑏𝑙}, βˆ€π‘™ ∈ {1, … , 𝐿} is the set of trainable parameters, 𝑋 is the input array of vectors (image 𝐼 is an array of pixels), π‘Œ is an array of output vectors, πœŽπ‘™ is a point-wise non-linearity at layer 𝑙, and π‘π‘œπ‘œπ‘™ is an optional pooling function at layer 𝑙.

The matrices π‘Šπ‘™ are Toeplitz matrices in CNNs, while these matrices can take any general form in MLPs. Each hidden unit array 𝐻𝑙 is a convolution between π‘Šπ‘™ and its previous hidden unit array π»π‘™βˆ’1, transformed through a point-wise non-linearity and a possible pooling layer, written as:

$$H_{lp} = \mathrm{pool}_l\!\left(\sigma_l\!\left(b_{lp} + \sum_{q \in \mathrm{parents}(p)} W_{lpq} * H_{l-1,q}\right)\right), \qquad (14)$$

where 𝐻𝑙𝑝 is the 𝑝-th component, called a feature map, of the 𝑙-th feature vector map 𝐻𝑙. The output of a CNN is connected to a final activation function, e.g., a softmax function, so that the optimization becomes a maximum a posteriori estimation. In the remainder of this section, we describe the three building blocks of a CNN: the convolutional layer, the activation function, and the pooling function. We then describe the learning procedure of NNs, including the loss function and optimization. We conclude this section by briefly describing two important CNN architectures: the fully convolutional network (FCN) and U-Net.

Convolutional layer. A convolutional layer cross-correlates the input with a kernel and adds a scalar bias to produce an output. A convolutional layer thus has two sets of parameters: the kernel and the scalar bias. We typically initialize the kernels randomly when training models. The input of every layer is a 3D array with 𝑛1 2D feature maps of size 𝑛2 Γ— 𝑛3. We denote each component as π‘₯π‘–π‘—π‘˜ and each feature map as π‘₯𝑖. The output 𝑦 is a 3D array with π‘š1 feature maps of size π‘š2 Γ— π‘š3. A trainable kernel 𝑀𝑖𝑗 of size π‘˜1 Γ— π‘˜2 and a bias parameter 𝑏𝑗 connect an input feature map π‘₯𝑖 with an output feature map 𝑦𝑗. For a convolutional module, we have:

$$y_j = b_j + \sum_i (w_{ij} * x_i), \qquad (15)$$

where βˆ— is the 2D convolution operator. Each kernel 𝑀𝑖𝑗 learns a particular feature at every location of the input, enforcing spatial invariance.
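To make Equation (15) concrete, the sketch below computes one output feature map with plain NumPy loops; the "valid" output size and the example array shapes are assumptions chosen for brevity, and practical implementations rely on optimized convolution routines instead.

```python
import numpy as np

def conv2d_valid(x, w):
    # 2D cross-correlation of one feature map x (n2 x n3) with kernel w (k1 x k2),
    # producing a "valid" output of size (n2 - k1 + 1) x (n3 - k2 + 1).
    k1, k2 = w.shape
    out = np.zeros((x.shape[0] - k1 + 1, x.shape[1] - k2 + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(x[r:r + k1, c:c + k2] * w)
    return out

def conv_layer_output_map(x_maps, w_j, b_j):
    # y_j = b_j + sum_i (w_ij * x_i): sum the per-input-map correlations, add bias.
    return b_j + sum(conv2d_valid(x_i, w_ij) for x_i, w_ij in zip(x_maps, w_j))

# Illustrative shapes: n1 = 3 input maps of size 8x8, kernels of size 3x3.
rng = np.random.default_rng(0)
x_maps = rng.normal(size=(3, 8, 8))
w_j = rng.normal(size=(3, 3, 3))   # one kernel w_ij per input map i
y_j = conv_layer_output_map(x_maps, w_j, b_j=0.1)   # 6x6 output feature map
```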

Activation layer. As in the MLP, a point-wise non-linearity, such as the hyperbolic tangent or the ReLU, is applied to each location π‘–π‘—π‘˜ of the feature maps.

Pooling layer. A pooling layer in its simplest form computes the average value over a neighborhood in each feature map. The averaging operation can also be replaced by taking the maximum value within the neighborhood. A pooling operation is typically applied after each activation layer. It serves two purposes: 1) reducing the sensitivity of convolutional layers to location, specifically for low-level features such as edges, and 2) providing a spatially down-sampled representation of the feature maps; by gradually coarsening the feature maps and aggregating information, the network can ultimately learn a global representation of the image and becomes more sensitive to the entire input in deeper layers.
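The following minimal sketch shows non-overlapping 2Γ—2 max pooling on a single feature map; the window size, stride, and the assumption of even input dimensions are illustrative choices.

```python
import numpy as np

def max_pool_2x2(x):
    # Non-overlapping 2x2 max pooling; assumes both dimensions of x are even.
    h, w = x.shape
    # Reshape so each 2x2 window gets its own pair of axes, then reduce over them.
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

feature_map = np.arange(16, dtype=float).reshape(4, 4)
pooled = max_pool_2x2(feature_map)   # shape (2, 2)
# Replacing .max with .mean gives average pooling over the same windows.
```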

Loss function. Once the architecture of an NN is defined, the network can be seen as a function approximator. Assume a training set of 𝑁 training samples 𝐷 = {𝑋, 𝑇} = {π‘₯𝑛, 𝑑𝑛}, 𝑛 ∈ {1, … , 𝑁}, with π‘₯𝑛 an input example, and 𝑑𝑛 a target value associated with that example; π‘‘π‘›βˆˆ {1, … , C}, where 𝐢 is the number of possible target classes. In a classification problem, the network output can be seen as a conditional distribution of the target 𝑑𝑛, given the input π‘₯𝑛 and the parameters πœƒ.

This probability can be modeled by transforming the output of the network 𝑓𝑐(π‘₯𝑛, πœƒ) for each class 𝑐 ∈ {1, … , 𝐢} with a softmax function:

𝑃(𝑑𝑛|π‘₯𝑛, πœƒ) = 𝑒𝑓𝑐(π‘₯𝑛,πœƒ)

βˆ‘ 𝑒𝑗 𝑓𝑗(π‘₯𝑛,πœƒ) . (16)

In the special case where 𝐢 = 2, the softmax function takes the form of the sigmoid function. We consider a multi-categorical problem in which each label 𝑑𝑛 belongs to one of 𝐢 categories. We find model parameters by maximizing the likelihood over the training data 𝐷 with respect to the parameters πœƒ:

πœƒβˆ—= argmax

πœƒ

𝑃(𝑇|𝑋, πœƒ) ,

= argmax

πœƒ

𝑃(𝑑1, 𝑑2, … , 𝑑𝑁|π‘₯1, π‘₯2, … , π‘₯𝑁, πœƒ) ,

= argmax∏ 𝑃(𝑑𝑛|π‘₯𝑛, πœƒ)

𝑁

,

(17)

assuming that the training samples are drawn independently from an identical but unknown distribution (iid). Equation (17) can be rewritten as minimizing the negative log-likelihood:

πœƒβˆ—= argmin

πœƒ

βˆ’ βˆ‘ βˆ‘ πŸπ‘(𝑑𝑛) [𝑓𝑐(π‘₯𝑛, πœƒ) βˆ’ log (βˆ‘ 𝑒𝑓𝑗(π‘₯𝑛,πœƒ)

𝑗

)]

𝑐 𝑛

, (18)

where 𝟏(π‘₯) is the indicator function. The right-hand side function in (18) is

supposed to be minimized as a loss function. Equation (18) does not have a closed solution, and we apply an iterative approach in order to minimize it. Note that there are other types of loss functions, but the negative log-likelihood provides a simple and consistent parameter estimation framework.
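The sketch below evaluates the negative log-likelihood of Equation (18) directly from the network outputs 𝑓𝑐(π‘₯𝑛, πœƒ) (the logits), using the log-sum-exp form for numerical stability; the example logits and targets are made up for illustration.

```python
import numpy as np

def negative_log_likelihood(logits, targets):
    # logits: (N, C) network outputs f_c(x_n, theta); targets: (N,) class indices t_n.
    # Per sample: -(f_{t_n} - log sum_j exp(f_j)), cf. Equation (18).
    shift = logits.max(axis=1, keepdims=True)                       # numerical stability
    log_sum_exp = np.log(np.exp(logits - shift).sum(axis=1)) + shift[:, 0]
    per_sample = -(logits[np.arange(len(targets)), targets] - log_sum_exp)
    return per_sample.sum()                                         # sum over n, as in (18)

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 1.2, 0.3]])
targets = np.array([0, 1])
loss = negative_log_likelihood(logits, targets)
```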

Optimization. Optimization algorithms allow us to update model parameters and minimize the value of the loss function on the training set. The performance of an optimization algorithm affects model training, and correctly tuning its hyperparameters can improve the performance of an NN (Czum, 2020).

The gradient descent method is the most common optimization algorithm used in machine learning. Starting from a randomly initialized set of parameters πœƒ1, the update rule is written as:

$$\theta_{z+1} \leftarrow \theta_z - \eta \nabla_\theta L(\theta, X, T). \qquad (19)$$

It is reasonable to assume that moving in the direction of the negative gradient with small steps decreases the loss function 𝐿. Therefore, in gradient descent, we first choose an initial value for πœƒ and a constant learning rate πœ‚ > 0, and then iterate the update of πœƒ until a stopping condition is reached. However, in large-scale machine learning problems, the gradient descent method can become inefficient, as it requires a full pass over the data at each iteration. Stochastic gradient descent (Robbins and Monro, 1951) reduces the computational cost of each iteration by uniformly sampling an index 𝑖 ∈ {1, … , 𝑁} from the training set to estimate the gradient, and the parameters are updated as:

πœƒπ‘§+1← πœƒπ‘§βˆ’ πœ‚βˆ‡πœƒπΏ(πœƒ, π‘₯𝑖, 𝑑𝑖). (20) Stochastic gradient descent is not particularly computationally efficient since CPUs and GPUs cannot exploit the full power of vectorization. Therefore, we use a midterm approach between stochastic and batch methods called mini-batch gradient descent to optimize the loss function. We use a subset of the training set π‘š < |𝐷| to perform each parameter update:

πœƒπ‘§+1← πœƒπ‘§βˆ’ πœ‚βˆ‡πœƒπΏ(πœƒ, π‘₯𝑖,…,𝑖+π‘š, 𝑑𝑖,…,𝑖+π‘š). (21) There are other variants of these learning rules to reduce the noise in

stochastic directions. For example, the momentum method (Rumelhart, Hinton and Williams, 2013) added a mechanism for averaging over multiple past

gradients, called momentum, to accelerate the convergence. The Adagrad method

(Duchi, Hazan and Singer, 2011) used an aggregate of the squares of previously observed gradients to adjust the learning rate. Thus, the coordinates that persistently corresponded to large gradients were substantially penalized, and others with small gradients were gently penalized. A problem with Adagrad is that the learning rate decreases at a predefined schedule effectively, not ideal for nonconvex problems; therefore, RMSProp (Tieleman and Hinton, 2012) was proposed as a simple fix to decouple rate scheduling from coordinate-adaptive learning rates. Finally, the Adam optimizer (Kingma and Prodanova, 2014) combined all these techniques into one efficient learning algorithm, providing a robust and effective optimization algorithm.
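As a sketch of the mini-batch update in Equation (21), combined with the momentum idea mentioned above, the loop below assumes a user-supplied loss_grad(theta, X_batch, T_batch) function that returns the gradient of the loss on a batch; that function and the hyperparameter values are placeholders rather than part of the methods described here.

```python
import numpy as np

def minibatch_sgd_momentum(theta, X, T, loss_grad,
                           lr=0.01, momentum=0.9, batch_size=32, epochs=10):
    # Mini-batch gradient descent with a momentum buffer (velocity).
    rng = np.random.default_rng(0)
    velocity = np.zeros_like(theta)
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)                       # shuffle the data each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            grad = loss_grad(theta, X[idx], T[idx])      # gradient on the mini-batch
            velocity = momentum * velocity - lr * grad   # running average of past gradients
            theta = theta + velocity                     # parameter update
    return theta
```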

Architecture. The range of architectures and algorithms used in deep learning is wide and varied. Here, we describe two prevalent CNN architectures in deep learning: the FCN and U-Net. The FCN is an architecture used mainly for semantic segmentation, employing locally connected layers, such as convolution, pooling, and up-sampling. An FCN avoids fully connected layers by converting them into 1Γ—1 convolution layers, enabling the network to operate on arbitrarily sized inputs. The network consists of a down-sampling feature extraction path and an up-sampling path, which allows for localization. An FCN combines layers of the feature hierarchy and refines the spatial precision of the output by adding skip connections that recover the fine-grained spatial information lost in the down-sampling path.

A U-Net architecture (Figure 4) modifies and extends the FCN architecture. It has many feature channels in the up-sampling path, allowing the network to propagate context information to higher-resolution layers. The down-sampling and up-sampling paths of a U-Net are roughly symmetric, yielding a U-shaped architecture.
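The following shape-level sketch illustrates a U-Net-style skip connection, where up-sampled decoder features are concatenated with the corresponding encoder features; the channel counts, resolutions, and nearest-neighbour up-sampling are assumptions made for illustration only.

```python
import numpy as np

# Encoder feature map from the down-sampling path: 64 channels at 128x128 (assumed).
encoder_features = np.random.rand(64, 128, 128)

# Decoder feature map arriving from the coarser level: 128 channels at 64x64 (assumed).
decoder_features = np.random.rand(128, 64, 64)

# Up-sample the decoder features by a factor of 2 (nearest neighbour)
# so they match the spatial resolution of the encoder features.
upsampled = decoder_features.repeat(2, axis=1).repeat(2, axis=2)   # (128, 128, 128)

# Skip connection: concatenate encoder and up-sampled decoder features along the
# channel axis; a convolution would then fuse the combined feature maps.
fused = np.concatenate([encoder_features, upsampled], axis=0)
print(fused.shape)   # (192, 128, 128)
```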