2.3.5 Neural networks
Artificial neural networks or neural networks (NNs) are a family of machine learning models that mimic the human brain's biology and the way it operates.
These models can learn to perform specific tasks based on the provided examples.
For instance, we can feed an NN thousands of labeled images of cars, houses, and trees and expect it to find visual patterns that identify cars, houses, and trees in other images. Neural networks are essentially a set of connected artificial neurons, where the connection strengths between pairs of neurons, called model parameters, need to be learned through model training. We train a model through a procedure called backpropagation, adjusting the model parameters to minimize a cost function. Typically, the neurons of one layer are connected to the neurons of the previous and subsequent layers. An input signal enters the first layer and, after a sequence of non-linear transformations in the intermediate layers, called hidden layers, is emitted by the output layer. An NN has a minimum of three layers: one input layer, one hidden layer, and one output layer; however, we can add more hidden layers. An NN with an architecture containing more than three layers is called a deep neural network (DNN). Theoretically, a shallow network with enough neurons in its hidden layer can represent any function, but in practice DNNs perform better because each layer adds its own level of non-linearity, increasing the abstraction capacity. This section introduces the basics of the multilayer perceptron (MLP) and of convolutional neural networks (CNNs), a specialized type of NN used in computer vision tasks.
2.3.5.1 Multilayer perceptron
An MLP is a model that applies a sequence of non-linear transformations to the input data (Figure 3a). Formally, we describe an MLP with $L$ layers as follows (we use notation similar to that of Farabet (2013)):
$y = f(x, \theta) = h_L$, (8)
$h_l = \sigma_l(W_l h_{l-1} + b_l), \quad \forall l \in \{1, \ldots, L\}$, (9)
$h_0 = x$, (10)
where $\theta = \{W_l, b_l\}, \forall l \in \{1, \ldots, L\}$ is the set of trainable parameters, consisting of bias parameters $b_l$ and trainable weight parameters $W_l$ for each layer $l$; $x \in \mathbb{R}^{n_{in}}$ is the input vector, $y \in \mathbb{R}^{n_{out}}$ is the output vector, and $\sigma_l$ is a point-wise non-linearity at layer $l$. The most common non-linear activation functions for $\sigma_l, \forall l \in \{1, \ldots, L-1\}$, are the hyperbolic tangent and the rectified linear unit (ReLU). Without activation functions, the whole system would be a set of linear operations and could be written as a single matrix multiplication.
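As an illustration, the following is a minimal NumPy sketch of the forward pass in Eqs. (8)-(10); the layer sizes, the tanh non-linearity, and the random initialization are illustrative assumptions rather than values taken from the text.

```python
import numpy as np

def mlp_forward(x, params, sigma=np.tanh):
    """Forward pass of an MLP following Eqs. (8)-(10):
    h_0 = x, h_l = sigma_l(W_l h_{l-1} + b_l), y = h_L."""
    h = x                                  # h_0 = x, Eq. (10)
    for W, b in params:                    # one (W_l, b_l) pair per layer
        h = sigma(W @ h + b)               # Eq. (9); the final sigma_L is often a softmax
    return h                               # y = h_L, Eq. (8)

# Toy example: a 4-dimensional input, one hidden layer of 8 units, 3 outputs.
rng = np.random.default_rng(0)
params = [(rng.standard_normal((8, 4)), rng.standard_normal(8)),
          (rng.standard_normal((3, 8)), rng.standard_normal(3))]
y = mlp_forward(rng.standard_normal(4), params)
```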
2.3.5.2 Convolutional neural networks
CNNs are a subset of DNNs that take images as input; the basic linear layers are replaced with convolutional layers, and a spatial pooling function follows the non-linear activations (Figure 3b). CNNs were developed because solving image-related tasks with a plain DNN is costly: even a modest-size image, e.g., 200×200 with three color channels, contains 120,000 units of information. A single fully connected neuron in the first layer of an NN would thus require 120,000 parameters. With multiple layers in the network, the number of parameters in such a model quickly increases, resulting in overfitting and computational problems. The architecture of CNNs, on the other hand, possesses two fundamental properties useful for image applications: spatially shared weights and spatial pooling.

Figure 3. (a) A multilayer perceptron model with two hidden layers. (b) The architecture of a convolutional neural network. This figure is generated by adapting the code from Gavin Weiguang Ding's GitHub.
Another advantage of CNNs is that, by sliding kernels over images (a kernel is a matrix that performs a dot product with the input), they can deal with arbitrarily sized images, in contrast to the fixed-sized inputs of MLPs. For a general-form CNN, we can write:
$Y = f(I, \theta) = H_L$, (11)
$H_l = \text{pool}_l(\sigma_l(W_l H_{l-1} + b_l)), \quad \forall l \in \{1, \ldots, L\}$, (12)
$H_0 = I$, (13)
where $\theta = \{W_l, b_l\}, \forall l \in \{1, \ldots, L\}$ is the set of trainable parameters, $I$ is the input array of vectors (the image $I$ is an array of pixels), $Y$ is an array of output vectors, $\sigma_l$ is a point-wise non-linearity at layer $l$, and $\text{pool}_l$ is an optional pooling function at layer $l$.
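A minimal PyTorch sketch of Eq. (12), stacking convolution, point-wise non-linearity, and pooling layer by layer, is shown below; the channel counts, kernel sizes, and choice of PyTorch are assumptions made for illustration. Because the kernels simply slide over the input, the same network also accepts images of different spatial sizes, as noted above.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Layer stack following Eq. (12): H_l = pool_l(sigma_l(W_l H_{l-1} + b_l))."""
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # W_1 * H_0 + b_1
            nn.ReLU(),                                    # sigma_1
            nn.MaxPool2d(2),                              # pool_1
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # W_2 * H_1 + b_2
            nn.ReLU(),                                    # sigma_2
            nn.MaxPool2d(2),                              # pool_2
        )

    def forward(self, image):                             # H_0 = I, Eq. (13)
        return self.layers(image)                         # output H_L, Eq. (11)

net = TinyCNN()
# The same kernels slide over inputs of any spatial size (cf. the fixed-size MLP input).
print(net(torch.randn(1, 3, 64, 64)).shape)    # torch.Size([1, 32, 16, 16])
print(net(torch.randn(1, 3, 200, 200)).shape)  # torch.Size([1, 32, 50, 50])
```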
The matrices $W_l$ are Toeplitz matrices in CNNs, while these matrices can take any general form in MLPs. Each hidden unit array $H_l$ is a convolution between $W_l$ and its previous hidden unit array $H_{l-1}$, transformed through a point-wise non-linearity and a possible pooling layer, written as:

$H_{lp} = \text{pool}_l\left(\sigma_l\left(b_{lp} + \sum_{q \in \text{parents}(p)} w_{lpq} * H_{l-1,q}\right)\right)$, (14)
where $H_{lp}$ is the $p$-th component, called a feature map, of the $l$-th array of feature maps $H_l$. The output of a CNN is connected to a final activation function, e.g., a softmax function, so that the optimization becomes a maximum a posteriori estimation. In the remainder of this section, we describe the three building blocks of a CNN: the convolutional layer, the activation function, and the pooling function. We then describe the learning procedure of NNs, including the loss function and optimization. We conclude this section by briefly describing two important CNN architectures: the fully convolutional network (FCN) and the U-Net.
Convolutional layer. A convolutional layer cross-correlates the input and kernel and adds a scalar bias to produce an output. A convolutional layer has two parameters: the kernel and the scalar bias. We typically initialize the kernels randomly when training models. The input of every layer is a 3D array with $n_1$ 2D feature maps of size $n_2 \times n_3$. We denote each component as $x_{ijk}$ and each feature map as $x_i$. The output $y$ is a 3D array with $m_1$ feature maps of size $m_2 \times m_3$. A trainable kernel $w_{ij}$ of size $l_1 \times l_2$ and a bias parameter $b_j$ connect an input feature map $x_i$ with an output feature map $y_j$. For a convolutional module, we have:

$y_j = b_j + \sum_i (w_{ij} * x_i)$, (15)

where $*$ is the 2D convolution operator. Each kernel $w_{ij}$ learns a particular feature at every location on the input, forcing spatial invariance.
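The following NumPy sketch implements Eq. (15) directly as a sliding cross-correlation plus bias; the explicit loops, array shapes, and "valid" output size are illustrative assumptions (practical implementations use optimized convolution routines).

```python
import numpy as np

def conv_layer(x, w, b):
    """Eq. (15): y_j = b_j + sum_i (w_ij * x_i), as a sliding cross-correlation.
    x: (n1, n2, n3) input feature maps, w: (n1, m1, l1, l2) kernels, b: (m1,) biases."""
    n1, n2, n3 = x.shape
    _, m1, l1, l2 = w.shape
    y = np.zeros((m1, n2 - l1 + 1, n3 - l2 + 1))      # 'valid' output size
    for j in range(m1):                               # each output feature map y_j
        y[j] += b[j]
        for i in range(n1):                           # sum over input feature maps x_i
            for u in range(y.shape[1]):
                for v in range(y.shape[2]):
                    y[j, u, v] += np.sum(w[i, j] * x[i, u:u + l1, v:v + l2])
    return y

# Toy example: 3 input maps of size 8x8, 4 output maps, 2x2 kernels, zero biases.
y = conv_layer(np.random.rand(3, 8, 8), np.random.rand(3, 4, 2, 2), np.zeros(4))
```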
Activation layer. As in the MLP, a point-wise non-linearity such as the hyperbolic tangent or the ReLU is applied to each location $ijk$ of the feature maps.
Pooling layer. A pooling layer in its simplest form computes the average value over a neighborhood in each feature map. The average operation can also be replaced by taking the maximum value within the neighborhood. A pooling operation is typically applied after each activation layer. It serves two purposes: 1) reducing the sensitivity of the convolutional layers to location, specifically for low-level features such as edges, and 2) providing a spatially down-sampled representation of the feature maps; by gradually coarsening the feature maps and aggregating information, the network can ultimately learn a global representation of the image and becomes more sensitive to the entire input in its deeper layers.
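A short PyTorch sketch of average and maximum pooling over 2×2 neighborhoods is given below; the tensor shapes and pooling window are illustrative assumptions.

```python
import torch
import torch.nn as nn

feature_maps = torch.randn(1, 16, 32, 32)         # (batch, channels, height, width)
avg = nn.AvgPool2d(kernel_size=2)(feature_maps)   # average over 2x2 neighborhoods
mx = nn.MaxPool2d(kernel_size=2)(feature_maps)    # maximum over 2x2 neighborhoods
print(avg.shape, mx.shape)                        # both: torch.Size([1, 16, 16, 16])
```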
Loss function. Once the architecture of an NN is defined, the network can be seen as a function approximator. Assume a training set of $N$ training samples $D = \{X, T\} = \{x_i, t_i\}, i \in \{1, \ldots, N\}$, with $x_i$ an input example and $t_i$ the target value associated with that example; $t_i \in \{1, \ldots, C\}$, where $C$ is the number of possible target classes. In a classification problem, the network output can be seen as a conditional distribution of the target $t_i$, given the input $x_i$ and the parameters $\theta$. This probability can be modeled by transforming the output of the network $o_k(x_i, \theta)$ for each class $k \in \{1, \ldots, C\}$ with a softmax function:

$p(t_i \mid x_i, \theta) = \dfrac{e^{o_{t_i}(x_i, \theta)}}{\sum_{k} e^{o_k(x_i, \theta)}}$. (16)
In the special case where $C = 2$, the softmax function takes the form of the sigmoid function. We consider a multi-categorical problem in which each label $t_i$ belongs to one of $C$ categories. We find the model parameters by maximizing the likelihood over the training data $D$ with respect to the parameters $\theta$:

$\theta^{*} = \arg\max_{\theta} p(T \mid X, \theta) = \arg\max_{\theta} p(t_1, t_2, \ldots, t_N \mid x_1, x_2, \ldots, x_N, \theta) = \arg\max_{\theta} \prod_{i} p(t_i \mid x_i, \theta)$, (17)
assuming that the training samples are drawn independently from the same unknown distribution (iid). Equation (17) can equivalently be written as minimizing the negative log-likelihood:
$\theta^{*} = \arg\min_{\theta} \; -\sum_{i}\sum_{k} 1_k(t_i)\left[o_k(x_i, \theta) - \log\left(\sum_{m} e^{o_m(x_i, \theta)}\right)\right]$, (18)
where $1_k(\cdot)$ is the indicator function. The right-hand side of (18) is the loss function to be minimized. Equation (18) has no closed-form solution, so we apply an iterative approach to minimize it. Note that there are other types of loss functions, but the negative log-likelihood provides a simple and consistent parameter-estimation framework.
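The following NumPy sketch evaluates the softmax of Eq. (16) and the negative log-likelihood of Eq. (18) on toy data; the sample count, class count, and random network outputs are illustrative assumptions.

```python
import numpy as np

def softmax(o):
    """Eq. (16): p(t = k | x, theta) = exp(o_k) / sum_m exp(o_m)."""
    e = np.exp(o - o.max(axis=1, keepdims=True))  # shift logits for numerical stability
    return e / e.sum(axis=1, keepdims=True)

def negative_log_likelihood(outputs, targets):
    """Loss of Eq. (18): sum over samples of -log p(t_i | x_i, theta)."""
    probs = softmax(outputs)
    n = outputs.shape[0]
    return -np.sum(np.log(probs[np.arange(n), targets]))

# Toy example: N = 4 samples, C = 3 classes, random network outputs o_k(x_i, theta).
outputs = np.random.randn(4, 3)
targets = np.array([0, 2, 1, 2])   # class labels, 0-indexed here
print(negative_log_likelihood(outputs, targets))
```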
Optimization. Optimization algorithms allow us to update model parameters and minimize the value of the loss function on the training set. The performance of an optimization algorithm affects model training, and correctly tuning its hyperparameters can improve the performance of an NN (Czum, 2020).
The gradient descent method is the most common optimization algorithm used in machine learning. Starting from a randomly initialized set of parameters $\theta_1$, the update is written as:

$\theta_{z+1} \leftarrow \theta_z - \eta \nabla_{\theta} L(\theta, X, T)$. (19)

It is reasonable to assume that moving in the direction of the negative gradient with small steps decreases the loss function $L$. Therefore, in gradient descent, we first choose an initial value for $\theta$ and a constant learning rate $\eta > 0$, and then iteratively update $\theta$ until a stopping condition is reached. However, in large-scale machine learning problems, gradient descent can become inefficient, as it requires a full pass over the data at each iteration. Stochastic gradient descent (Robbins and Monro, 1951) reduces the computational cost of each iteration by uniformly sampling an index $i \in \{1, \ldots, N\}$ from the training set to estimate the gradient, and the parameters are updated as:
$\theta_{z+1} \leftarrow \theta_z - \eta \nabla_{\theta} L(\theta, x_i, t_i)$. (20)

Stochastic gradient descent is not particularly computationally efficient either, since CPUs and GPUs cannot exploit the full power of vectorization. Therefore, we use a middle ground between the stochastic and batch methods, called mini-batch gradient descent, to optimize the loss function. We use a subset of $B < |D|$ training samples to perform each parameter update:
$\theta_{z+1} \leftarrow \theta_z - \eta \nabla_{\theta} L(\theta, x_{i,\ldots,i+B}, t_{i,\ldots,i+B})$. (21)

There are other variants of these learning rules that reduce the noise in the stochastic directions. For example, the momentum method (Rumelhart, Hinton and Williams, 2013) adds a mechanism for averaging over multiple past gradients, called momentum, to accelerate convergence. The Adagrad method (Duchi, Hazan and Singer, 2011) uses an aggregate of the squares of previously observed gradients to adjust the learning rate; coordinates that persistently correspond to large gradients are substantially penalized, while those with small gradients are penalized only gently. A problem with Adagrad is that the learning rate effectively decreases on a predefined schedule, which is not ideal for non-convex problems; RMSProp (Tieleman and Hinton, 2012) was therefore proposed as a simple fix that decouples rate scheduling from coordinate-adaptive learning rates. Finally, the Adam optimizer (Kingma and Ba, 2014) combines these techniques into a single efficient learning algorithm, providing a robust and effective optimizer.
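The following PyTorch sketch illustrates mini-batch gradient descent with the manual update of Eq. (21); the toy data, model, learning rate, and batch size are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Toy data and a one-layer model; sizes are illustrative only.
X = torch.randn(1000, 20)
T = torch.randint(0, 3, (1000,))
model = nn.Linear(20, 3)
loss_fn = nn.CrossEntropyLoss()          # negative log-likelihood of Eq. (18)

eta, batch_size = 0.1, 32                # learning rate and mini-batch size B
for epoch in range(5):
    perm = torch.randperm(len(X))        # sample mini-batches uniformly
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        loss = loss_fn(model(X[idx]), T[idx])
        model.zero_grad()
        loss.backward()                  # gradient of L with respect to theta
        with torch.no_grad():            # Eq. (21): theta <- theta - eta * grad
            for p in model.parameters():
                p -= eta * p.grad

# Replacing the manual update with torch.optim.SGD(momentum=...), Adagrad,
# RMSprop, or Adam gives the adaptive variants discussed above.
```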
Architecture. The range of architectures and algorithms used in deep learning is wide and varied. Here, we describe two prevalent CNN architectures in deep learning: the FCN and the U-Net. The FCN is an architecture used mainly for semantic segmentation, employing locally connected layers such as convolution, pooling, and up-sampling. An FCN avoids fully connected layers by converting them into 1×1 convolution layers, enabling the network to operate on arbitrarily sized inputs. The network consists of a down-sampling feature-extraction path and an up-sampling path, which allows for localization. An FCN combines layers of the feature hierarchy and refines the spatial precision of the output by adding skip connections to recover the fine-grained spatial information lost in the down-sampling path.
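As an illustration of the 1×1-convolution idea, the PyTorch sketch below contrasts a fully connected head with a convolutional head; the channel count, number of classes, and 8× up-sampling factor are illustrative assumptions rather than values from the original FCN.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A fully connected classification head fixes the input size and discards spatial layout...
fc_head = nn.Linear(256 * 7 * 7, 21)

# ...whereas the FCN-style equivalent is a 1x1 convolution, which yields a per-location
# class-score map and accepts feature maps of any spatial size.
conv_head = nn.Conv2d(256, 21, kernel_size=1)

scores = conv_head(torch.randn(1, 256, 13, 17))        # -> (1, 21, 13, 17)
upsampled = F.interpolate(scores, scale_factor=8,      # coarse-to-fine up-sampling
                          mode='bilinear', align_corners=False)
print(upsampled.shape)                                 # torch.Size([1, 21, 104, 136])
```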
The U-Net architecture (Figure 4) modifies and extends the FCN architecture. It has many feature channels in the up-sampling path, allowing the network to propagate context information to higher-resolution layers. The down-sampling and up-sampling paths of a U-Net are nearly symmetric, leading to a U-shaped architecture.
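The PyTorch sketch below shows a deliberately tiny U-Net-style module with one down-sampling step, one up-sampling step, and a skip connection by channel concatenation; all layer sizes are illustrative assumptions, and a real U-Net stacks several such steps in each path.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """One down-sampling and one up-sampling step with a skip connection."""
    def __init__(self, in_ch=1, n_classes=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.bottleneck = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2)
        # The decoder sees the up-sampled features concatenated with the skip connection.
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, n_classes, 1))

    def forward(self, x):
        skip = self.enc(x)                             # down-sampling (contracting) path
        mid = self.bottleneck(self.down(skip))
        up = self.up(mid)                              # up-sampling (expanding) path
        return self.dec(torch.cat([up, skip], dim=1))  # skip connection by concatenation

logits = TinyUNet()(torch.randn(1, 1, 64, 64))         # -> (1, 2, 64, 64) class scores
```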