Saimaa ringed seal re-identification - Seal pose estimation using convolutional neural networks

Early attempts have been made to automatically identify individual Saimaa ringed seals using computer vision algorithms [21] [6]. The dataset used for this task contained 785 images of 131 individual ringed seals. In both papers, unsupervised segmentation and texture-based superpixel classification were adapted to segment the ringed seals and make the identification process easier. In [21], a mixture of different techniques, including Local Phase Quantization (LPQ) [22] and a support vector machine (SVM) classifier, was applied to obtain better segmentation outcomes.

Segmentation-based fractal texture analysis (SFTA) [23] was calculated by decomposing the binary pixels of the extracted image, and posterior probabilities were applied to obtain the best identification results. Although the outcomes were not satisfactory, this work later became the cornerstone for further research in [6]. The process of segmentation with the adopted approach is shown in Fig. 7.

Figure 7. Segmentation pipeline. [21]

Chehrsimin et al. [6] continued the study by adding morphological operations as post-processing steps to improve the segmentation. The performance of the identification method was measured using two existing generic identification methods, HotSpotter [12] and Wild-ID [11]. These methods require a large set of images of individual ringed seals, including at least one image of the same individual taken from the same side as the query image. However, only 591 images were used in the experiments, as images that did not meet the required quality were eliminated. Despite the lack of a large set of images for each individual ringed seal, HotSpotter yielded promising results. An example of the identification process is shown in Fig. 8.

Figure 8. Identification: left shows a correct identification and right shows an incorrect identification of ringed seals from a query image. [21]

Nepovinnykh et al. [24] achieved 91% identification accuracy using an AlexNet-based CNN architecture on the Saimaa ringed seal dataset. Two methods for identification were considered: AlexNet CNN-based identification, and CNN-based feature extraction combined with SVM-based classification. The seal patterns were extracted from the image background in a segmentation step, which served as preprocessing before the identification step. The segmentation left some unnecessary information in the image, such as the head and tail of the ringed seal, whereas only the subtle variation of the ringed seal body is supposed to be considered in the identification process.

In [5], a method was developed to re-identify individual Saimaa ringed seals by matching their pelage patterns. The steps of re-identification are shown in Fig. 9. The DeepLab model [25] was used to separate the seal from the image background. Additionally, two post-processing steps, hole closing and border smoothing, were used to ensure that the pattern was included in the seal segmentation map. To segment the pattern from the seal's fur, the Sato tubeness filter [26] was applied. To highlight the obtained pattern, histogram normalization, thresholding, morphological opening, and sharpening were used. After that, a Siamese network was trained with a triplet loss function, and a distance metric was used to compute the similarity of the patches using the CNN-based model.
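The triplet loss mentioned above can be illustrated with a minimal sketch. It assumes Euclidean distance between embedding vectors and a hypothetical margin of 0.2; neither the distance metric nor the margin used in [5] is stated here, and the embeddings below are made-up two-dimensional vectors rather than CNN outputs.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on Euclidean distances between embeddings.

    The loss is zero once the negative is farther from the anchor than
    the positive by at least `margin` (a hypothetical value).
    """
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

# Toy embeddings (illustrative only, not taken from any seal dataset).
anchor = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])    # same individual: close to the anchor
negative = np.array([-1.0, 0.0])   # different individual: far from the anchor

easy = triplet_loss(anchor, positive, negative)  # well separated, no penalty
hard = triplet_loss(anchor, negative, positive)  # roles swapped, penalized
print(easy, hard)
```

Training with such a loss pulls patches of the same individual together in the embedding space and pushes patches of different individuals apart, which is what makes a plain distance usable as a similarity score afterwards.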

Figure 9. Algorithm for the segmentation. [5]

The ultimate aim of the study [5] was to improve the re-identification of individual ringed seals over existing approaches by predicting a unique identifier for each animal, given a dataset of known individuals comprising query and gallery images. The similarity of these images is ranked by dividing them into patches and using heatmap similarity (the heatmaps show the similarity between regions of the gallery image and the query image), as shown in Fig. 10; topology-preserving projections are then applied to filter the candidates. Finally, the ranking of the candidates is calculated based on the lowest average weight of topologically similar projections.

The proposed re-identification algorithm [5] brought promising results over the existing methods for Saimaa ringed seals (see Fig. 11). However, during the evaluation of the network, the authors were confronted with the following problems:

• Unnecessary noise hinders the network from simplifying the task throughout the training process.

• An enormous variation in pose and illumination, such as lighting settings, pattern discernibility, and obscurities, is also inevitable in non-processed images.

Figure 10. Heatmap similarity between the region of gallery and query images of the known individual ringed seal from a large dataset: left indicates a query image and right is from a gallery image. [5]

Figure 11. Saimaa re-identification: yellow indicates a query image, green a correct match, and red a misclassified image. [5]

3 CONVOLUTIONAL NEURAL NETWORKS

Artificial neural networks (ANNs), often known as neural networks, are computational models that imitate the neural function of the human brain. This thesis primarily focuses on the convolutional neural network (CNN), which is a particular case of neural networks.

A neural network often involves several processors, arranged as neurons, that work in parallel [27]. Neurons are generally organized into layers of units, including the input layer, the output layer, and many hidden layers. The units (nodes) are connected to each other and have specific weights and biases. The weight associated with each link reflects the strength of that connection. The first layer receives the original input information.

Figure 12. Neural network architecture. [27]

Each neuron receives its inputs multiplied by weights (initialized randomly); the weighted inputs are summed, combined with a constant bias value (unique to each neuron layer), and passed to an activation function, which determines the neuron's final output value. Once the output of the final network layer is produced, the loss function is evaluated, and backpropagation is applied to adjust the weights so as to minimize the loss [27]. The main task of the neural network is to estimate a function $f$ which takes a data sample $x$ as input and assigns it to its corresponding label $y$. Here, the purpose is to learn a function that can be computed as

$$\hat{Y} = f(x) \qquad (1)$$

where $\hat{Y}$ is the output tensor of predicted categories, $f$ is a learnable function, and $x$ is a sample of input data. The function $f$ is parameterized by a set of parameters $\theta$, which are generally learned from a collection of data for which the labels are known. During the training phase, the network makes a prediction and incurs a cost, or loss, based on the correct result [28]. The backpropagation algorithm is used to adjust the network based on this loss: it updates the individual neurons so that the network makes better predictions. The backpropagation process begins at the network's end, with a single loss value depending on the output, and updates neurons in reverse order, with the neurons at the network's beginning updated last.
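As a rough illustration of the forward pass and a single backpropagation update, the sketch below uses one neuron with a sigmoid activation and a squared-error loss. The input values, the label, the learning rate `lr`, and the choice of loss are all assumptions made for the example; they are not taken from [27] or [28].

```python
import numpy as np

rng = np.random.default_rng(0)

x = np.array([0.5, -1.0, 2.0])   # one input sample (made-up values)
y = 1.0                          # its label
w = rng.normal(size=3)           # randomly initialized weights
b = 0.0                          # bias

def forward(x, w, b):
    """Weighted sum plus bias, passed through a sigmoid activation."""
    z = w @ x + b
    return 1.0 / (1.0 + np.exp(-z))

def loss(y_hat, y):
    """Squared-error loss (an assumption; other losses are common)."""
    return 0.5 * (y_hat - y) ** 2

y_hat = forward(x, w, b)
loss_before = loss(y_hat, y)

# Backpropagation via the chain rule:
# dL/dz = dL/dy_hat * dy_hat/dz = (y_hat - y) * y_hat * (1 - y_hat)
dz = (y_hat - y) * y_hat * (1.0 - y_hat)
lr = 0.5                         # hypothetical learning rate
w = w - lr * dz * x              # dL/dw = dz * x
b = b - lr * dz                  # dL/db = dz

loss_after = loss(forward(x, w, b), y)
print(loss_before, loss_after)
```

In a multi-layer network the same chain rule is applied layer by layer, starting from the loss and moving back towards the input.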

3.1 Architecture and building blocks

A CNN is a multi-layer deep neural network model that features high fault tolerance, self-learning, and parallel processing capabilities [28]. It is a multi-layer perceptron designed to recognize 2D images, with local connections and weight sharing, and with a topology similar to that of a biological neural network. The primary idea of a CNN is to detect complete objects by using convolution filters to identify image edges, object parts, and patterns. The neurons in a CNN are organized in a three-dimensional structure, with each group of neurons evaluating a small region or feature of the image [29].

CNN architecture is typically composed of distinct building blocks that transform input data into output data [28]. These include the convolutional layer, the pooling layer, and the fully connected layer (see Fig. 13). Feature extraction is performed by both convolutional and pooling layers, whereas a fully connected layer transforms the extracted features into the final output such as classification and object detection.

3.1.1 Input layer

The input layer, which can handle multi-dimensional data, represents the image fed into the CNN. Every image is a matrix of pixel values. In most cases, the input layer of a two-dimensional convolutional neural network receives two-dimensional data. The inclusion of different color channels in color images, particularly RGB (Red, Green, Blue) images, adds an extra depth dimension to the data and makes the input three-dimensional.
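The difference between a two-dimensional and a three-dimensional input can be seen from array shapes. This is a small sketch; the 224 × 224 resolution and the channels-last layout are assumptions chosen for illustration.

```python
import numpy as np

# Grayscale image: height x width
gray = np.zeros((224, 224), dtype=np.uint8)

# RGB image: height x width x channels (channels-last layout assumed)
rgb = np.zeros((224, 224, 3), dtype=np.uint8)

# A mini-batch of two RGB images adds a leading batch dimension
batch = np.stack([rgb, rgb])

print(gray.shape, rgb.shape, batch.shape)
```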

Figure 13. Architectural design of CNN.

3.1.2 Convolutional layer

Convolutional layers are the foundation of a CNN because they contain learned kernels (weights) that extract features, allowing different images to be distinguished from one another. In the convolutional layer, the input image is convolved with small matrices (for example, 3x3, 5x5, or 7x7) called kernels or filters. Each convolution kernel element corresponds to a weight coefficient and a bias vector, similar to a neuron of a feedforward network [28]. The convolution operation can be expressed as

$$f_l^k(p, q) = \sum_{x,y} i_c(x, y) \cdot e_l^k(u, v) \qquad (2)$$

where $f_l^k(p, q)$ is an element of the feature matrix, $c$ is the channel index, $i_c(x, y)$ is an element of the input tensor $I_c$ for channel $c$, and $e_l^k(u, v)$ is an element of the kernel $k$ of layer $l$.
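The operation in Eq. (2) can be sketched in NumPy as a sliding-window sum of element-wise products. The sketch assumes stride 1, no padding, and no bias term, and it also sums over the channel index, which is the usual convention but an assumption here. As in most deep learning libraries, the kernel is not flipped, so strictly speaking this is a cross-correlation.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid (no padding), stride-1 convolution of an (H, W, C) image
    with a (kh, kw, C) kernel, producing one (H-kh+1, W-kw+1) feature map."""
    H, W, C = image.shape
    kh, kw, kc = kernel.shape
    assert C == kc, "kernel depth must match the number of image channels"
    out = np.zeros((H - kh + 1, W - kw + 1))
    for p in range(out.shape[0]):
        for q in range(out.shape[1]):
            patch = image[p:p + kh, q:q + kw, :]
            # Element-wise product summed over x, y and channels
            out[p, q] = np.sum(patch * kernel)
    return out

# Toy single-channel 4x4 image with values 0..15 and a 3x3 kernel of ones
image = np.arange(16, dtype=float).reshape(4, 4, 1)
kernel = np.ones((3, 3, 1))
feature_map = conv2d(image, kernel)
print(feature_map)
# [[45. 54.]
#  [81. 90.]]
```

A convolutional layer applies many such kernels to the same input, and each kernel yields its own feature map.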

3.1.3 Pooling layer

Feature extraction is done in the convolutional layer, and the pooling layer receives the output feature map for feature selection and information filtering. The pooling layer typically performs a maximum or average operation on the output of the convolutional layers to preserve the most relevant information and shrink the spatial size of the output [13].

Max pooling is one of the best-known pooling operations: it extracts patches from the input feature maps, outputs the maximum value in each patch, and discards all the other values.

A max-pooling with a kernel size of 2 × 2 and a stride of 2 is commonly used in practice. This downsamples the in-plane dimensions of the feature maps by a factor of two.

The operation of the pooling layer can be expressed as

$$Z_l^k = g_p(F_l^k) \qquad (3)$$

where $l$ and $k$ denote the layer and the feature-map (kernel) index, respectively, $Z_l^k$ represents the pooled feature map of the $l$-th layer for the $k$-th input feature map $F_l^k$, and $g_p$ denotes the type of pooling operation.
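A 2 × 2 max pooling with a stride of 2 can be sketched as follows. The feature-map values are made up for the example, and the function handles a single two-dimensional feature map only.

```python
import numpy as np

def max_pool2d(fmap, size=2, stride=2):
    """Max pooling over a single 2D feature map (no padding)."""
    H, W = fmap.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.empty((out_h, out_w), dtype=fmap.dtype)
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            # Keep only the largest value in each patch
            out[i, j] = fmap[r:r + size, c:c + size].max()
    return out

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 5],
                 [7, 0, 3, 3],
                 [1, 6, 2, 8]])
pooled = max_pool2d(fmap)
print(pooled)
# [[4 5]
#  [7 8]]
```

The 4 × 4 input becomes a 2 × 2 output, so each in-plane dimension is halved.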

3.1.4 Fully connected layer

The fully connected layer of a convolutional neural network is equivalent to the hidden layer in a classic feedforward neural network: every neuron in the preceding layer is connected to every neuron in the next layer. The high-level features of the input images are represented by the output of the convolutional and pooling layers. The function of the fully connected layer is to use these features to categorize the input image into several classes depending on the training dataset [28].
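A fully connected layer amounts to flattening the pooled feature maps and applying one matrix multiplication plus a bias. In the sketch below, the feature-map shape, the number of classes, and the random weights are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Output of the last pooling layer: 2 x 2 spatial size, 8 feature maps (assumed)
features = rng.normal(size=(2, 2, 8))

flat = features.reshape(-1)                  # flatten to a vector of 32 values
num_classes = 3                              # hypothetical number of classes
W = rng.normal(size=(num_classes, flat.size)) * 0.1
b = np.zeros(num_classes)

# Every input value contributes to every output score
logits = W @ flat + b
print(logits.shape)
```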

3.1.5 Activation function

The activation function is an important component of the CNN model. It enables the network to learn and approximate continuous and complicated relationships between network variables. In simple terms, it determines the neuron's output when an input or set of inputs is given. This output is then used as input to the next node, so the activation function determines which information should be passed forward through the network and which should not [28].

Several activation functions are available depending on the nature of the input values.

The rectified linear unit (ReLU), softmax, tanh, and sigmoid functions are some of the most often used activation functions [27]. Each of these functions has a distinct purpose. The sigmoid and softmax functions are recommended for binary classification, while softmax is commonly used for multi-class classification. However, the gradients of sigmoid and tanh are always non-zero but can become very small, which may be undesirable for training. As the neural network architecture becomes more complex, the gradient signal begins to vanish. This is because the gradient of such functions is roughly equal to zero practically everywhere except near the center.

ReLU is generally the preferred activation function, as it has a constant gradient for positive input. Although the function is not differentiable at zero, this can be ignored in practice. ReLU also generates a sparser representation, since its output is exactly zero for non-positive inputs. ReLU can be computed as

$$\mathrm{ReLU}(x) = \max(0, x) \qquad (4)$$

where $\mathrm{ReLU}(x)$ is the output of the ReLU function: the output is zero when the input $x \le 0$, and it is equal to $x$ when the input $x > 0$.
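The activation functions discussed in this section and their gradients can be compared numerically. The following is a sketch with hand-picked input values; the gradient of ReLU at exactly zero is set to 0 here, which is a common convention rather than a mathematical necessity.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))     # subtract the max for numerical stability
    return e / e.sum()

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2

def relu_grad(x):
    return (x > 0).astype(float)  # 0 at x == 0 by convention

x = np.array([-10.0, 0.0, 10.0])
print("relu        ", relu(x))
print("sigmoid grad", sigmoid_grad(x))  # largest at the center, tiny at +-10
print("tanh grad   ", tanh_grad(x))     # same pattern, decays even faster
print("relu grad   ", relu_grad(x))     # stays 1 for any positive input
print("softmax sum ", softmax(x).sum())
```

The sigmoid and tanh gradients are largest at the center and shrink towards zero for large positive or negative inputs, whereas the ReLU gradient stays at 1 for every positive input.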

3.1.6 Batch normalization

During network training, the distribution of each layer's input varies as the parameters of the preceding layers change. This slows down training, since it requires careful parameter tuning and lower learning rates, and it makes training models with saturating non-linearities challenging. To address this problem, the layer inputs need to be normalized.

This can be done by including normalization for each training mini-batch as part of the model design [30].

Batch Normalization (BN) [28] makes it possible to use considerably higher learning rates while being less concerned about initialization. The main principle behind batch normalization is to insert a normalization layer before each layer of the network, that is, to perform the normalization first and then pass the result to the next layer of the network.

It is a learnable, parameterized network layer and reduces the requirement for dropout in some circumstances.
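A training-time batch normalization step can be sketched as below: each feature is normalized over the mini-batch and then scaled and shifted by learnable parameters, named `gamma` and `beta` here. The small constant `eps` and the input values are assumptions, and the running statistics used at inference time are omitted.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature (column) over the mini-batch (rows),
    then apply the learnable scale `gamma` and shift `beta`."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Mini-batch of 4 samples with 2 features on very different scales
batch = np.array([[1.0, 50.0],
                  [2.0, 60.0],
                  [3.0, 70.0],
                  [4.0, 80.0]])
out = batch_norm(batch, gamma=np.ones(2), beta=np.zeros(2))
print(out.mean(axis=0), out.std(axis=0))
```

With `gamma` set to one and `beta` to zero, both features end up with approximately zero mean and unit standard deviation, regardless of their original scale.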

3.1.7 Dropout layer

The dropout layer is the part of the network used to prevent a CNN from overfitting. Dropout generates a sparse activation from a particular layer, which, interestingly, in turn encourages the network to learn a sparse representation as a side effect. During training, a random sample of the layer's weight parameters is taken with a particular probability, and the resulting sub-network is used as the target network for the current update. If the total network has n parameters, the number of possible sub-networks is 2ⁿ. Furthermore, when n is high, the sub-networks used in each iteration update are unlikely to be repeated, which prevents any particular network from being over-fitted to the training set [28].
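The sketch below shows the widely used "inverted dropout" variant, which masks activations rather than weights and rescales the surviving values so that their expected sum is unchanged. This variant, the drop probability of 0.5, and the all-ones input are assumptions for illustration and are not claimed to be the exact formulation in [28].

```python
import numpy as np

def dropout(activations, p_drop, rng, training=True):
    """Inverted dropout: zero each activation with probability p_drop
    during training and rescale the rest by 1 / (1 - p_drop)."""
    if not training or p_drop == 0.0:
        return activations            # no-op at inference time
    keep = rng.random(activations.shape) >= p_drop
    return activations * keep / (1.0 - p_drop)

rng = np.random.default_rng(0)
a = np.ones(10000)
dropped = dropout(a, p_drop=0.5, rng=rng)

print(np.unique(dropped))             # only 0.0 and 2.0 occur
print(dropped.mean())                 # close to 1.0 on average
```

Each call draws a new random mask, so every training iteration effectively updates a different sub-network.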
