
Computational Engineering and Technical Physics
Computer Vision and Pattern Recognition

Yordanos Alemu

SEAL POSE ESTIMATION USING CONVOLUTIONAL NEURAL NETWORKS

Master's Thesis

Examiners: Professor Heikki Kälviäinen
           Assoc. Prof. Vadim Alexandrovich Onufriev
Supervisors: M.Sc. Ekaterina Nepovinnykh
             D.Sc. Tuomas Eerola
             Professor Heikki Kälviäinen


Lappeenranta-Lahti University of Technology LUT
School of Engineering Science
Computational Engineering and Technical Physics
Computer Vision and Pattern Recognition

Yordanos Alemu

Seal pose estimation using convolutional neural networks

Master’s Thesis 2020

55 pages, 34 figures, 6 tables.

Examiners: Professor Heikki Kälviäinen

Assoc. Prof. Vadim Alexandrovich Onufriev

Keywords: computer vision, pose estimation, Saimaa ringed seal, keypoint detection, convolutional neural network

Conservation of the endangered Saimaa ringed seal population requires an effective approach to re-identify animals, which would allow biologists and ecologists to monitor the population. The ideal approach to ringed seal re-identification would be to automate it using computer vision techniques. Despite recent progress, the re-identification process remains challenging due to, for example, the large pose variation and varying illumination conditions of individual animals in the images. This thesis proposes an approach for seal pose estimation to simplify the subsequent re-identification steps. The seal pose estimation problem is formulated as a seal body keypoint detection task and solved using a stacked hourglass network based method. Moreover, the network is trained with varying hourglass depths using fewer parameters. The obtained results on the Saimaa ringed seal image database show that the method's accuracy in detecting keypoints increases from 65.3% to 76.7% when the depth is increased.


First and foremost, I cannot express enough praises and thanks to God, the Almighty, for his showers of health and grace to lead me this far.

I would like to extend my deepest appreciation and gratitude to my supervisors M.Sc. Ekaterina Nepovinnykh, D.Sc. Tuomas Eerola, and Prof. Heikki Kälviäinen for their insightful comments and supervision throughout the research. I would also like to thank all of my instructors for mentoring me and teaching me the methodology to conduct research and to present research work as plainly as possible.

I would love to extend my heartfelt appreciation to my family and friends who stood by me on a personal and friendly level. This work would not have been possible without their constant love and motivation.

Lappeenranta, June 1, 2021

Yordanos Alemu


CONTENTS

1 INTRODUCTION
  1.1 Background
  1.2 Objectives and delimitation
  1.3 Structure of the thesis

2 WILDLIFE PHOTO IDENTIFICATION
  2.1 Automatic animal re-identification
  2.2 Saimaa ringed seal re-identification

3 CONVOLUTIONAL NEURAL NETWORKS
  3.1 Architecture and building blocks
    3.1.1 Input layer
    3.1.2 Convolutional layer
    3.1.3 Pooling layer
    3.1.4 Fully connected layer
    3.1.5 Activation function
    3.1.6 Batch normalization
    3.1.7 Dropout layer
  3.2 Training

4 POSE ESTIMATION
  4.1 Approaches and taxonomy
  4.2 Human pose estimation
  4.3 Animal pose estimation
    4.3.1 Pose classification
    4.3.2 Skeleton detection
    4.3.3 3D modelling

5 SEAL POSE ESTIMATION
  5.1 Pipeline
  5.2 Data pre-processing
  5.3 Stacked hourglass network
    5.3.1 Residual module
    5.3.2 Hourglass module
    5.3.3 Heatmap distribution
    5.3.4 Intermediate supervision and training

6 EXPERIMENTS
  6.1 Data
  6.2 Evaluation criteria
  6.3 Description of experiments
  6.4 Results

7 DISCUSSION
  7.1 Current study
  7.2 Future work

8 CONCLUSION

REFERENCES


LIST OF ABBREVIATIONS

2D Two Dimensional
3D Three Dimensional
ANNs Artificial Neural Networks
ATRW Amur Tiger Re-identification in the Wild
BN Batch Normalization
CNN Convolutional Neural Network
CV Computer Vision
DNNs Deep Neural Networks
F-CNN Fully-Convolutional Neural Network
LPQ Local Phase Quantization
PCK Percentage of Correct Key-points
PDJ Part Detection Joints
PGCFL Pose-Guided Complementary Features Learning
PPbM Pose Part-based Model
ResNet Residual Neural Network
ReLU Rectified Linear Unit
RGB Red Green Blue
SFTA Segmentation-Based Fractal Texture Analysis
SMAL Skinned Multi-Animal Linear
SVM Support Vector Machine
WS-CDA Weakly- and Semi-Supervised Cross-Domain Adaptation
WWF World-Wide Fund for Nature
YOLO You Only Look Once


1 INTRODUCTION

1.1 Background

Re-identifying endangered species is important for monitoring purposes. It is a crucial task for biologists and ecologists to make sure that these animals are conserved.

The Saimaa ringed seal is an endangered seal species, which requires special attention to prevent extinction. According to the current report of the World-Wide Fund for Nature (WWF) [1], around 410 seals are present in Lake Saimaa, Finland.

Conventionally, wildlife identification has been done manually with the help of invasive methods such as tagging, live trapping, and tissue and blood sampling. However, invasive approaches may pose additional risks to already endangered species, such as environmental instability, and negatively affect their behavior, causing psychological stress, mortality, and reduced wildlife populations on a large scale. Therefore, alternative non-invasive approaches using photo identification have emerged to avoid such problems [2] [3].

Photo identification is a widely used non-invasive technique for tracking and detecting individual animals from their natural markings in images. It supports several applications, including tracking the populations, migration patterns, and general behavior of animal species, and greatly strengthens behavior and population demography studies over time. Photo identification tracks and records the natural markings of species, enabling the preservation of images in a library for eventual cross-matching and generation of captured data. The library can be examined manually to find suitable matches between the individual animals. However, as the number of images in the library grows, a great deal of manual work is required [4]. As a result, automatic animal re-identification tools came to attention.

Automatic re-identification is a non-invasive technique for obtaining valuable information regarding the behavior and population dynamics of endangered species in the wild [2]. Recently, photo identification using automatic camera trapping has brought promising outcomes for re-identifying individual ringed seals, as it captures information about various animal species concurrently and continuously. Camera trapping enables the collection of wildlife images inexpensively, unobtrusively, and regularly, and the images can be processed with computer vision algorithms comprising segmentation, post-processing, and identification steps [5] [6]. Initially, segmentation is often used to locate the seal and erase the background. The segmented image is then post-processed, and the seals are detected and re-identified using a matching-based approach.

However, photo identification using computer vision techniques is challenging for the Saimaa ringed seals. Re-identifying individual ringed seals from images remains a demanding task due to the following reasons [5]:

• As the animals are naturally camouflaged, some areas of the body are hidden (occluded).

• Illumination problems, that is, varying lighting conditions and noise hinder pattern visibility.

• Not all key identifying features are visible due to poor quality of camera traps.

• The shape and pose of the animals vary between images as shown in Fig. 1.

Figure 1. Example photos of the Saimaa ringed seals appearing in different poses.

In addition to these challenges, it is impractical to gather enough image data for each individual ringed seal. Thus, to overcome the above-stated challenges, a clear representation and estimation of the pose of individual ringed seals is required beforehand. It can be assumed that knowing the pose of the seal makes re-identification (automatic photo identification) easier. For example, by analyzing which side of the animal is visible in the image, only the images of the same side in the database of known individuals need to be considered to find the matching individual, which limits the search space. Moreover, being able to represent and estimate the pose more accurately makes it possible to normalize the pattern and allows the matching to be performed by comparing the patterns in the corresponding body parts of the ringed seals.


The pose estimation problem refers to determining the geometrical configuration of an individual or an object in images. There are different methods for estimating animal poses (see Fig. 2), for example, classifying the pose of the animal in the image as right, left, frontal, or back, to mention some.

More advanced approaches include skeleton detection [8] and 3D modeling [9]. Skeleton detection can be done by identifying and mapping keypoints that define the animal pose in the image. 3D pose estimation, on the other hand, determines the 3D spatial arrangement of all keypoints in the image. 3D modeling of the entire body is also performed by automatically constructing 3D textured models of animals from images [9].

Figure 2. Approaches of pose estimation: (a) Pose classification; (b) Skeleton detection; (c) 3D modeling. [7] [9]

This master's thesis is a part of the CoExist research project funded by the South-East Finland - Russia CBC 2014-2020 programme [10]. The project is engaged in developing automatic image-based identification for monitoring and recognizing individual Saimaa ringed seals. Moreover, this can help to accurately measure the population size and density of ringed seals, as well as allow a better understanding of the behavior of ringed seals. Therefore, this work focuses on developing pose estimation approaches for Saimaa ringed seals to make the re-identification process as simple as possible. Consequently, better re-identification accuracy for individual Saimaa ringed seals can be expected.

1.2 Objectives and delimitation

The main goal of this master's thesis is to develop a method to estimate the pose of Saimaa ringed seals. In particular, the objectives of this thesis are as follows:


• To study and to understand different ways of representing and estimating the seal pose based on the studies that have already been done on other animal species.

• To find the most suitable method for Saimaa ringed seals.

• To implement and evaluate the selected method.

Delimitations of the work are as follows:

• Only Saimaa ringed seals are considered.

• The work focuses only on pose estimation; thus, detection, segmentation, and re-identification of the Saimaa ringed seals are not considered.

1.3 Structure of the thesis

The rest of the thesis is arranged as follows: related work on wildlife photo identification and Saimaa ringed seal re-identification is presented in Chapter 2. Chapter 3 describes the overall building blocks and architecture of convolutional neural networks (CNNs). Chapter 4 states the idea behind pose estimation, including human pose estimation, animal pose estimation, and approaches associated with animal pose estimation. Chapter 5 introduces seal pose estimation specifically and the pipeline used for pose estimation. Chapter 6 comprises the experimental part of the work. The results of the experiments are discussed in Chapter 7. The conclusion is given in Chapter 8.


2 WILDLIFE PHOTO IDENTIFICATION

2.1 Automatic animal re-identification

Generally, to address fundamental ecological questions related to animal behavior and population trends, re-identification of individual animals is a vital task. Animal re-identification is used as a tool to automate the assessment of animal populations from camera traps. Several approaches have been studied to identify individual animals of various species in order to automatically gather precise data about their living conditions in their environment [2]. This gives biologists and conservationists a great ability to study the lifespan, migration patterns, and social relationships of the animals and to monitor them in the ecosystem with less effort than the traditional counterparts.

Various methods of animal re-identification have been suggested. Some of them are developed for identifying a specific animal species, whereas others are species-agnostic. In [11], a computer-assisted photographic mark-recapture method, the Wild-ID software, was created to perform pairwise pattern-matching between images and is used for photo-identification. In [12], HotSpotter was adapted as a multi-species animal identification method to recognize giraffes, Grevy's zebras, plains zebras, lionfish, and leopards. HotSpotter's main advantage is that it utilizes viewpoint-invariant descriptors and a scoring technique that highlights the most unique keypoints and descriptors by allowing only the k nearest neighbors of any descriptor to contribute to the scoring.

Deep learning methods using the convolutional neural network (CNN) architecture provide the ability to automatically detect and identify individual animals directly from images (see Fig. 3). CNN is a category of deep learning especially utilized for image processing purposes. The basic idea behind CNNs is to recognize patterns such as image edges and object parts, and to build on this information to identify complete objects, including animals, human beings, and vehicles. The details of CNNs are described in Chapter 3. Nguyen et al. [13] employed a CNN model to train a computational model capable of sorting animal images and recognizing species from wild images captured by camera traps.

Recent studies on animal re-identification have shown encouraging results for identifying animals such as Saimaa ringed seals [5], lemurs [3], pigs [14], primates [15], elephants [16], pandas [17], and tigers [9], to mention some of them.


Figure 3. CNN building blocks. [13]

In [14], three approaches, namely Fisherfaces, the pre-trained face CNN model VGG-Face, and a CNN model with data augmentation, were studied to recognize individual pigs in a farm setting based on face detection methods. The CNN-based approaches were combined with a non-invasive imaging system (see Fig. 4) to re-identify individual pigs from images. Although an accuracy of 96.7% was achieved, the system encountered problems in re-identifying individual pigs from their faces, as the image data are affected by uncontrolled factors such as pose variation, lighting differences, and dirt.

Figure 4. Methods for the identification of pigs on the farm. [14]

Debayan et al. [15] used three image databases, comprising 3,000 face images of 129 lemurs, 1,450 face images of 49 golden monkeys, and 2,109 face images of 24 chimpanzees (see Fig. 5), to recognize individual primates based on primate face detection approaches, utilizing the PrimNet network. Two open source systems, FaceNet and SphereNet, as well as a lemur identification system, were used to assess the success of the applied method. The method achieved 98.7% accuracy for identifying individual primates. However, the authors found that the network did not classify all individual primates accurately because of the poor selection of queries from the image databases of the species, and suggested further improvement. Thus, suitable image datasets are required to tackle these challenges.


Figure 5. Examples from the primate datasets: (a) lemurs; (b) golden monkeys; (c) chimpanzees. Modified from [15].

Körschens et al. [16] introduced an elephant re-identification dataset to identify individual elephants automatically. The dataset consists of 276 elephant individuals in 2078 images with many challenges, such as color variation, aging effects, and occlusion. The You Only Look Once (YOLO) [18] object detector was applied for recognizing the elephant head, an ImageNet-trained network [19] for feature extraction, and a support vector machine [20] for the classification of an elephant image. The proposed method achieved a top-10 accuracy of 80% for identifying individual elephants. However, the system accuracy is not fully satisfactory, as it was confronted with difficulties in recognizing individuals from the introduced elephant dataset, which contains multiple abnormalities.

Brust et al. [17] engaged in developing a method for identifying giant pandas by their face. Similar to [16], they introduced a panda dataset to develop and test panda face recognition algorithms based on deep neural networks. The stated algorithm reached high identification accuracy. However, the algorithm handles only images taken from the frontal panda face direction. Accordingly, images of the panda's face taken from different directions were considered misclassified (see Fig. 6). Quality camera traps capable of capturing images of the face from different orientations are required for better identification of individual pandas.

Shuyuan et al. [9] implemented a baseline method, the pose part-based model (PPbM), for Amur tiger re-identification, inspired by pedestrian and vehicle re-identification approaches. To test the method's performance, they introduced the Amur tiger re-identification (ATRW) dataset. The re-identification methods based on the PPbM model scale easily to the pose variation of the tiger's body. Although the model achieves a top-5 accuracy of 96.6% over the existing re-identification approaches, it remains susceptible to pose estimation errors and occlusion.


Figure 6. Panda identification: (a) correctly identified panda images with frontal face direction; (b) images misclassified because the face was taken from a different direction. [17]

2.2 Saimaa ringed seal re-identification

Early attempts have been made to automatically classify individual Saimaa ringed seals based on computer vision algorithms [21] [6]. The dataset used for this task contained 785 images of 131 individual ringed seals. In both papers, unsupervised segmentation and texture-based superpixel classification approaches were adopted to segment the ringed seals to make the identification process easier. In [21], a mixture of different techniques, including Local Phase Quantization (LPQ) [22] and a support vector machine (SVM) classifier, was applied to get better segmentation outcomes.

Segmentation-based fractal texture analysis (SFTA) [23] was computed by decomposing the binary pixels of the extracted image, and posterior probabilities were applied to get the best identification results. However, the yielded outcomes were not satisfactory; nevertheless, the work later became the cornerstone for further research in [6]. The process of segmentation with the adopted approach is shown in Fig. 7.

Figure 7. Segmentation pipeline. [21]


Chehrsimin et al. [6] continued the study by adding morphological operations as post-processing steps to improve the segmentation process. The performance of the identification method was measured by utilizing previously used approaches, namely HotSpotter [12] and Wild-ID [11], as generic identification methods. These methods require a large set of images of individual ringed seals, including one image of the same individual taken from the same side as the query image. However, only 591 images were used in the experiments, as images that did not meet acceptable quality were eliminated. Despite the lack of a large set of images for each individual ringed seal, HotSpotter yielded promising results. An example of the identification process is shown in Fig. 8.

Figure 8. Identification: left shows correct identification and right shows incorrect identification of ringed seals from a query image. [21]

Nepovinnykh et al. [24] achieved 91% identification accuracy using an AlexNet CNN-based architecture on the Saimaa ringed seal dataset. Two methods for identification were considered: AlexNet CNN-based identification, and CNN-based feature extraction combined with SVM-based classification. The extraction of seal patterns from the background was done in the segmentation process as a pre-processing step before the identification step. During the segmentation process, some unnecessary information, such as the head and tail of the ringed seal, was included in the image; thus, only the subtle variation of the ringed seal body is supposed to be considered in the identification process.

In [5], a method was developed to re-identify individual Saimaa ringed seals based on matching the pelage patterns of individual ringed seals. The steps of re-identification are shown in Fig. 9. The DeepLab model [25] was used to separate the seal from the background. Additionally, two post-processing steps, hole closing and border smoothing, were applied to ensure that the pattern was included in the seal segmentation map. To segment the pattern from the seal's fur, the Sato tubeness [26] technique is applied. To highlight the obtained pattern, histogram normalization, thresholding, morphological opening, and sharpening techniques were used. After that, a Siamese network was trained with a triplet loss function, and a distance metric was used to compute the similarity of the patches using the CNN-based model.

Figure 9. Algorithm for the segmentation. [5]

The ultimate aim of the study [5] was to improve the re-identification of individual ringed seals over existing approaches by predicting unique identifiers for each individual animal from given datasets of known individuals comprising query and gallery images. The similarity of these images is ranked by dividing the images into patches and using heatmap similarity (heatmaps show the similarity between a region of the gallery image and the query image), as shown in Fig. 10; then topology-preserving projections are applied to filter the candidates. Finally, the ranking of candidates is calculated based on the lowest average weight of topologically similar projections.

The proposed re-identification algorithm [5] brought promising results over the existing methods for Saimaa ringed seals (see Fig. 11). However, during the evaluation of the network, the authors were confronted with the following problems:

• Unnecessary noise hinders the network throughout the training process.

• An enormous variation in pose and illumination, such as lighting settings, pattern discernibility, and obscurities, is also inevitable in non-processed images.


Figure 10. Heatmap similarity between regions of gallery and query images of a known individual ringed seal from a large dataset: left indicates the query image and right the gallery image. [5]

Figure 11. Saimaa ringed seal re-identification: yellow indicates query images, green correct identifications, and red misclassified images. [5]


3 CONVOLUTIONAL NEURAL NETWORKS

Artificial neural networks (ANNs), often known as neural networks, are computational models that imitate the neural function of the human brain. This thesis primarily focuses on the convolutional neural network (CNN), which is a particular case of neural networks.

A neural network often involves several processors that work in parallel and are arranged in layers of neurons [27]. Neurons are generally divided into layers, including the input layer, the output layer, and possibly many hidden layers. The units (nodes) are connected to each other, and each connection has a specific weight and bias. The weight associated with each link reflects the strength of the connection between the units. The input layer receives the original input information.

Figure 12. Neural network architecture. [27]

Each neuron receives a weighted version of the inputs, which is combined with a constant bias value (unique to each neuron layer) and then passed to a proper activation function, which determines the neuron's final output value. Once the final neural network layer's output is created, the loss function is determined, and backpropagation is applied to change the weights to minimize the loss [27]. The main task of the neural network is to estimate a function f which takes a data sample x as input and assigns it to its corresponding label y. Here, the purpose is to learn a function that can be computed as

\hat{Y} = f(x) \qquad (1)

where \hat{Y} is the output tensor of categories, f is a learnable function, and x is a sample of input data. The function f is parameterized by a set of parameters \theta, which are generally learned from a collection of data for which the labels are defined. During the training phase, the network makes a prediction and incurs a cost, or loss, based on the correct result [28]. The backpropagation algorithm is utilized when the network needs adjustment based on the loss; it updates the individual neurons in the network and therefore enables a better prediction. The backpropagation process begins at the network's end, with a single loss value depending on the output, and updates neurons in reverse order, with the neurons at the network's beginning updated last.

3.1 Architecture and building blocks

CNN is a multi-layer deep neural network model featuring high fault tolerance, self-learning, and parallel processing capabilities [28]. It is a multi-layer perceptron designed to process 2D images, with local connections and weight sharing in a topology similar to a biological neural network. The primary assumption of CNNs is to detect complete objects by using convolution filters to identify image edges, object parts, and patterns, and to build on this knowledge. The neurons in a CNN are organized in a three-dimensional structure, with each group of neurons evaluating a small region or an image feature [29].

A CNN architecture is typically composed of distinct building blocks that transform input data into output data [28]. These include the convolutional layer, the pooling layer, and the fully connected layer (see Fig. 13). Feature extraction is performed by both the convolutional and pooling layers, whereas the fully connected layer transforms the extracted features into the final output, such as a classification or object detection result.

3.1.1 Input layer

The input layer, which can handle multi-dimensional data, represents the image input to the CNN. Every image is a matrix of pixel values. In most cases, the input layer of a two-dimensional convolutional neural network receives two-dimensional data. The inclusion of different color channels in colored images, particularly RGB (Red, Green, Blue) images, adds an extra depth field to the data and makes the input three-dimensional.


Figure 13. Architectural design of a CNN.

3.1.2 Convolutional layer

Convolutional layers are the foundation of CNNs because they contain learned kernels (weights) that extract features that allow different images to be distinguished from one another. In the convolutional layer, the input image is convolved using small matrices (for example, 3×3, 5×5, or 7×7) called kernels or filters. A weight coefficient and a bias vector correspond to each convolution kernel element, similar to the neurons of a network [28]. The convolution operation can be expressed as

f_l^k(p, q) = \sum_{c} \sum_{x, y} i_c(x, y) \cdot e_l^k(u, v) \qquad (2)

where f_l^k(p, q) refers to a component of the feature map, c is the channel index, i_c(x, y) is a component of the input tensor for channel c, and e_l^k(u, v) is a component of the k-th kernel of layer l.
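For illustration, Eq. (2) can be written directly in NumPy. The sketch below is a naive loop implementation of a single kernel without bias (strictly speaking, cross-correlation, as commonly implemented in deep learning frameworks); it is illustrative only and not part of the thesis implementation.

```python
import numpy as np

def conv2d(image, kernel):
    """Naive valid convolution of a multi-channel image with one kernel,
    following Eq. (2): the output at (p, q) sums over channels c and over
    the kernel window."""
    C, H, W = image.shape          # channels, height, width
    _, kh, kw = kernel.shape       # kernel has the same channel count
    out = np.zeros((H - kh + 1, W - kw + 1))
    for p in range(out.shape[0]):
        for q in range(out.shape[1]):
            out[p, q] = np.sum(image[:, p:p + kh, q:q + kw] * kernel)
    return out

# toy usage: one 3x3 kernel over a 3-channel 8x8 image
feature_map = conv2d(np.random.rand(3, 8, 8), np.random.rand(3, 3, 3))
print(feature_map.shape)  # (6, 6)
```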

3.1.3 Pooling layer

Feature extraction is done in the convolutional layer, and the pooling layer receives the output feature map for feature selection and information filtering. The pooling layer typically performs a maximum or average operation on the output of the convolutional layers to preserve the most relevant information while shrinking the spatial size of the output [13].


Max pooling is one of the best-known pooling operations; it extracts patches from the input feature maps, outputs the maximum value in each patch, and discards all the other values. A max pooling with a kernel size of 2 × 2 and a stride of 2 is usually utilized in practice. This downsamples the in-plane dimensions of the feature maps by a factor of two.

The operation of the pooling layer can be expressed as

Z_l^k = g_p(F_l^k) \qquad (3)

where l and k denote the layer and kernel indices, respectively; Z_l^k represents the pooled feature map of the l-th layer for the k-th input feature map F_l^k, and g_p denotes the type of pooling operation.
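A minimal NumPy sketch of the 2 × 2 max pooling with stride 2 described above (illustrative only):

```python
import numpy as np

def max_pool2d(feature_map, size=2, stride=2):
    """2x2 max pooling with stride 2: keeps the maximum of each patch
    and halves the in-plane dimensions."""
    H, W = feature_map.shape
    out_h, out_w = (H - size) // stride + 1, (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = feature_map[i * stride:i * stride + size,
                                j * stride:j * stride + size]
            out[i, j] = patch.max()
    return out

pooled = max_pool2d(np.random.rand(64, 64))
print(pooled.shape)  # (32, 32)
```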

3.1.4 Fully connected layer

The fully connected layer of a convolutional neural network is equivalent to the hidden layer in a classic feedforward neural network: every neuron in the preceding layer is connected to every neuron in the next layer. The high-level attributes of the input images are represented by the output of the convolutional and pooling layers. The function of the fully connected layer is to use these attributes to categorize the input image into several classes depending on the training dataset [28].

3.1.5 Activation function

The activation function is an important component of the CNN model. This function is used to learn and estimate any continuous and complicated relationship between network variables. In simple words, it determines the neuron's output when an input or set of inputs is given. This output is then utilized as input to the next node and determines which information should be sent forward through the network and which should not [28].

Several activation functions are available depending on the nature of the input values. The rectified linear unit (ReLU), softmax, tanh, and sigmoid functions are some of the most often utilized activation functions [27]. Each of these functions has a distinct purpose. The sigmoid and softmax functions are recommended for a binary classification CNN model, while softmax is commonly utilized for multi-class classification. However, the gradients of sigmoid and tanh saturate, which may be undesirable for training. As the neural network architecture becomes more complex, the gradient signal begins to vanish. This is due to the fact that the gradient of such functions is roughly equal to zero practically everywhere except near the center.

ReLU is the most preferred activation function, as it has a constant gradient for positive inputs. Although the function is not differentiable at zero, this can be ignored in practice. ReLU generates a sparser representation, as zero inputs yield exactly zero outputs. ReLU can be computed as

\mathrm{ReLU}(x) = \max(0, x) \qquad (4)

where the output is zero when the input x \leq 0 and equal to x when x > 0.
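In code, Eq. (4) amounts to a single NumPy call:

```python
import numpy as np

def relu(x):
    """Eq. (4): zero for negative inputs, identity for positive inputs."""
    return np.maximum(0, x)

print(relu(np.array([-2.0, 0.0, 3.5])))  # [0.  0.  3.5]
```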

3.1.6 Batch normalization

During network training, the distribution of each layer's inputs changes as the parameters of the preceding layers change. This makes training take longer, since it requires careful parameter tuning and lower learning rates, and it makes training models with saturating non-linearities challenging. To address this problem, the layer inputs need to be normalized. This can be done by including normalization for each training mini-batch as part of the model design [30].

Batch normalization (BN) [28] enables the use of considerably higher learning rates with less concern about initialization. The main principle behind batch normalization is to insert a normalization layer at the entry of each network layer, that is, to perform the normalization first and then pass the result to the next layer of the network. It is a learnable, parameterized network layer and reduces the need for dropout in some circumstances.
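A minimal sketch of the training-time batch normalization computation described above, with the learnable scale (gamma) and shift (beta); running statistics for inference are omitted:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize a mini-batch per feature, then apply the learnable
    scale (gamma) and shift (beta)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

batch = np.random.rand(32, 128)  # 32 samples, 128 features
out = batch_norm(batch, np.ones(128), np.zeros(128))
```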

3.1.7 Dropout layer

The dropout layer is used to prevent the CNN from overfitting. Dropout generates a sparse activation in a particular layer, which, interestingly, in turn promotes the network to learn a sparse representation as a side effect. During training, a random sample of the layer's parameters is kept with a particular probability, and these sub-networks are used as the target network for each update. It can be assumed that if the total network has n parameters, the number of possible sub-networks is 2^n. Furthermore, when n is high, the sub-networks used in each iteration update are not repeated, thereby preventing a particular network from being over-fitted to the training set [28].
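The mechanism can be sketched as inverted dropout in NumPy, where the surviving activations are rescaled by 1/(1 − p) so the expected activation stays unchanged (the rescaling is an assumption of the common formulation, not a detail stated in the text):

```python
import numpy as np

def dropout(x, p=0.5, training=True):
    """Inverted dropout: randomly zero activations with probability p
    during training, rescaling the rest; identity at inference time."""
    if not training:
        return x
    mask = (np.random.rand(*x.shape) >= p) / (1.0 - p)
    return x * mask
```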

3.2 Training

Network training is carried out by acquiring kernels and weights in the convolutional and fully connected layers that reduce the gap between the output predictions and the provided ground truth labels on a training dataset [27]. The parameters of the network are learned from a collection of training data.

The cost, also known as the loss function, determines the parameters that are appropriate for the training data by assessing the error that the network produces while predicting the training samples. The choice of a proper cost function is mostly related to the task the neural network is used for. The aim of the learning process is to reduce the cost with respect to the parameters of the network. Therefore, during training, the following points are considered:

• Loss function: also considered a cost function, it assesses the similarity between the network's output predictions, obtained through forward propagation, and the supplied ground truth labels. Cross-entropy is a popular loss function for multi-class classification [27]. In contrast, mean squared error is commonly used in regression to continuous values.

• Gradient descent: frequently used as an optimization technique for minimizing the loss by iteratively adjusting the network's learnable parameters, such as weights and biases. The gradient of the loss function indicates the direction of the steepest rate of increase, and every trainable parameter is adjusted with a step size specified by the learning rate in the negative direction of the gradient [30] [28].

The gradient of the loss function is computed using the backpropagation algorithm. Backpropagation propagates the neural network's total error through the network's connections layer by layer and calculates the gradient of each weight and bias in each layer. Then the gradient descent algorithm optimizes the weights and biases to minimize the neural network's total error [28]; a minimal sketch of one such update is shown after this list.

(24)

• Datasets and ground truth labels: these components also greatly influence the performance of network training. Effective training and model evaluation require careful collection of data and ground truth labels.
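To make the gradient descent update referred to above concrete, the following toy NumPy sketch fits a one-variable linear model with a mean squared error loss; the two gradient lines stand in for what backpropagation computes layer by layer in a deep network (illustrative only, unrelated to the thesis implementation):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])   # ground truth: y = 2x
w, b, lr = 0.0, 0.0, 0.1        # parameters and learning rate

for _ in range(1000):
    y_pred = w * x + b
    grad_w = 2 * np.mean((y_pred - y) * x)  # dL/dw for MSE loss
    grad_b = 2 * np.mean(y_pred - y)        # dL/db for MSE loss
    w -= lr * grad_w                         # step against the gradient
    b -= lr * grad_b

print(round(w, 2), round(b, 2))  # w -> 2.0, b -> 0.0
```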


4 POSE ESTIMATION

4.1 Approaches and taxonomy

This chapter aims to present a general overview of pose estimation and its approaches.

Traditional techniques and recent advances are discussed in the context of animal pose estimation. Pose estimation is a substantial research topic in the area of computer vision [31]; it refers to determining the geometrical configuration, posture, and orientation of an object, a human, or an animal in an image or video frame. With pose estimation, it is easy to locate objects or persons in real-world space. Pose estimation is valuable for accurate modelling, particularly on datasets with large pose variations [7]. It typically predicts the accurate locations of keypoints related to an object or a person.

Pose estimation can be divided into two-dimensional (2D), three-dimensional (3D), and dense pose estimation based on the dimensionality of the output, as shown in Fig. 14. 2D pose estimation merely determines the positions of keypoints in the 2D image, that is, for each keypoint the model predicts X and Y coordinates. 3D pose estimation, on the other hand, estimates the 3D spatial arrangement of all keypoints in an image [31].

Figure 14. Pose estimation approaches: (a) Two dimensional, (b) Three dimensional, (c) Dense. [31]

Discriminative and generative approaches are the two categories of 3D pose estimation [32]. Lifting the prediction of an object in a 2D image into 3D by adding a z-dimension is regarded as discriminative pose estimation, whereas end-to-end models that directly predict the 3D pose are known as generative pose estimation. Dense pose estimation produces dense correspondences between a 2D image and a 3D, surface-based representation of an object or person. It works on videos and is used to record the locations and soft tissue of humans and other animals [31].


Depending on the number of objects or persons to track, pose estimation can also be divided into single and multi-pose estimation approaches [33]. As shown in Fig. 15, one person or object is detected and tracked by single pose estimation methods, while multi-pose estimation approaches detect and track several people or objects.

Figure 15. Pose classification: (a) Single person pose estimation, (b) Multi-person pose estimation. [33]

The research on human pose estimation started to move from traditional approaches to deep learning with the implementation of DeepPose [34]. Most deep learning approaches to pose estimation have universally comprised CNNs as their key building block, essentially eliminating graphical models; this approach has resulted in substantial improvements on traditional benchmarks.

4.2 Human pose estimation

Today, major studies on pose estimation problems revolve around human pose estimation [34]. Even though it is a difficult task, due to the development of deep learning methods its applications have gained popularity in different areas such as gaming, virtual reality, sports, security analysis, video surveillance, and medical assistance. A human pose estimate denotes the positions and orientations of a set of keypoints, such as the head, the middle of the body, the left or right knee, and the left or right shoulder in the image [35]. Essentially, the keypoints are connected to define the individual's pose. Human pose estimation is illustrated in Fig. 16.

There are common methods utilized in estimating the poses of individuals from images. The simplest is known as the top-down approach: a human detector is applied first, followed by estimating the body parts and then the pose for each detected person. Another approach is to detect all body parts in the image first and then group the parts belonging to each individual. This method is known as the bottom-up approach [33].


Figure 16. Human pose estimation. [35]

Generally, the core work of human pose estimation is simplified into two procedures: locating human body keypoints and organizing these keypoints into correct human pose configurations, as seen in Fig. 16.

Human pose estimation has progressed and improved tremendously with the aid of deep learning. DeepPose [34] was the first implementation of deep neural networks (DNNs) for estimating the human pose. The authors suggested the DeepPose model in order to formulate the pose problem as a keypoint regression issue. The position of each body joint was regressed by DNN-based regressors based on the full image. Further, DNN-based pose predictors were used for predicting joint localization. The DeepPose model is shown in Fig. 17.

Figure 17. The DeepPose model. [35]

The stacked hourglass model [30] is a deep learning model for human pose estimation. Its architecture consists of different layers, including:

• Convolutional layers are responsible for extracting features from the image.

• MaxPooling layers remove parts of the image that are unnecessary for feature extraction.

(28)

• Residual layers pass features deeper into the network.

• Bottleneck layers free up memory by using less computationally intensive convolutions.

• Upsampling layers increase the size of the input.

In general, the hourglass module is an encoder-decoder architecture where the input features are initially downsampled and then upsampled to retrieve the information and form a keypoint heatmap. Each encoder layer has a link to its decoder equivalent, and the modules can be stacked as many times as needed. The stacked hourglass architecture is shown in Fig. 25.

PoseNet [36] is another deep convolutional neural network model used for human pose estimation by detecting keypoints from images or videos [37]. These keypoints are scored by a confidence value between 0 and 1, with 1 being the highest. The PoseNet model is invariant to the image size, so it can estimate the position in the real image regardless of whether the image has been scaled.

In PoseNet [36], a sequence of fully connected layers substitutes the softmax layer. The encoder is the first part of the network architecture; it produces an encoding vector, a 1024-dimensional vector that encodes the input features. The localizer is the second part of the network and produces a matrix that denotes a localization function. The last part is a regressor, composed of two connected layers, that is used to regress the final pose. The PoseNet architecture is shown in Fig. 18.

Figure 18. The PoseNet model. [37]

4.3 Animal pose estimation

Animal pose estimation is the process of recognizing the animal's posture. Hypothetically, pose estimation is a prerequisite task for both human and animal re-identification, since both tasks are confronted with image retrieval problems. Animal pose estimation is challenging because animals generally exhibit a diverse range of pose variation. Early studies show that animal re-identification approaches draw mainly on human re-identification, which has produced good research results [38] [39].

Approaches to animal pose estimation are the same as for human pose estimation. They include 2D and 3D pose estimation, dense representation, and skeleton detection. However, the scarcity of annotated datasets is one of the key problems in applying these methods to animal pose estimation. Human pose estimation models are generally trained on large image datasets, and the existing network models have solutions customized to human body keypoints. Therefore, to apply these approaches directly to animal data, it is necessary to annotate new datasets [31] [38].

4.3.1 Pose classification

Re-identification of individual animals depends on the visibility of the pattern. The pattern enabling re-identification typically varies between sides. Therefore, it is important to know in advance which side of the animal is visible. Identifying the visible side of an individual animal is related to the pose classification task and can be formulated as an image classification problem. There are different categories of pose classification (see Fig. 2 (a)), for example, classifying the orientation of the animal in the image as right, left, frontal, or back, to mention some of them.

In [40], tiger poses are classified into right view and left view, as the stripes of individual tigers vary between sides, and the classified pose is used to guide the re-identification feature learning procedure as a supplementary task (see Fig. 19).

The CNN architecture indirectly incorporates the advantages of traditional neural network training with the convolution procedure [13]. For pose classification purposes, the network receives an image with the pose of a person or an object as input and attempts to correctly recognize the different poses in order to predict them precisely. The implementation of Mathis et al. [41] was based on the DeeperCut model [42]. DeeperCut is a model based on the residual neural network (ResNet) image classification architecture [43].


Figure 19. Pose-guided classification of Amur tigers. The right and left sides of an individual tiger have different stripe patterns. Right view II of tiger A indicates that, due to varying illumination conditions, the appearance of the same side of one tiger can look different. [40]

4.3.2 Skeleton detection

Skeleton detection reflects the animal's pose in a graphical format. A skeleton is a set of joints that can be connected to define the pose of the animal. Each coordinate in the skeleton is recognized as a part of the skeleton, or keypoint [38]. Most approaches to skeleton detection deal with the joints or keypoints of the animals. A sample animal pose skeleton is shown in Fig. 20.

Figure 20. Keypoint detection: limbs highlighted in color. [38]

The first use of CNNs for animal pose estimation was presented in [41]. The method was applied to estimate a single animal pose or multiple poses utilizing a fully convolutional neural network (F-CNN), an encoder-decoder model [29]. Essentially, these models emerged from the literature on human pose estimation and are utilized to estimate animal posture. The network is trained to convert images into probabilistic approximations of joint locations, known as confidence maps. These confidence maps are analyzed to generate 2D feature points for each keypoint.

Cao et al. [38] used a weakly- and semi-supervised cross-domain adaptation (WS-CDA) structure to estimate the poses of unseen animals with domain adaptation based on a human pose-labeled dataset and a box-labeled and pose-labeled animal dataset. The WS-CDA model contains three components: the feature extractor, the domain discriminator, and the keypoint estimator. The feature extractor extracts features, based on which the domain discriminator aims to determine the domain of the input data and the keypoint estimator calculates the keypoints. The discriminator helps the network, together with the keypoint estimator, to fit training data from distinct domains.

Zhang et al. [44] proposed omni-supervised joint detection and pose estimation for wildlife. In the introduced wild dataset, kangaroos appear in multiple poses. To tackle the problem, they developed CNN-based recognition and pose estimation techniques to annotate kangaroo images based on the keypoints of the human body.

In [40], a part-pose guided network for the re-identification of tigers was developed. Based on the pose skeleton annotation, the method captures the details of the whole tiger body, including the head, hind thighs, left and right front legs, and trunk.

Li et al. [45] presented a deep cascaded convolutional approach to estimate cattle poses from keypoints (see Fig. 21). They used a set of joints adapted from the human body. The method comprises three models, namely the stacked hourglass, the convolutional pose system, and the convolutional heatmap regression models. Adequate pose prediction accuracy was obtained with the hourglass model.

Figure 21. Deep cascaded convolutional model. [45]


4.3.3 3D modelling

3D modeling refers to the 3D locations of the physical coordinates of an object in an image or video. 3D modeling, or 3D detection, is used for estimating the pose of a subject and for action classification. In real-world applications, recovering a 3D pose from a 2D image is a very demanding task, because the captured image is associated with different variations in the background, occlusions, pose, illumination, and camera parameters.

In [46], an approach for 3D modeling from a collection of labeled image data and a 3D model template was presented. The approach operates based on local rigidity, which determines how much each face of the mesh can deform. Fangbemi et al. [47] introduced synthetic training data to approximate the 2D and 3D pose of each joint of the animal skeleton using the ZooBuilder pipeline. The pipeline still shows some limitations when exposed to complex videos subject to occlusions and low contrast.

To estimate articulated 3D models of the generic animal shape and location, the Skinned Multi-Animal Linear (SMAL) model [46] was adopted. In [48], the authors showed that images integrated with this model provide a better estimation outcome than images not included in the model. In [7], the model was used to approximate the 3D posture and shape of Grevy's zebras from a single image. The model is shown in Fig. 22.

Figure 22. 3D pose estimation of Grevy's zebra with the SMAL model. [7]


5 SEAL POSE ESTIMATION

This chapter presents seal pose estimation based on keypoint detection using a stacked hourglass network. The selection of the approach is motivated by the effectiveness of the model, as it reprocesses the whole-body keypoint information of the seal to improve the accuracy of detecting individual keypoints. The model handles a varied and challenging set of seal poses with a simple technique for evaluating initial predictions.

5.1 Pipeline

A crucial step toward seal pose estimation from images is to precisely determine the pixel locations of important keypoints on the seal body; the final pose estimation requires a consistent understanding of the overall seal keypoints [45]. This requires the localization of keypoints in challenging, uncontrolled situations, regardless of the pose. In short, the task is to localize and estimate seal joints from images. The main difficulty lies in reducing the complexity of the model while being able to adapt to various changing conditions, such as a seal in a complex pose. The pipeline for the seal pose estimation is shown in Fig. 23.

Figure 23. Pipeline of the proposed approach.


The pipeline includes the following steps:

• Data preprocessing: in order to achieve the proposed task and make the input images ready for the model, the images are preprocessed. This includes image resizing, segmentation, and tone mapping steps.

• Stacked hourglass network implementation: the stacked hourglass network is applied for keypoint detection.

• Post-processing: based on the results obtained from the initial experiments, the input data is post-processed. This includes preparing input data with visible keypoints for further experiments to evaluate how well the model detects the keypoints given the ground truth data.

5.2 Data pre-processing

According to the requirements of convolutional neural network models, the original images have to be preprocessed. This includes changing the input data to make it suitable for model training. To simplify the representation and make it more useful for analysis and interpretation, it is important to segment the seal from the background, because the quality of the original seal images is low due to varying illumination conditions, which could affect the performance of the network in predicting keypoints accurately.

The seal is segmented from the background using the DeepLab model, as proposed in [25]. Tone mapping has also been shown to affect the segmentation result; it reduces the large contrast of the scene light to a displayable range while maintaining image details. Perceptual contrast processing is used to eliminate the varying lighting conditions and color differences caused by different cameras, and contrast mapping and contrast equalization using histogram equalization are applied to adjust the dynamic range of the image, as proposed by Mantiuk et al. [49].

Image resizing is another important factor in this step. First, a square area containing the seal body is cropped from the original image. Then the standard input data is obtained by scaling the image to 256×256. An example of the data preprocessing is shown in Fig. 24.


Figure 24. Data preprocessing.
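A minimal sketch of the cropping and resizing step using Pillow; the file name and crop box below are hypothetical placeholders, and segmentation and tone mapping are assumed to happen separately:

```python
from PIL import Image

def preprocess(path, box):
    """Crop a square region containing the seal (box = (left, upper,
    right, lower), e.g., derived from the segmentation mask) and scale
    it to the 256x256 network input size."""
    img = Image.open(path).convert("RGB")
    return img.crop(box).resize((256, 256))

# hypothetical usage
patch = preprocess("seal.png", box=(120, 80, 620, 580))
```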

5.3 Stacked hourglass network

The stacked hourglass network is a cascaded CNN architecture built from several stacked hourglass modules. The stacked hourglass network introduced by Newell et al. [30] (see Fig. 25) was specifically presented for the task of human pose estimation. The network adapted ideas from residual networks [43], convolution-deconvolution architectures [37], and intermediate supervision of the network during training. Instead of regressing a vector of joint locations, the hourglass generates heatmaps, which represent the likelihood of each spatial location containing a specific joint. The heatmaps assign a Gaussian distribution around the target points. This way, the network prioritizes a local approximation over a global estimate and can learn the variations in the appearance surrounding the joints. This minimizes the average error with respect to the exact target locations.

Figure 25. The hourglass model. [30]

The preprocessing stage followed by the network consists of downsampling the image to match the input size of the hourglass [30]. In particular, the network starts with a 7 × 7 convolutional layer with stride 2 and 64 feature maps. The output of this layer is 128 × 128 × 64. Then, a residual layer using the bottleneck, where the skip connection is replaced by a 1 × 1 filter, is used to bring the number of feature maps from 64 to 128. Then, a max pooling layer brings the resolution from 128 × 128 down to 64 × 64. Finally, two more residual layers with 1 × 1 filters are used to bring the number of feature maps to 256. The first part of the network preprocessing is shown in Table 1.

Table 1. Preprocessing step performed by the network.

Layer Kernel Stride Padding Output

Input image 256 x 256 x 3

Convolution 7 2 3 128 × 128 × 64

Convolution 3 1 0 128 × 128 × 128

Max pooling 2 2 0 64 × 64 × 128

Convolution 3 1 0 64 × 64 × 256
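Table 1 can be mirrored by the following PyTorch sketch. The residual modules are simplified to plain 3 × 3 convolutions, and padding of 1 is assumed for them so that the spatial sizes of Table 1 are preserved; this is an interpretation of the table, not the thesis code:

```python
import torch
import torch.nn as nn

# Preprocessing stem following Table 1 (256x256x3 -> 64x64x256).
stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),    # 128x128x64
    nn.BatchNorm2d(64), nn.ReLU(inplace=True),
    nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1),  # 128x128x128
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, stride=2),                   # 64x64x128
    nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1), # 64x64x256
)

x = torch.randn(1, 3, 256, 256)
print(stem(x).shape)  # torch.Size([1, 256, 64, 64])
```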

5.3.1 Residual module

The residual modules are used in the pre-processing stage of the network; they retrieve higher-level features while retaining the original level of information [45]. The image size is not changed during this process, only the data depth. The module can therefore be considered a size-preserving advanced convolutional layer. The residual module comprises two pipelines: one consists of convolutional layers with batch normalization and ReLU, where the convolution kernels have varying sizes; the other is a skip connection from input to output with a 1×1 convolution. The output of this pre-processing step is of size 64 × 64 × 256, which is forwarded to the hourglass.
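A possible PyTorch rendering of the residual module described above, assuming a standard bottleneck layout with half the output channels in the middle branch (an assumption, as the exact channel split is not stated):

```python
import torch
import torch.nn as nn

class Residual(nn.Module):
    """Bottleneck residual module: a main branch of convolutions with batch
    normalization and ReLU, plus a 1x1 skip branch; the spatial size is
    preserved and only the channel depth changes."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        mid = out_ch // 2
        self.main = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, mid, kernel_size=1),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, kernel_size=3, padding=1),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, out_ch, kernel_size=1),
        )
        self.skip = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.main(x) + self.skip(x)

y = Residual(128, 256)(torch.randn(1, 128, 64, 64))
print(y.shape)  # torch.Size([1, 256, 64, 64])
```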

5.3.2 Hourglass module

The stacked hourglass network yields its outputs as a collection of heatmaps, where each heatmap represents the probability of a keypoint at each pixel. Each stacked network layer includes the basic hourglass module, an optimal number of residual modules, a convolutional layer, batch normalization, ReLU, and another convolutional layer. The network produces one 64×64 heatmap for each of the localized keypoints, i.e., the number of keypoints gives the number of output channels [30].

The main task of the hourglass module is to obtain information at various scales, because local features are needed when capturing different body parts, while global information is needed to finally predict the pose. In order to capture image features at different scales, a common method is to use multiple pipelines to process different scales of information and then combine these features in the last part of the network. The method used by the authors [30] is to use skip layers, so that a single pipeline can store spatial information at each scale.
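The recursive structure of the hourglass module can be sketched in PyTorch as follows; a plain 3 × 3 convolution stands in for the residual module of the previous sketch to keep the example self-contained:

```python
import torch
import torch.nn as nn

def Residual(in_ch, out_ch):
    # Stand-in for the bottleneck residual module of the previous sketch.
    return nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

class Hourglass(nn.Module):
    """Recursive hourglass: each level downsamples, recurses (or applies a
    bottom module at the deepest level), upsamples, and adds a skip branch
    kept at the original resolution."""
    def __init__(self, depth, ch):
        super().__init__()
        self.skip = Residual(ch, ch)   # full-resolution branch
        self.down = nn.MaxPool2d(2, 2)
        self.low1 = Residual(ch, ch)
        self.low2 = Hourglass(depth - 1, ch) if depth > 1 else Residual(ch, ch)
        self.low3 = Residual(ch, ch)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")

    def forward(self, x):
        low = self.low3(self.low2(self.low1(self.down(x))))
        return self.skip(x) + self.up(low)   # combine the two scales

out = Hourglass(depth=4, ch=256)(torch.randn(1, 256, 64, 64))
print(out.shape)  # torch.Size([1, 256, 64, 64])
```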

5.3.3 Heatmap distribution

The output calculated by the hourglass module to create the heatmaps is passed to the remaining layers and delivered via a set of 1×1 convolutional layers. The first layer keeps the depth at 256, and the second layer reduces the depth to the number of heatmaps. A separate ground truth heatmap is created for each keypoint location in the training network, and a 2D Gaussian map is generated based on the keypoint label. If a keypoint is not annotated on the image, or if the specified keypoint is not visible, the heatmap is set to 0 for the entire image [30] [45].
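The ground-truth heatmap construction described above can be sketched in NumPy; the value of sigma is an assumption, as the text does not specify it:

```python
import numpy as np

def gaussian_heatmap(size, center, sigma=1.5):
    """Ground-truth heatmap for one keypoint: a 2D Gaussian around the
    keypoint location, or all zeros if the keypoint is not visible."""
    if center is None:                 # unannotated / invisible keypoint
        return np.zeros((size, size))
    xs, ys = np.meshgrid(np.arange(size), np.arange(size))
    cx, cy = center
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

hm = gaussian_heatmap(64, center=(20, 35))
print(hm.shape, hm.max())  # (64, 64) 1.0
```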

5.3.4 Intermediate supervision and training

The network consists of multiple hourglass modules stacked one after another and linked with intermediate supervision by adding a loss function at the end of each hourglass. To train the network, the hourglasses are therefore stacked so that the output of one hourglass serves as the input to the next one.

The output of the first hourglass splits in two directions before being added together element-wise again before the second hourglass. One path goes through two residual units and the other one generates a heatmap of the keypoints. This scheme is referred to as intermediate supervision and provides a way to enhance the performance of the network as a whole. The illustration of intermediate supervision is shown in Fig. 26.
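Combining the sketches above, the stacking with intermediate supervision could look roughly like this. The 1 × 1 remapping layers and the per-stack MSE loss follow the description in [30], while the helper names (`stem`, `hourglass`, `bottleneck`) are assumptions carried over from the earlier sketches:

```python
def stacked_hourglass(num_stacks=8, num_keypoints=5, depth=4):
    inputs = tf.keras.Input(shape=(256, 256, 3))
    x = stem(inputs)  # 64x64x256 features from the Table 1 preprocessing
    outputs = []
    for s in range(num_stacks):
        y = hourglass(x, depth)
        y = layers.Conv2D(256, 1, activation="relu")(y)
        heatmaps = layers.Conv2D(num_keypoints, 1)(y)  # one map per keypoint
        outputs.append(heatmaps)                       # supervised at every stack
        if s < num_stacks - 1:
            # remap features and heatmaps back to 256 channels and add them
            # element-wise to the input of the next hourglass
            x = layers.Add()([x,
                              layers.Conv2D(256, 1)(y),
                              layers.Conv2D(256, 1)(heatmaps)])
    return tf.keras.Model(inputs, outputs)

# the same MSE loss is attached to each intermediate heatmap output,
# which implements the intermediate supervision
model = stacked_hourglass()
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=2.5e-3),
              loss="mse")
```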

The network takes an input image of size 256 × 256 × 3 and produces a series of K heatmaps, each of size 64 × 64, so that the network output is 64 × 64 × K. In the original human pose setting, the network was designed to generate 16 heatmaps, one per joint. One heatmap per joint is used, rather than a single map that aggregates all joints, because this configuration handles occlusions much more efficiently. Once the heatmaps have been generated, the coordinate of the maximum element in the heatmap corresponding to each joint is computed and scaled back to the input resolution as the final predicted keypoint position.

Figure 26. The illustration of intermediate supervision. The blue box indicates the loss function at the end of each stacked hourglass. [30]
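Extracting the final keypoint positions from the heatmaps then amounts to an argmax per channel followed by scaling back to the input resolution; a minimal sketch:

```python
def heatmaps_to_keypoints(heatmaps, input_size=256):
    # heatmaps: array of shape (64, 64, K)
    h, w, k = heatmaps.shape
    scale = input_size / w  # 256 / 64 = 4
    keypoints = []
    for j in range(k):
        idx = np.argmax(heatmaps[:, :, j])  # flat index of the maximum
        y, x = divmod(idx, w)               # back to 2D row/column coordinates
        keypoints.append((x * scale, y * scale))
    return keypoints
```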


6 EXPERIMENTS

This chapter describes the experimental part of the thesis, which evaluates the performance of the stacked hourglass model for seal pose estimation based on the keypoint detection approach. The experiments were done on the selected Saimaa ringed seal image dataset.

First, the data used for the formulated task is discussed, together with the criteria used to evaluate the results produced by the model. Then the experiments are described. Finally, post-processing and evaluation are discussed.

6.1 Data

The dataset used for this work is the Saimaa ringed seal image database, gathered by the Saimaa ringed seal research group at the University of Eastern Finland (UEF) [50]. This database includes 2134 original images with different seal poses and variations; a subset of the database is used for training. The majority of the images contain a single individual Saimaa ringed seal, while a few contain two or more individuals. The training subset contains 600 images, covering both the training and validation sets as required by the network. Samples of the Saimaa ringed seal images are shown in Fig. 27.

Figure 27. Examples from the Saimaa ringed seal dataset.


The images were manually annotated with five seal body keypoints: 1) muzzle, 2) top of the head, 3) tip of the left fore flipper, 4) tip of the right fore flipper, and 5) tail, using open-source annotation tools (see Fig. 28). All of the annotation information was then saved in a text file to be used in the network training. A sample of seal keypoint annotations is shown in Fig. 29.

Figure 28. Sample that shows the five selected keypoints.

Figure 29. Sample of annotated images from the Saimaa ringed seal dataset.


The first input to the network training was the original dataset, which consists of seal images with a large variety of light intensities. This could affect the performance of the network in predicting keypoints accurately: the confidence scores of some keypoints were low, and the cost of training increased considerably. Therefore, the original images were segmented (see Fig. 30).

Figure 30. Sample of original and segmented images. Left: original image. Right: segmented image.

6.2 Evaluation criteria

Percentage of Correct Keypoints (PCK) and Percentage of Detected Joints (PDJ) evaluation metrics were used to evaluate the accuracy of the keypoints predicted by the model [45].

PCK calculates the percentage of detections that fall within a normalized distance of the ground truth. For the seal dataset described above, the distance is normalized by the size of the whole body of the seal. PCK can be calculated as

\[
PCK_j = \frac{\sum_{i=1}^{N} \alpha\left( \lVert P_{i,j} - t_{i,j} \rVert - t \cdot b_i \right)}{N} \tag{5}
\]

where $PCK_j$ is the score for the $j$th keypoint over the sample dataset, $N$ is the total number of samples in the dataset, $P_{i,j}$ is the predicted location of keypoint $j$ in the $i$th image, $t_{i,j}$ is the ground truth generated by a 2D Gaussian distribution for the $j$th keypoint, $t$ is a threshold value between 0.05 and 0.1, and $b_i$ is the body length of the seal in the $i$th image. $\alpha(x)$ is a delta function that equals 1 if $x \le 0$ and 0 otherwise.
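A direct implementation of Eq. 5 could look as follows; the array shapes are assumptions (predictions and ground truths stored as (N, K, 2) coordinate arrays):

```python
def pck(pred, gt, body_lengths, t=0.05):
    # pred, gt: (N, K, 2) keypoint coordinates; body_lengths: (N,)
    dist = np.linalg.norm(pred - gt, axis=-1)        # (N, K) Euclidean errors
    correct = dist - t * body_lengths[:, None] <= 0  # alpha(x) = 1 iff x <= 0
    return correct.mean(axis=0) * 100.0              # per-keypoint PCK in %
```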

In PDJ, a detected joint is regarded as correct if the distance between the predicted and the true keypoint is within a certain fraction of the bounding box diagonal. Using the PDJ metric means that the correctness of all detections is evaluated with the same error threshold. PDJ can be computed as

\[
PDJ = \frac{\sum_{i=1}^{N} \operatorname{bool}\left( d_i \le 0.05 \cdot \mathrm{diagonal} \right)}{N} \tag{6}
\]

where $d_i$ is the Euclidean distance between the ground-truth keypoint and the predicted keypoint, $\operatorname{bool}$ is a function that returns 1 if the condition is true and 0 if it is false, and $N$ is the number of keypoints in the image.
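Eq. 6 can be sketched analogously, with the bounding-box diagonal of each image as the normalizer:

```python
def pdj(pred, gt, diagonals, frac=0.05):
    # pred, gt: (N, K, 2) keypoint coordinates; diagonals: (N,)
    dist = np.linalg.norm(pred - gt, axis=-1)     # (N, K) Euclidean errors
    detected = dist <= frac * diagonals[:, None]  # bool(d_i <= frac * diagonal)
    return detected.mean(axis=0) * 100.0          # per-keypoint PDJ in %
```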

6.3 Description of experiments

The stacked hourglass network for seal pose estimation was implemented using Python 3.8 and TensorFlow 1.15. This package allows for easy and flexible network design, as well as for efficient backpropagation and GPU computing. The code is adapted from the available implementation of Newell et al. [30], using the same training and validation partitions.

The network training was done on a remote server with an NVIDIA Tesla V100 GPU. The network was fine-tuned using a few parameters to improve its accuracy in predicting the seal keypoints. Because of GPU memory limitations, the batch size was reduced from the original 16 to 4. The mini-batch gradient descent technique with the RMSprop method [30] was used for training. The learning rate was set to 0.0025 and the optimization was performed for 50 epochs, each using a random subset of the 600 training images. At the start of each epoch, the training data was shuffled before being partitioned into batches. To promote generalization and minimize overfitting, the input data was augmented with random rotations of up to 60 degrees. The learning rate was lowered every 200 iterations when the loss stopped decreasing.
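The rotation augmentation can be sketched as below; applying the same random rotation to both the image and its target heatmaps is an assumption about how the pipeline keeps labels consistent, and the exact thesis code may differ:

```python
import numpy as np
import scipy.ndimage as ndi

def augment(image, heatmaps, max_deg=60):
    # rotate the image and its target heatmaps by the same random angle
    angle = np.random.uniform(-max_deg, max_deg)
    image = ndi.rotate(image, angle, reshape=False, order=1)
    heatmaps = ndi.rotate(heatmaps, angle, reshape=False, order=1)
    return image, heatmaps
```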

In this work, two experimental settings were used based on the depth of the stacked hourglass network, and the accuracy of the predicted keypoints was measured quantitatively and qualitatively. Experimental setting I was applied on the training and validation set; in this case, 600 images were used as the training and test set.

According to the network implementation, the maximum number of stacks is 8. Therefore, in experimental setting I, 8 experiments were done, going from a lower number of stacks to a higher number, in order to see the effect of intermediate supervision on the results achieved by the heatmap output of each hourglass. The 8 network trainings were completed using different parameters based on the depth of the hourglass. Experimental setting II was based on the observations gained from setting I. In this setting, 30 test images were randomly selected depending on the visibility of the keypoints, and the same training parameters were used as in setting I.

Table 2. Experiment settings.

No.   Number of hourglasses       Image size    Experiment setting
1     One stacked hourglass       512 and 256   Settings I and II
2     Two stacked hourglasses     512           Setting I
3     Three stacked hourglasses   512           Setting I
4     Four stacked hourglasses    512 and 256   Settings I and II
5     Five stacked hourglasses    512           Setting I
6     Six stacked hourglasses     512           Setting I
7     Seven stacked hourglasses   512           Setting I
8     Eight stacked hourglasses   512           Setting I

6.4 Results

In this work, the network performance for predicting the seal keypoints was evaluated with the PCK and PDJ evaluation metrics. The PCK and PDJ performance of the model was compared at a normalized distance of 0.5; both metrics were computed over the annotations of all keypoints, normalized by the body size of the seal.

Table 3. PCK evaluation metrics for the five keypoints.

Metric   Muzzle   Head   LeftForeFlipper   RightForeFlipper   Tail   Mean
PCK      54.1     35.9   27.9              29.8               13.0   32.2


Table 4. PDJ evaluation metrics for the five keypoints.

Metric     Muzzle   Head   LeftForeFlipper   RightForeFlipper   Tail
PDJ@0.10   76.7     30.0   66.7              50.0               20.0
PDJ@0.20   76.7     30.0   66.7              50.0               20.0
PDJ@0.30   76.7     30.0   70.0              50.0               20.0
PDJ@0.40   76.7     30.0   76.7              50.0               20.0
PDJ@0.50   76.7     36.7   83.3              50.0               20.0

As shown in Table 3 and Table 4, the error is calculated per keypoint, and the performance of the model is considerably better at localizing the muzzle compared to the other keypoints under both the PCK and PDJ evaluation metrics. The PDJ curves of all keypoints are shown in Fig. 31.

Figure 31. Percentage of detected keypoints on the Saimaa ringed seal dataset.

The results obtained from experimental setting I are given in Table 5. Over the 8 experiments, the prediction accuracy increased from 65.0% to 76.7% as the depth of the network grew from one stack to eight. Stacking more hourglasses thus improves the results, although each individual improvement is fairly marginal.

Table 5. Network training at each number of hourglasses.

No.   Number of hourglasses       Image size    Experiment setting   Prediction accuracy
1     One stacked hourglass       512 and 256   Settings I and II    65.0%
2     Two stacked hourglasses     512           Setting I            70.5%
3     Three stacked hourglasses   512           Setting I            72.0%
4     Four stacked hourglasses    512 and 256   Settings I and II    72.0%
5     Five stacked hourglasses    512           Setting I            72.5%
6     Six stacked hourglasses     512           Setting I            73.0%
7     Seven stacked hourglasses   512           Setting I            75.0%
8     Eight stacked hourglasses   512           Setting I            76.7%

The results obtained from experimental setting II are reported in Table 6. The experiments in this setting were done on seals in a definite pose with visible keypoints (see Fig. 32). An accuracy of 65.0% was obtained with one stack, whereas 78.8% was obtained when the network was trained with four stacks.

Table 6. Network prediction on visible keypoints.

No.   Number of hourglasses      Image size    Experiment setting   Prediction accuracy
1     One stacked hourglass      512 and 256   Setting II           65.0%
2     Four stacked hourglasses   512 and 256   Settings I and II    78.8%

Figure 32. Example of network prediction based on a definite pose and visible keypoints.

The results obtained by the network were promising (see Fig. 33): the method was able to detect even keypoints that were missing from the annotations. This shows that the network is capable of recovering missed keypoints, especially for seals that appear in similar poses.


Figure 33. Example of a successful keypoint prediction. Left: the ground truth. Right: the predicted keypoints. The color indicators represent the localization of the five keypoints: the muzzle is represented by blue, the head by orange, the tip of the left fore flipper by green, the tip of the right fore flipper by red, and the tail by magenta.


As shown in Fig. 34, the sample results predicted by the network throughout all experiments demonstrate that it handles a variety of poses with many occlusions, and the predictions appear rather invariant to changes in these conditions. Nevertheless, the network still made some mistakes, both when estimating poses with fully visible keypoints and when estimating occluded poses that do not deviate much from the norm.

Figure 34. Example of prediction mistakes made by the network. The color indicators represent the localization of the five keypoints: the muzzle is represented by blue, the head by orange, the tip of the left fore flipper by green, the tip of the right fore flipper by red, and the tail by magenta.
