
Computational Engineering and Technical Physics
Computer Vision and Pattern Recognition

Andrei Lushpanov

INSTANCE SEGMENTATION OF LADOGA RINGED SEALS

Master’s Thesis

Examiners: Professor Heikki Kälviäinen
Professor Vladimir Pilidi

Supervisors: M.Sc. Ekaterina Nepovinnykh
D.Sc. Tuomas Eerola
Professor Heikki Kälviäinen


Lappeenranta-Lahti University of Technology LUT
School of Engineering Science

Computational Engineering and Technical Physics
Computer Vision and Pattern Recognition

Andrei Lushpanov

INSTANCE SEGMENTATION OF LADOGA RINGED SEALS

Master’s Thesis 2020

47 pages, 32 figures, 3 tables.

Examiners: Professor Heikki Kälviäinen
Professor Vladimir Pilidi

Keywords: computer vision, machine vision, image processing, pattern recognition, animal re-identification, instance segmentation

Wildlife photo-identification is an important task today since it makes it possible to identify and track animals, helping scientists to monitor endangered species; however, exploring the full volume of images manually is difficult. The first step in the identification pipeline is to segment the depicted animals. Thus, the goal of the thesis is the segmentation of the Ladoga ringed seals. A similar problem has been solved for the Saimaa ringed seals using a binary segmentation method, but this approach does not fit the task at hand since the Ladoga seals are typically captured in groups. Therefore, an instance segmentation method, which distinguishes the individual objects present, is needed. Convolutional Neural Networks (CNNs) are known as highly efficient approaches in this field, which is why Mask R-CNN was selected as the method to study. Six models built on various backbone architectures were trained and evaluated on sets of manually annotated images. Finally, the experiments showed that the ResNet architecture with a feature pyramid was the best approach for segmentation, while ResNet with dilated convolutions was the most accurate at counting the seals.


First of all, I would like to thank my team of supervisors: Professor Heikki Kälviäinen, D.Sc. Tuomas Eerola, and M.Sc. Ekaterina Nepovinnykh for the professional guidance they provided me with during the whole process of working on the thesis. I was able to dive into the research field thanks to their valuable advice.

I also must express my gratitude to my family, especially my parents and Tanya. I would not have been here, in Lappeenranta, without their motivation and help.

Finally, I want to acknowledge the support my friends gave me during the whole year.

Thank you, Vova, Gleb, Yan, Nikita and Seriy.

Lappeenranta, May 25, 2020

Andrei Lushpanov


CONTENTS

1 INTRODUCTION
  1.1 Background
  1.2 Objectives and delimitations
  1.3 Structure of the thesis

2 ANIMAL BIOMETRICS
  2.1 Animal re-identification pipeline
  2.2 Animal detection
  2.3 Animal re-identification
      2.3.1 HotSpotter
      2.3.2 Algorithm for Saimaa ringed seal re-identification

3 CONVOLUTIONAL NEURAL NETWORKS FOR INSTANCE SEGMENTATION
  3.1 Image segmentation
  3.2 Convolutional neural networks
  3.3 Semantic image segmentation
  3.4 Instance segmentation
      3.4.1 MaskLab
      3.4.2 InstanceCut

4 MASK R-CNN FOR SEAL SEGMENTATION
  4.1 Regions with CNN features
      4.1.1 R-CNN
      4.1.2 Fast R-CNN
      4.1.3 Faster R-CNN
  4.2 Mask R-CNN
  4.3 Network architecture

5 EXPERIMENTS
  5.1 Data
  5.2 Evaluation criteria
  5.3 Description of experiments
  5.4 Results
      5.4.1 Experiment 1
      5.4.2 Experiment 2
      5.4.3 Experiment 3

6 DISCUSSION
  6.1 Current study
  6.2 Future work

7 CONCLUSION

REFERENCES


LIST OF ABBREVIATIONS

CNN    Convolutional Neural Network
FCN    Fully Convolutional Network
FPN    Feature Pyramid Network
IoU    Intersection over Union
mAP    mean Average Precision
R-CNN  Regions with CNN features
RoI    Region of Interest
RPN    Region Proposal Network
SIFT   Scale-Invariant Feature Transform
SVM    Support Vector Machine
VIA    VGG Image Annotator


1 INTRODUCTION

1.1 Background

In the past few decades, animal tracking has become an extremely important task for biologists [1]. The reason for this interest is that tracking allows scientists to understand the behavior of animals and to undertake conservation efforts, that is, to impose restrictions on human activities such as infrastructure building and deforestation. Such actions might play a vital role in preventing the extinction of endangered species.

There are different ways of tracking an animal. For example, tagging is a widespread technique, although it disturbs the individual considerably: when an animal is caught, it may get injured or stressed [2]. Photo-identification is a more animal-safe approach since it is based on studying the individuals depicted in images, which can be obtained by automatic wildlife camera traps or by humans. This is a highly convenient and environmentally sound way to capture images of a wildlife population in contrast to tagging [3]. Considering that camera traps provide scientists with huge amounts of data, tracking a particular individual manually is almost impossible. On top of that, the whole identification process takes a lot of time and human resources. To summarize, the process must be automated.

The Ladoga ringed seal is the animal considered in the thesis. The segmentation task has already been solved for another ringed seal subspecies, the Saimaa ringed seal [3]. However, the same approach cannot be used in this case since Ladoga seals are typically captured in groups, unlike the Saimaa subspecies. This means that an instance segmentation technique must be applied in order to correctly segment the depicted seals, as shown in Fig. 1.

In order to perform instance segmentation, the depicted objects must first be detected; in the second step, each instance must be segmented. Thereby, two subtasks require a solution. Convolutional neural networks have proven their reliability for both of these problems during the last decade [4], [5], [6], [7].


Figure 1. Instance segmentation: (a) Original image; (b) Segmented image.

1.2 Objectives and delimitations

The thesis focuses on the automatic segmentation of the Ladoga ringed seals. The specific objectives are stated as follows:

1. to collect a data set of Ladoga ringed seal images and to annotate the contours of the seals manually;

2. to implement, train, and evaluate the existing methods on the collected data set in order to find the most beneficial approach to the problem of the Ladoga ringed seal instance segmentation;

3. to evaluate the number of detected seals per image since this information will be utilized in future work on the Ladoga ringed seal re-identification.

Thus, the final goal is to analyze and compare the applied instance segmentation methods. Developing a new one is not an objective of the thesis. Additionally, the thesis only targets the Ladoga ringed seals, so other species are not considered. Moreover, the re-identification part is not considered either.

1.3 Structure of the thesis

The thesis is organized as follows. Chapter 2 is devoted to animal biometrics. In particular, a theoretical description of the animal re-identification process is presented. Besides that, it contains general information on various techniques that can be utilized to detect or to identify animals. Chapter 3 mainly focuses on CNNs that are suitable for image segmentation; namely, a brief description of the existing CNN-based methods suitable for this task is provided. On top of that, the basic theory of the image segmentation problem is presented, covering both instance and semantic segmentation approaches. Chapter 4 provides a detailed description of the utilized method, Mask R-CNN. Basically, it concentrates on the theoretical background of the technique and its backbone architectures used for the instance segmentation. In Chapter 5, the experiments performed after the implementation part are presented. In particular, this includes the preparation of training data, the comparison of the algorithms, and, finally, the obtained results. Chapter 6 discusses the current research in this field and suggests ways to continue the work. In the end, the conclusions are given in Chapter 7.


2 ANIMAL BIOMETRICS

2.1 Animal re-identification pipeline

Basically, animal re-identification is a process aimed at identifying individual animals. To be more specific, let us assume that there is a database containing certain features, for example fur patterns, that belong to a number of individuals. Given an image with an animal depicted in it, the objective is to determine, based on a comparison of their features, whether any of the already "saved" individuals matches the depicted one, or whether it is a new individual that has not been captured so far. If the individual has not been observed earlier, it must be "saved" into the database.

Generally, the re-identification pipeline can be split into the following steps as shown in Fig. 2 [3]:

1. segmentation;

2. post-processing;

3. identification.

Figure 2. Re-identification pipeline [3].

The main purpose of the segmentation part is to get rid of the background. Otherwise, the identification algorithm is likely to learn to identify the environment instead of the seal itself [8]. As mentioned before, the Ladoga ringed seals prefer to stick together; thereby, the individuals depicted in the image must be distinguished from each other. That is why the binary segmentation approach is not suitable for this task, but instance segmentation is. Both approaches are discussed in Chapter 3.

Post-processing includes steps aimed at obtaining a pelage pattern. In order to get as correct a pattern as possible, image processing techniques such as color normalization and contrast enhancement can be applied [3]. The first one might be useful for fitting all the depicted seals to a similar color histogram since the color of the seals does not change significantly; this operation helps in the identification part. The second technique is necessary in order to make the pattern more visible, that is, to increase the contrast between the dark and the light parts of the pelage.

The identification of various animals, including the Ladoga ringed seals, is possible due to their unique pelage patterns. That is, the saved fur pattern of an individual can be used in the future for comparison with the fur pattern from another image. For example, texture classification methods can be utilized for identification as the seal's pelage pattern forms a texture [2].

2.2 Animal detection

In order to perform the first step of the animal re-identification pipeline, which is segmentation, animals must first be detected. The earliest animal detection algorithms were based on face and head detection approaches [9], [10]. The main problem is that these methods are highly sensitive to the pose of the depicted animal. Of course, there are much more beneficial techniques today that are based on various CNNs. However, it is not practicable to simply apply a CNN to the input image. The problem is that the fully-connected layer receives one vector representing one object of interest, and obviously there may be several animals in the image. Thus, it is preferable to select Regions of Interest (RoIs) and to apply a CNN to each one. Such an approach was proposed by R. Girshick et al. in [4]; this technique is described in Section 4.1.1. Typically, the following CNNs are used for this kind of task due to their reliability and performance: AlexNet [11], VGGNet [12], ResNet [13], GoogleNet [14].


2.3 Animal re-identification

After an animal has been detected, it is possible to check whether this individual has been observed earlier. The first re-identification techniques were based on the selection of qualitative descriptors [1]. The user had to manually select a unique characteristic, such as a nick or a scratch, and save its location in a database containing (characteristic, location) pairs. Such pairs are called descriptors. Each descriptor was coded in a unique way depending on the characteristic and its location. After that, the application calculated the maximum sum of similarities of the descriptors and returned the most similar individual. This approach was used in one of the first methods [15] aimed at the re-identification of humpback whales, whose maximum accuracy was 43%. Obviously, these approaches are obsolete today.

2.3.1 HotSpotter

J. Crall et al. [16] proposed an algorithm called HotSpotter, based on the extraction and comparison of keypoints or "hot spots". An example of a successful comparison can be seen in Fig. 3 [3]. Although two versions of the method were presented, their general description is identical:

1. locate keypoints using the Hessian-Hessian operator applied to the image, generate elliptic regions around the keypoints, and extract associated descriptors using the Scale-Invariant Feature Transform (SIFT);

2. determine image matches based on comparing these descriptors.

The first version (one-vs-one) compares the query image with each image from the database individually. Basically, it processes all the database images, computing a similarity score between each of them and the query. The similarity score of two images is calculated by the following formula:

\sum_{(i,j,r_{i,j}) \in M_D} r_{i,j}    (1)

where M_D is a set of matches. A match is inserted into M_D if the value

r_{i,j} = \frac{\|d_i - q_{j2}\|^2}{\|d_i - q_j\|^2}    (2)

is larger than a threshold t_{ratio} = 1.62, where d_i is the i-th descriptor vector from the database image, and q_j, q_{j2} are the two closest descriptor vectors from the query image. Finally, it ranks all the images depending on the obtained similarity scores. The higher the rank, the higher the probability that the depicted animal is represented in the database image.

The second version (one-vs-many) compares each descriptor from the query image with each one from the database images. This process is performed for each image, but it generates scores for only the k approximate nearest neighbor descriptors, where k is a threshold value. In the end, all scores are summed up in order to obtain the overall similarity score for each individual in the database. Thus, one-vs-many outperforms one-vs-one in both speed and accuracy due to the fact that some descriptors are not taken into account, and therefore unnecessary matching calculations are omitted.

Figure 3. Example of the "HotSpotter" result. The ellipses are "hot spots" [3].
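As a minimal sketch of the one-vs-one scoring above (not the authors' implementation; the SIFT descriptors are assumed to be precomputed, and Python with numpy is used):

import numpy as np

def similarity_score(db_desc, query_desc, t_ratio=1.62):
    # db_desc: (N, 128) descriptors of a database image,
    # query_desc: (M, 128) descriptors of the query image, M >= 2.
    score = 0.0
    for d_i in db_desc:
        # Squared L2 distances from this database descriptor to all query descriptors
        dists = np.sum((query_desc - d_i) ** 2, axis=1)
        j, j2 = np.argsort(dists)[:2]           # nearest and second-nearest
        r_ij = dists[j2] / (dists[j] + 1e-12)   # the ratio of Eq. (2)
        if r_ij > t_ratio:                      # keep only distinctive matches
            score += r_ij                       # the sum of Eq. (1)
    return score

print(similarity_score(np.random.rand(200, 128), np.random.rand(150, 128)))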

2.3.2 Algorithm for Saimaa ringed seal re-identification

Another technique, aimed at the Saimaa ringed seal re-identification, was proposed in [17]. Basically, the whole algorithm can be split into the following steps:

1. segmentation of the seal;

2. cropping the seal to the bounding box;

3. pattern extraction and its division into small patches;


4. searching for the most similar patches in the database.

After the segmentation of the depicted seal has been performed, two postprocessing steps aimed at closing holes and smoothing borders are applied. Without these steps, the pelage pattern is likely to be disrupted. The pattern extraction stage makes it possible to avoid such problems as noise in the image, learning unnecessary features, and the abundance of data needed for training. An important issue within the task is that the observed parts of the pattern of one seal may vary a lot depending on the angle of view; the division into patches overcomes this problem. Finally, a Triplet Neural Network is utilized to calculate the similarities between the patches. During its training, it receives the anchor (a pattern patch), the positive (another patch from the same seal), and the negative (a patch from some other individual). These samples allow the network to encode them in such a way that the L2 metric distance between the anchor and the positive is smaller than the distance between the anchor and the negative by some threshold value. The scheme of the whole method is presented in Fig. 4 [17].

Figure 4. The process of the algorithm [17].
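As a sketch of the triplet criterion described above (the embedding network here is hypothetical and much smaller than the one used in [17]):

import torch
import torch.nn as nn

# Hypothetical patch-embedding CNN for 32x32 grayscale pattern patches
embed = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                      # 32x32 -> 16x16
    nn.Flatten(),
    nn.Linear(16 * 16 * 16, 64),          # 64-dimensional embedding
)

loss_fn = nn.TripletMarginLoss(margin=1.0, p=2)   # L2 metric with a margin

anchor   = embed(torch.randn(8, 1, 32, 32))   # pattern patches
positive = embed(torch.randn(8, 1, 32, 32))   # other patches, same seals
negative = embed(torch.randn(8, 1, 32, 32))   # patches from different seals

# Pushes d(anchor, positive) below d(anchor, negative) by at least the margin
loss = loss_fn(anchor, positive, negative)
loss.backward()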


3 CONVOLUTIONAL NEURAL NETWORKS FOR INSTANCE SEGMENTATION

3.1 Image segmentation

The animals depicted in an image are the only parts necessary for re-identification since there is no need to process the background. Thus, image segmentation is an inherent part of the re-identification pipeline. Basically, it is a process of dividing an image into parts called "segments". In other words, each image pixel receives a label indicating which segment the pixel belongs to. This technique is usually applied for objectives such as image compression or object recognition, since it is extremely inefficient to process the entire image in these cases. That is, image segmentation is a highly profitable way of reducing time and resource expenses in further image processing. Typically, image segmentation methods divide an image into several parts based on certain image features, such as pixel intensity, color, and texture [18].

Basically, image segmentation can be split into supervised and unsupervised. The former approach means that each pixel is manually labeled by a human, whereas the latter performs segmentation without knowing the ground truth classes. In terms of outputs, image segmentation is usually divided into semantic and instance segmentation. A semantically segmented image is an image in which each pixel belongs to one of the following classes: background, class 1, class 2, ..., class n. Instance segmentation, in turn, labels pixels with more classes, differentiating instances of the same class: background, object 1, object 2, ..., object n [19].

Thus, a tool is required in order to perform image segmentation. Today, CNNs are the state-of-the-art techniques in this field since they have proven their usefulness [7], [20].

3.2 Convolutional neural networks

Although the first CNN was described in 1998, it has become a widespread technique for image segmentation only in the last eight years [4]. The reason why CNNs are so beneficial for image segmentation is that they are able to extract features from images with minimal preprocessing.


Basically, a CNN operates in three stages [21]:

1. convolution;

The 'filter' is passed over the image, viewing several pixels at a time. Convolution can be defined as a dot product of two matrices with a subsequent summation. The first matrix contains the original pixel values, and the other one is the filter. After the multiplication has been completed, the results are summed into one number, which is the first element of the feature map. The multiplications and summations (with the shifted filter) are repeated until the entire image has been covered by the filter, and the feature map is complete. The number and size of the filters are parameters to be determined.

2. pooling;

'Pooling' (also known as subsampling) is a technique for extracting the most important parts from the feature map obtained after the convolutional layer. This layer uses another filter to extract one value from each group of feature map values. Max pooling is the most common type of pooling; it selects the maximum number from each group. Generally, the pooling layer preserves the most important information and cuts off the less important. On top of that, it speeds up the training process due to the reduction of the feature maps' dimensions.

3. fully-connected neural network;

The last CNN layer is a multilayer perceptron. It takes the results of the previous layers as input and returns probabilities for each label.

The first two steps can be repeated several times in order to pass more feature maps to the fully-connected neural network as shown in Fig. 5 [21].

Figure 5. Original CNN structure [21].


In the example presented in Fig. 5, the input is a 32×32 grayscale image which goes through the first convolutional layer with six filters of size 5×5 and a stride equal to 1, changing the dimension from 32×32 to 28×28×6. After that, the pooling layer with a filter of size 2×2 and a stride of 2 reduces the dimension to 14×14×6. After the convolutional and pooling steps have been repeated one more time, the dimension is 5×5×16. The fifth layer is a fully connected convolutional layer with 120 feature maps of size 1×1. Each of the 120 units is connected to all the 400 nodes (5·5·16) from the previous layer. The following layer is also fully connected, containing 84 units. The final layer is the softmax output layer with 10 possible values corresponding to the digits from 0 to 9. The softmax probability layer means that the sum of all the values in this layer is 1.
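For illustration, the network of Fig. 5 can be transcribed layer by layer as follows (a sketch in PyTorch; max pooling and ReLU are used here in place of the original subsampling and saturating activations):

import torch
import torch.nn as nn

lenet5 = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5, stride=1),   # 32x32 -> 28x28x6
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),      # -> 14x14x6
    nn.Conv2d(6, 16, kernel_size=5, stride=1),  # -> 10x10x16
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),      # -> 5x5x16
    nn.Flatten(),                               # -> 400 nodes (5*5*16)
    nn.Linear(400, 120),                        # fully connected, 120 units
    nn.ReLU(),
    nn.Linear(120, 84),                         # fully connected, 84 units
    nn.ReLU(),
    nn.Linear(84, 10),                          # digits 0..9
    nn.Softmax(dim=1),                          # the output values sum to 1
)

probs = lenet5(torch.randn(1, 1, 32, 32))
print(probs.sum())                              # tensor(1.)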

In terms of image segmentation, a CNN typically receives an input image of three dimensions: height, width, and number of channels. Thus, each feature map before the first convolution consists of intensity values, for example for the red, green, and blue colors. There is a wide range of possible filters for this case, which can extract various types of features, such as edges or color intensities. So, CNN-based approaches fit the task of segmenting images with the Ladoga ringed seals.

3.3 Semantic image segmentation

Before describing instance segmentation, semantic segmentation should be considered, as it is a similar but simpler task. Semantic image segmentation is a type of image segmentation aimed at labeling each image pixel in such a way that pixels belonging to one class are labeled identically. An example is shown in Fig. 6 [22].


Figure 6. Semantic segmentation: (a) Original image; (b) Segmented image [22].


The encoder-decoder network is one of the possible CNN-based techniques for semantic segmentation. Essentially, this architecture contains two neural networks:

• encoder: produces a feature vector from the input image (that is, combining features at several levels);

• decoder: produces a semantic segmentation mask utilizing the obtained feature vector (that is, decoding the features combined by the encoder).

For example, SegNet [23] is a method based on the encoder-decoder network. A significant characteristic of SegNet is that there are no fully connected layers; it is fully convolutional. The encoder network consists of 13 convolutional layers, each of which has a corresponding decoder layer; therefore, there are 26 convolutional layers in total. Finally, the soft-max classifier takes the decoder output as input in order to perform the pixel-wise classification. The SegNet architecture is shown in Fig. 7 [23].

Figure 7. SegNet architecture [23].
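As a toy sketch of this encoder-decoder idea (a deliberately tiny network, not SegNet itself; like SegNet, it reuses the encoder's max-pooling indices when upsampling in the decoder):

import torch
import torch.nn as nn

class TinyEncoderDecoder(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2, stride=2, return_indices=True)
        self.unpool = nn.MaxUnpool2d(2, stride=2)
        self.dec = nn.Conv2d(16, n_classes, 3, padding=1)

    def forward(self, x):
        x = self.enc(x)
        x, idx = self.pool(x)      # remember where the maxima were
        x = self.unpool(x, idx)    # upsample back to the pre-pooling size
        return self.dec(x)         # per-pixel class scores

logits = TinyEncoderDecoder()(torch.randn(1, 3, 64, 64))
print(logits.shape)                # torch.Size([1, 2, 64, 64])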

Semantic segmentation is a difficult task because of a fairly common problem: the color and the pattern of the animal's fur are similar to the background coloration. So, one possible way to segment the animals presented in an image is to divide the image into segments or superpixels [2]. After this, it is possible to utilize two classes: superpixels belonging to the animal, and background. Ideally, these classes should not overlap.

3.4 Instance segmentation

Instance segmentation is the problem of identifying and highlighting individual instances of one or more semantic classes in an image as shown in Fig. 8 [22].


Figure 8. Instance segmentation: (a) Original image; (b) Segmented image [22].

Comparing Fig. 6 to Fig. 8, it can be seen that the cubes presented in the first image belong to one class, whereas the cubes in the second one are labeled differently.

Usually, the number of such instances is unknown. Thus, this task can be considered a more complicated analogue of the semantic segmentation problem [19]. The instance segmentation problem can be partitioned into two subtasks [7]:

1. object detection;

All individual objects must be classified, and all object instances must be localized using bounding boxes.

2. segmentation of each instance.

Each pixel must be labeled with a class from a fixed set. It is also noteworthy that object instances do not necessarily belong to the same class.

3.4.1 MaskLab

MaskLab [20] is a widespread instance segmentation technique. Generally, it is based on three blocks:

1. bounding box prediction;

2. differentiation of objects of different classes using the semantic segmentation;

3. separation of instances of the same class using the direction prediction.

The first step is performed by Faster R-CNN, which is described in Section 4.1.3. The second and third steps are based on logits. These logits are calculated by an additional convolution of size 1×1 added after the last feature map generated by the feature extractor. Thus, logits for classification and logits for predicting the direction toward the center of an instance are obtained. These values make it possible to perform the instance segmentation for each RoI. To be more specific, Faster R-CNN provides the class prediction, and thereby the logits of this class are chosen to determine the region that is cropped. After that, direction pooling is applied in the RoI. Finally, the two features are combined and passed through a 1×1 convolution for the foreground/background segmentation. The architecture of MaskLab is presented in Fig. 9 [20].

Figure 9. MaskLab architecture [20].

3.4.2 InstanceCut

Not all methods operate in the two aforementioned steps. For example, InstanceCut [24] can be divided into the following stages:

1. the semantic segmentation block calculates per-pixel semantic class scores. Basically, these scores are log probabilities for each class label and each pixel in the input image, in other words, how likely it is that a certain pixel belongs to each class of the set;

2. the instance-aware edge detection block calculates log probabilities of an object boundary for all the pixels, in other words, how likely it is that a certain pixel belongs to an object boundary. These scores are independent of classes;


3. the image partitioning block performs instance segmentation based on the obtained scores.

Assigning every image pixel a log probability value is a very resource-consuming task. It is possible to optimize the method greatly by utilizing the superpixel approach.

The steps of InstanceCut are presented in Fig. 10 [24].

Figure 10. The process of InstanceCut [24].

Mask R-CNN [7] is one of the most commonly used methods for CNN-based instance segmentation. It was chosen for the task solution due to the high quality of its results. Its description is presented in Chapter 4.


4 MASK R-CNN FOR SEAL SEGMENTATION

Mask R-CNN was selected for the Ladoga ringed seal instance segmentation as a technique promising good results. The reason is that it was tested on a huge data set containing images of various content [7], which, among other things, included a wide range of animals. Thus, there were no obstacles to applying it to the custom data set of the Ladoga ringed seals. Considering the fact that Mask R-CNN is based on the object detection technique named Regions with CNN features (R-CNN), the latter is described first along with its improved versions.

4.1 Regions with CNN features

4.1.1 R-CNN

R-CNN [4] is one of the most widespread algorithms for object detection based on RoIs. Generally, it can be split into the following steps:

1. generation of region proposals;

2. obtaining a feature vector from each region utilizing a CNN;

3. classification of each region with linear SVMs (Support Vector Machines).

The first step is carried out with selective search [25], although R-CNN is flexible in terms of the algorithm generating the region proposals. It can output thousands of RoIs.

In the second stage, AlexNet, proposed by A. Krizhevsky et al. [11], is used. It extracts a feature vector of size 4096 from each RoI. Regarding its architecture, it consists of seven layers: five convolutional and two fully-connected, although there are three fully-connected ones in the original paper.

After that, SVMs that have been trained for each class independently take every feature vector as input and produce confidence scores. Finally, regions having Intersection over Union (IoU) overlaps with regions having higher scores are rejected. This action is called greedy non-maximum suppression and is performed for each class independently. In addition to the class, R-CNN predicts a bounding box with four offset values. The architecture of R-CNN is presented in Fig. 11 [4]. The concept of IoU is described in Section 5.2.

Figure 11. R-CNN architecture [4].

Although R-CNN shows a high level of accuracy in detecting objects, there are three notable shortcomings [5]:

1. training is a multi-stage pipeline;

2. training is expensive in space and time;

3. object detection is slow.

Generally, the main reason for the slowness is that R-CNN applies a CNN to each region proposal.

4.1.2 Fast R-CNN

R. Girshick proposed Fast R-CNN [5], which fixes the disadvantages of R-CNN. The steps of the algorithm are as follows:

1. several convolutional and max pooling layers are applied to the input image in order to generate a feature map;

2. a RoI pooling layer is applied to each object proposal in order to extract feature vectors from the feature map;

3. a sequence of fully-connected layers takes each feature vector as an input.


Basically, the first step is a Fully Convolutional Network (FCN), which is quite similar to a classical CNN but is able to take an arbitrary-sized image as input. This is not possible for CNNs because of the final fully-connected layer demanding a fixed-length input. Thus, FCNs consist only of convolutional and pooling layers, which allows processing images of various sizes.

The RoI pooling layer is based on max pooling of RoI sub-windows. To be more specific, a RoI can be considered a window of size h × w. Having set the hyper-parameters H and W, which are independent of the RoIs, the RoI is divided into sub-windows of approximate size h/H × w/W. After that, max pooling is applied to every sub-window, producing a feature map of size H × W.
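For illustration, a ready-made implementation of this layer is available in torchvision (the shapes below are arbitrary):

import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 256, 50, 50)   # output of the convolutional trunk
# One RoI per row: (batch index, x1, y1, x2, y2) in feature-map coordinates
rois = torch.tensor([[0.0, 4.0, 4.0, 36.0, 20.0],
                     [0.0, 10.0, 12.0, 45.0, 45.0]])
# Every h x w region is pooled to the same fixed H x W grid, here 7 x 7
pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1.0)
print(pooled.shape)                          # torch.Size([2, 256, 7, 7])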

Finally, after a feature vector has been processed by the fully-connected layers, there are two output vectors: softmax probabilities over all classes plus background, and bounding-box regression offsets (four real-valued numbers). The architecture of Fast R-CNN is presented in Fig. 12 [5].

Figure 12. Fast R-CNN architecture [5].

4.1.3 Faster R-CNN

The previous two methods need an algorithm for generating region proposals, for example selective search. The problem is that such algorithms are very time-consuming. S. Ren et al. [6] proposed a new concept for finding RoIs called the Region Proposal Network (RPN), which shares convolutional layers with Fast R-CNN. Thus, Faster R-CNN is composed of two modules:


1. RPN;

It takes an image of any size as input and outputs a set of rectangular object proposals, each with an objectness score. "Objectness" measures membership in a set of object classes vs. background.

2. the Fast R-CNN detector, which utilizes the proposed regions, namely extracts features from each region proposal and performs the classification and bounding-box regression.

If the RPN and Fast R-CNN are trained independently, they will change their convolutional layers in different ways. In this case, the required unified neural network cannot be created. Thus, a technique that allows sharing the convolutional layers is needed. A possible way to train the networks with shared features was introduced in [6]. It is known as alternating training:

1. RPN is trained;

2. Fast R-CNN is trained utilizing the proposals of RPN;

3. the network tuned by Fast R-CNN is used to initialize RPN;

4. the process is iterated.

The architecture of Faster R-CNN is presented in Fig. 13 [26].

Figure 13. Faster R-CNN architecture [26].

4.2 Mask R-CNN

Generally, Mask R-CNN [7] performs the same two-stage procedure as Faster R-CNN does. The first stage preserves the RPN procedure. However, at the second stage, a branch is added to predict a segmentation mask for each region of interest. This mask branch is an FCN applied to each RoI, predicting a segmentation mask in a pixel-to-pixel manner. Thus, there are three outputs for each RoI: class, box offset, and binary mask. The architecture is shown in Fig. 14 [27].

One of the key elements of Mask R-CNN is the RoI alignment layer, a slightly altered version of the RoI pooling layer presented in [5]. The problem with RoI pooling is that it divides region proposals into sub-windows using rounding, so the sizes of such sub-windows differ. Although this does not affect the classification and detection tasks, it results in misalignments that have a negative impact on the prediction of pixel-wise masks. The solution is to replace this sharp approximation with bilinear interpolation [7]. This technique makes it possible to obtain sub-windows of the same size, and therefore the output after max/average pooling is more accurate.
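Both variants are available off the shelf, e.g. in torchvision, which makes the difference easy to see (a sketch with arbitrary shapes):

import torch
from torchvision.ops import roi_align, roi_pool

feature_map = torch.randn(1, 256, 50, 50)
rois = torch.tensor([[0.0, 4.3, 4.7, 36.2, 20.9]])   # fractional coordinates

# RoI pooling snaps the box to the integer grid (rounding), whereas
# RoI alignment samples the features with bilinear interpolation
p = roi_pool(feature_map, rois, output_size=(7, 7))
a = roi_align(feature_map, rois, output_size=(7, 7), sampling_ratio=2)
print(p.shape, a.shape)     # both torch.Size([1, 256, 7, 7])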

4.3 Network architecture

In Mask R-CNN, the backbone network is a CNN used for feature extraction from the input image. K. He et al. suggested six different backbone architectures [7]. These backbone architectures are denoted as "network-depth-features". The corresponding models are trained with ResNet [13] with a depth of 50 or 101 layers.


Figure 14. Mask R-CNN architecture [27].

The three types of features are as follows:

1. Feature Pyramid Network (FPN) [28];

An architecture with lateral connections that makes it possible to generate a feature pyramid from a single-scale input within the network.

2. C4;

The original Faster R-CNN with ResNet, which extracts features from the final layer of the fourth stage.

3. DC5;

Similar to C4, but with features extracted from the fifth stage using dilated convolution.

The authors determined that the deeper the network, the better the performance. That is why ResNet is highly profitable within this method, since it overcomes the degradation problem [13]. The problem is as follows: as the depth of a network increases, the accuracy gets saturated and then degrades rapidly. The proposed solution is called deep residual learning.

Let us assume that there are multiple stacked layers which receive an input x and return a desired mapping H(x). The authors suggest letting these stacked layers fit a residual mapping F(x) = H(x) − x, which can be transformed into H(x) = F(x) + x. The reason for this is that it is easier to optimize the residual mapping F(x) than the original H(x). In terms of CNNs, the formula F(x) + x can be realized by utilizing shortcut connections that skip one or multiple layers. In ResNet, such shortcut connections perform just the identity mapping, that is, they do not alter x, and their output is simply added to the output of the stacked layers. The essence of a shortcut connection is presented in Fig. 15.

Figure 15. Residual learning scheme [13].
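A minimal residual block in this spirit might look as follows (a sketch; batch normalization, which ResNet also uses, is omitted for brevity):

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(                  # the residual mapping F(x)
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        # The shortcut performs the identity mapping: x is added unchanged
        return self.relu(self.f(x) + x)

y = ResidualBlock(64)(torch.randn(1, 64, 28, 28))
print(y.shape)    # torch.Size([1, 64, 28, 28])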

In order to apply Mask R-CNN to the instance segmentation of the Ladoga ringed seals, existing models trained on a large number of different object categories can be utilized together with transfer learning. In practice, during training on the custom data set, the final network layer is altered to detect only one class, "seal".
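A minimal sketch of how such a transfer learning setup looks in Detectron2 (the configuration actually used in the thesis is not reproduced here; the model zoo file name below is one of the standard Detectron2 configs):

from detectron2 import model_zoo
from detectron2.config import get_cfg

cfg = get_cfg()
# Start from a Mask R-CNN pretrained on COCO (ResNet-101-FPN backbone);
# analogous configs exist for the C4 and DC5 backbones
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1   # a single class: "seal"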


5 EXPERIMENTS

5.1 Data

Several data sets containing images of the Ladoga ringed seals were available, photographed either by wildlife camera traps or by humans. 100 and 50 images were selected for the training and validation purposes, respectively. The images were selected to be as dissimilar as possible in order to train the models better. Thus, they varied a lot in quality and in the number of individuals depicted, ranging from 1 to 19. After the data sets had been organized, all the images were manually annotated: the training set for training the models, and the validation set for evaluating the results. A few challenging examples are presented in Fig. 16.

Figure 16. Examples of data set images.

To perform the manual annotation of the seal contours, the open source graphical image annotation tool "VGG Image Annotator (VIA)" [29] was used. This application can be used to label the seal contours as shown in Fig. 17.


Figure 17. Annotated seals using the online version of VIA.
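A sketch of how such VIA polygon annotations can be converted into Detectron2's dataset dicts (the field access follows a common VIA JSON export layout, which is an assumption; key names differ between VIA versions):

import json
import os
from PIL import Image
from detectron2.structures import BoxMode

def via_to_dataset_dicts(json_file, img_dir):
    with open(json_file) as f:
        via = json.load(f)
    records = []
    for idx, ann in enumerate(via.values()):
        path = os.path.join(img_dir, ann["filename"])
        with Image.open(path) as im:
            width, height = im.size
        regions = ann["regions"]
        if isinstance(regions, dict):       # VIA 1.x stores regions as a dict
            regions = regions.values()
        objs = []
        for r in regions:
            sa = r["shape_attributes"]      # polygon contour of one seal
            px, py = sa["all_points_x"], sa["all_points_y"]
            poly = [c for xy in zip(px, py) for c in xy]
            objs.append({
                "bbox": [min(px), min(py), max(px), max(py)],
                "bbox_mode": BoxMode.XYXY_ABS,
                "segmentation": [poly],
                "category_id": 0,           # the single class "seal"
            })
        records.append({"file_name": path, "image_id": idx,
                        "height": height, "width": width,
                        "annotations": objs})
    return records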

5.2 Evaluation criteria

The accuracy evaluation was based on the F1 score, which combines two metrics: precision and recall. The larger the value of F1, the more precise the result. The formula for F1 is as follows:

F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}},    (3)

where

\text{precision} = \frac{TP}{TP + FP},    (4)

\text{recall} = \frac{TP}{TP + FN}.    (5)

TP, FP, and FN stand for True Positive, False Positive, and False Negative, respectively.

The concept of IoU must be described in order to define these parameters. Suppose there is one object of interest in an image. Then there is a ground truth boundary, which is annotated manually, and a detected boundary. IoU is calculated in the following way:

\text{IoU} = \frac{\text{intersection area}}{\text{union area}}.    (6)

Let us assume that the IoU threshold is set to t. An object is then counted as:

• true positive if IoU ≥ t;

• false positive if IoU < t;

• false negative if the ground truth is present, but the model did not detect it.

A true negative is every non-ground-truth object that is not detected, and it is therefore useless in the evaluation. Since Mask R-CNN outputs a bounding box and a segmentation mask for each detected instance, the IoU can be calculated for both of them. One more important accuracy evaluation term is the mean Average Precision (mAP), which is the average precision over the IoU thresholds t = 0.5 . . . 0.95 with a step size of 0.05.
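A small sketch of these criteria applied to binary masks (Eqs. (3)-(6); the masks below are synthetic):

import numpy as np

def mask_iou(pred, gt):
    # IoU of two boolean masks, Eq. (6)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 0.0

def f1_score(tp, fp, fn):
    # F1 from TP/FP/FN counts, Eqs. (3)-(5)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

pred = np.zeros((100, 100), bool); pred[10:60, 10:60] = True
gt   = np.zeros((100, 100), bool); gt[20:70, 20:70] = True
print(mask_iou(pred, gt))   # ~0.47, i.e., a false positive at t = 0.5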

For the counting evaluation, the Pearson correlation coefficient was used. It is calculated in the following way:

\rho_{X,Y} = \frac{\text{cov}(X, Y)}{\sigma_X \sigma_Y},    (7)

where cov(X, Y) is the covariance of the variables X and Y, and σ_X and σ_Y are the standard deviations of X and Y, respectively. Basically, the coefficient shows the type of correlation, or its absence, between the two variables. If ρ < 0, there is a negative correlation between the considered variables: the larger the value of one variable, the smaller the value of the other. ρ = 0 means that there is no linear correlation. If ρ > 0, there is a positive correlation: the larger the value of one variable, the larger the value of the other. The boundary values of the coefficient are −1 and 1, representing totally negative and totally positive correlation, respectively.

5.3 Description of experiments

A number of pretrained models with various backbone architectures were applied in order to evaluate their performance and compare the results. The following backbone architectures were evaluated and compared:

1. ResNet-50-FPN;

2. ResNet-50-C4;


3. ResNet-50-DC5;

4. ResNet-101-FPN;

5. ResNet-101-C4;

6. ResNet-101-DC5.

Experiment 1 was held to compare the aforementioned models. After they had been trained and applied to the validation set, their performance was evaluated based on F1 scores and mAPs. Experiment 2 was performed in order to understand how the testing threshold influences the F1 score, and to determine the most favorable value for it. This threshold defines whether a detected object is a seal or not. Experiment 3 aimed at obtaining the number of seals detected per image. Besides that, Pearson correlation coefficients were calculated for each model in order to determine whether there is a linear correlation between the numbers of ground truth and detected seals.

All experiments were conducted under the same conditions, i.e., the training and validation sets did not vary. For the practical part of the experiments, Detectron2 [30] was utilized, a framework containing state-of-the-art object detection and semantic/instance segmentation algorithms along with pretrained models.

5.4 Results

5.4.1 Experiment 1

The comparisons of the different architectures based on mAPs and F1 scores are presented in Table 1 and Table 2, respectively.

The same image segmented by each of the models is presented in Fig. 18.

5.4.2 Experiment 2

The graphs showing the dependence of each architecture's performance on the testing threshold for segmentation and bounding boxes are presented in Fig. 19 and Fig. 20, respectively. The threshold values are located on the X-axis, and the F1 score is on the Y-axis.

Table 1. Accuracy of architectures based on mAPs with the testing threshold equal to 0.8.

Architecture      mAP, Segmentation   mAP, Bounding box
ResNet-50-FPN     66.9%               61.5%
ResNet-50-C4      68.5%               59.9%
ResNet-50-DC5     68.0%               64.0%
ResNet-101-FPN    71.1%               66.4%
ResNet-101-C4     66.8%               52.5%
ResNet-101-DC5    69.9%               62.9%

Table 2. Accuracy of architectures based on F1 scores with the testing threshold equal to 0.8.

Architecture      F1, Segmentation    F1, Bounding box
ResNet-50-FPN     0.259               0.233
ResNet-50-C4      0.270               0.239
ResNet-50-DC5     0.273               0.261
ResNet-101-FPN    0.290               0.276
ResNet-101-C4     0.263               0.212
ResNet-101-DC5    0.280               0.261

5.4.3 Experiment 3

There are two figures for each model. The first shows two plots: the numbers of ground truth and detected seals per image, with the 50 validation images on the X-axis and the numbers of seals on the Y-axis. The second is a scatter plot showing the relation between the numbers of ground truth and detected seals. The graphs for all models are presented in Figs. 21-32. The correlation coefficients for the various architectures are presented in Table 3.

Table 3. Pearson correlation coefficients for architectures with the testing threshold equal to 0.8.

Architecture      ρ
ResNet-50-FPN     0.70
ResNet-50-C4      0.87
ResNet-50-DC5     0.91
ResNet-101-FPN    0.81
ResNet-101-C4     0.89
ResNet-101-DC5    0.97


Figure 18. The same image segmented by each model: (a) ResNet-50-FPN; (b) ResNet-50-C4; (c) ResNet-50-DC5; (d) ResNet-101-FPN; (e) ResNet-101-C4; (f) ResNet-101-DC5.


Figure 19. F1 scores for segmentation for each architecture.

Figure 20. F1 scores for bounding boxes for each architecture.


Figure 21. Number of detected seals by ResNet-50-FPN.

Figure 22. Relation of ground truth and detected numbers of seals by ResNet-50-FPN (ρ = 0.70).

Figure 23. Number of detected seals by ResNet-50-C4.

Figure 24. Relation of ground truth and detected numbers of seals by ResNet-50-C4 (ρ = 0.87).

Figure 25. Number of detected seals by ResNet-50-DC5.

Figure 26. Relation of ground truth and detected numbers of seals by ResNet-50-DC5 (ρ = 0.91).

Figure 27. Number of detected seals by ResNet-101-FPN.

Figure 28. Relation of ground truth and detected numbers of seals by ResNet-101-FPN (ρ = 0.81).

Figure 29. Number of detected seals by ResNet-101-C4.

Figure 30. Relation of ground truth and detected numbers of seals by ResNet-101-C4 (ρ = 0.89).

Figure 31. Number of detected seals by ResNet-101-DC5.

Figure 32. Relation of ground truth and detected numbers of seals by ResNet-101-DC5 (ρ = 0.97).


6 DISCUSSION

6.1 Current study

Today, more and more animals may disappear due to climate change and human activities. Tracking animal individuals may help to save them. That is why animal re-identification, which is impossible without instance segmentation, is a highly significant task, and it is not going to become less in demand. Unfortunately, there is no unified solution to this problem, so each animal species, especially the endangered ones, needs such a system. Although there are already working systems, they are not enough.

This work focuses on the problem of instance segmentation of the Ladoga ringed seals. To achieve positive results, one can apply a wide range of algorithms. However, Mask R-CNN is one of the most advantageous methods since it has been shown to provide state-of-the-art performance in similar tasks.

The objectives stated in Section 1.2 were fully completed. The data set of the Ladoga ringed seals was manually annotated and successfully utilized for the training and validation purposes. Additionally, various architectures of Mask R-CNN were trained, evaluated, and compared. Experiment 1 showed that the highest F1 score and mAP obtained for segmentation were 0.290 and 71.1%, and for bounding boxes 0.276 and 66.4%, respectively. These results were provided by ResNet-101-FPN, which outperformed the other architectures in accuracy. The reason for the success of this particular model is that ResNet copes with high depth very efficiently. Moreover, FPN helped to outperform C4 and DC5 with 101 layers, although it showed the worst accuracy with 50 layers.

Experiment 2 was held in order to determine the dependence of the models on the testing threshold by comparing their F1 scores. The graph for segmentation showed that only three models (ResNet-50-C4, ResNet-101-C4, ResNet-101-FPN) reached a local maximum at the value of 0.8, whereas the remaining ones (ResNet-50-DC5, ResNet-50-FPN, ResNet-101-DC5) were at a local minimum. The graph showing F1 scores for bounding boxes is unequivocal in this sense: all models are at a local maximum when the threshold equals 0.8. There is no need to test lower thresholds since the models would detect higher numbers of false positive seals, which already happens at 0.8 in some cases. Also, the threshold cannot be set to 1, since there would be no detected instances: it would require the bounding box values and the segmentation mask to match precisely, and the probability of this is extremely low. Overall, 0.8 appears to be the most suitable value for the testing threshold. Moreover, this experiment explicitly shows the predominance of ResNet-101-FPN in accuracy.

Experiment 3 was conducted in order to obtain information on the number of detected seals per image. The obtained plots showed that the greatest difference between the numbers of ground truth and predicted seals occurs in the images depicting a high number of seals, more than ten. This is expected, since such images are the most challenging due to the fact that the Ladoga seals tend to crowd. In some images, the number of detected instances exceeds the actual number. This can be explained by poor image quality, and, particularly in these data sets, the models typically detect stones as Ladoga seals. Also, one seal might be detected as two or more because of its position, but this mostly happens when the testing threshold is less than 0.8. On top of that, the Pearson correlation coefficients showed a positive correlation for all models, meaning that when the ground truth number of the depicted seals increases, the models also tend to detect more seals. In the counting experiment, ResNet-101-DC5 showed the highest accuracy with the correlation coefficient ρ = 0.97.

6.2 Future work

Although Mask R-CNN achieved good results, it does not mean that this technique is the best one. Other methods in this field need to be studied and applied to the considered task in order to compare them and conclude whether Mask R-CNN is the most beneficial approach or not. Besides that, the amount of training images should be increased to determine whether this can improve the accuracy of the method. In addition, more classes should be added to the training set, since the current models can detect only one class, "seal". This could significantly improve the accuracy, since some objects are occasionally mistaken for Ladoga ringed seals. Such additional classes should cover the most frequent objects within the data set images, for example "bird", "stone", "bush", and "tree".

Obviously, this was just the first step towards the Ladoga ringed seal re-identification. Later, a method for extracting the pelage pattern is needed. The one that has already been implemented for the Saimaa ringed seals might be utilized, since the circumstances of the task are the same. Finally, the algorithm for matching individuals has to be implemented. Again, the identical problem has already been solved for the Saimaa species, and it requires testing with the Ladoga seals.


7 CONCLUSION

The foremost goal of the Master's thesis was to develop a system able to perform the Ladoga ringed seal instance segmentation. At the beginning, the task was divided into three sub-tasks: preparation of the training data using the given data sets, that is, manual annotation of the seals' contours; implementation, training, and evaluation of the existing instance segmentation methods; and counting the numbers of the detected seals in the images. The training and testing data sets were annotated and successfully utilized during the training and validation processes.

Mask R-CNN was selected for the Ladoga ringed seal instance segmentation because it had proved to be among the most beneficial approaches in this field. Generally, it succeeded with the task, showing a good mAP performance within the range of 52.5-66.4% for bounding boxes and 66.8-71.1% for segmentation. Additionally, F1 scores varied in the range of 0.212-0.276 and 0.259-0.290 for bounding boxes and segmentation, respectively. The results of the experiments showed that the backbone architecture (ResNet-50-FPN, ResNet-50-C4, ResNet-50-DC5, ResNet-101-FPN, ResNet-101-C4, ResNet-101-DC5) matters, since the performance varies depending on it. After the evaluation had been performed, the Mask R-CNN architecture ResNet-101-FPN was identified as the most favorable approach to the Ladoga ringed seal instance segmentation problem, since it showed the best accuracy compared to the other architectures. However, ResNet-101-DC5 appeared to be the most suitable architecture for counting the seals.

REFERENCES

[1] Stefan Schneider, Graham Taylor, Stefan Linquist, and Stefan Kremer. Past, present, and future approaches using computer vision for animal re-identification from camera trap data. arXiv:1811.07749, 2018.

[2] Artem Zhelezniakov, Tuomas Eerola, Meeri Koivuniemi, Miina Auttila, Riikka Levänen, Marja Niemi, Mervi Kunnasranta, and Heikki Kälviäinen. Segmentation of Saimaa ringed seals for identification purposes. In International Symposium on Visual Computing (ISVC), pages 227–236, 2015.

[3] Tina Chehrsimin, Tuomas Eerola, Meeri Koivuniemi, Miina Auttila, Riikka Levänen, Marja Niemi, Mervi Kunnasranta, and Heikki Kälviäinen. Automatic individual identification of Saimaa ringed seals. IET Computer Vision, 12(2):146–152, 2018.

[4] Ross B. Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 580–587, 2014.

[5] Ross Girshick. Fast R-CNN. In the IEEE International Conference on Computer Vision (ICCV), pages 1440–1448, 2015.

[6] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 39(6):1137–1149, 2017.

[7] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. Mask R-CNN. In the IEEE International Conference on Computer Vision (ICCV), pages 2980–2988, 2017.

[8] Ekaterina Nepovinnykh, Tuomas Eerola, Heikki Kälviäinen, and Gleb Radchenko. Identification of Saimaa ringed seal individuals using transfer learning. In the 19th International Conference on Advanced Concepts for Intelligent Vision Systems (ACIVS), pages 211–222, 2018.

[9] Tilo Burghardt, Janko Calic, and Barry T. Thomas. Tracking animals in wildlife videos using face detection. In European Workshop on the Integration of Knowledge, Semantics and Digital Media Technology, 2004.

[10] Weiwei Zhang, Jian Sun, and Xiaoou Tang. From tiger to panda: Animal head detection. IEEE Transactions on Image Processing (TIP), 20(6):1696–1708, 2011.

[11] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In the 25th International Conference on Neural Information Processing Systems (NIPS), pages 1097–1105, 2012.

[12] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.

[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.

[14] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[15] Sally A. Mizroch, Judith A. Beard, and Macgill Lynde. Computer assisted photo-identification of humpback whales. Report of the International Whaling Commission, pages 63–70, 1990.

[16] Jonathan P. Crall, Charles V. Stewart, Tanya Y. Berger-Wolf, Daniel I. Rubenstein, and Siva R. Sundaresan. HotSpotter — patterned species instance recognition. In the IEEE Workshop on Applications of Computer Vision (WACV), pages 230–237, 2013.

[17] Ekaterina Nepovinnykh, Tuomas Eerola, and Heikki Kälviäinen. Siamese network based pelage pattern matching for ringed seal re-identification. In the IEEE Winter Conference on Applications of Computer Vision (WACV) Workshop, 2020.

[18] Dilpreet Kaur and Yadwinder Kaur. Various image segmentation techniques: A review. International Journal of Computer Science and Mobile Computing (IJCSMC), 3(5):809–814, 2014.

[19] Victor Kulikov, Victor Yurchenko, and Victor Lempitsky. Instance segmentation by deep coloring. arXiv:1807.10007, 2018.

[20] Liang-Chieh Chen, Alexander Hermans, George Papandreou, Florian Schroff, Peng Wang, and Hartwig Adam. MaskLab: Instance segmentation by refining object detection with semantic and direction features. In the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[21] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[22] Alberto Garcia-Garcia, Sergio Orts-Escolano, Sergiu Oprea, Victor Villena-Martinez, and José García Rodríguez. A review on deep learning techniques applied to semantic segmentation. arXiv:1704.06857, 2017.

[23] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 39(12):2481–2495, 2017.

[24] Alexander Kirillov, Evgeny Levinkov, Bjoern Andres, Bogdan Savchynskyy, and Carsten Rother. InstanceCut: From edges to instances with multicut. In the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5008–5017, 2017.

[25] Jasper Uijlings, Koen van de Sande, Theo Gevers, and Arnold Smeulders. Selective search for object recognition. International Journal of Computer Vision (IJCV), 104:154–171, 2013.

[26] Chi C. Nguyen, Giang S. Tran, Thi P. Nghiem, Nhat Q. Doan, Damien Gratadour, Jean C. Burie, and Chi M. Luong. Towards real-time smile detection based on faster region convolutional neural network. In the 1st International Conference on Multimedia Analysis and Pattern Recognition (MAPR), pages 1–6, 2018.

[27] Renu Khandelwal. Computer vision: Instance segmentation with Mask R-CNN. https://towardsdatascience.com/computer-vision-instance-segmentation-with-mask-r-cnn-7983502fcad1, 2019. [Online; accessed April 26, 2020].

[28] Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming He, Bharath Hariharan, and Serge J. Belongie. Feature pyramid networks for object detection. In the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 936–944, 2017.

[29] Abhishek Dutta, Ankush Gupta, and Andrew Zisserman. VGG Image Annotator (VIA). http://www.robots.ox.ac.uk/~vgg/software/via/via-1.0.6.html, 2016. [Version: 1.0.6, Accessed: April 28, 2020].

[30] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.
