

Computational Engineering and Technical Physics
Computer Vision and Pattern Recognition

Ola Badreldeen Bdawy Mohamed

METRIC LEARNING BASED PATTERN MATCHING FOR SPECIES AGNOSTIC ANIMAL RE-IDENTIFICATION

Master’s Thesis

Examiners: Professor Heikki Kälviäinen
           Associate Prof. Tuomas Eerola
Supervisors: M.Sc. Ekaterina Nepovinnykh
             Associate Prof. Tuomas Eerola
             Professor Heikki Kälviäinen


Lappeenranta-Lahti University of Technology LUT
School of Engineering Science
Computational Engineering and Technical Physics
Computer Vision and Pattern Recognition

Ola Badreldeen Bdawy Mohamed

METRIC LEARNING BASED PATTERN MATCHING FOR SPECIES AGNOSTIC ANIMAL RE-IDENTIFICATION

Master’s Thesis 2021

50 pages, 34 figures, 1 table.

Examiners: Professor Heikki Kälviäinen
           Associate Prof. Tuomas Eerola

Keywords: re-identification, metric learning, camera traps, computer vision, pattern recognition

In the active effort to monitor and protect endangered animal species, modern technology is replacing conventional tracking techniques such as GPS collars and tagging, which are invasive in nature. Non-invasive technology such as camera traps collects a large amount of data remotely, enabling the use of computer vision techniques to perform the analysis, including the re-identification of animal individuals. The re-identification of animal individuals can be done by training a convolutional neural network to measure the similarity of fur patterns between images. This thesis extends the re-identification method by building a fully automated and species agnostic framework: given an image, the framework is capable of detecting the animal in the image and its distinguishing fur or pelage pattern. This pattern is then used to match the animal to the most similar individual in the dataset of known individuals. The framework achieved accuracies of up to 89.1% and 82.1% on zebra and giraffe datasets, respectively.


I would like to thank everyone who has had an impact on my educational career. First and foremost, I’d like to express my gratitude to my academic advisers, who have patiently led and supported me during this study process. Thank you for your unwavering support.

Second, I’d like to thank my family for their love and understanding. Without you, I would not have been able to complete this journey.

Lappeenranta, June 1, 2021

Ola Badreldeen Bdawy Mohamed


CONTENTS

1 INTRODUCTION
  1.1 Background
  1.2 Objectives and delimitations
  1.3 Structure of the thesis

2 CONVOLUTIONAL NEURAL NETWORKS AND METRIC LEARNING
  2.1 Building blocks and architecture
  2.2 Metric learning
  2.3 Training

3 ANIMAL RE-IDENTIFICATION
  3.1 Re-identification workflow
  3.2 Existing datasets
  3.3 Animal segmentation
  3.4 Pattern extraction and matching
  3.5 Saimaa ringed seal re-identification

4 SPECIES AGNOSTIC ANIMAL RE-IDENTIFICATION FRAMEWORK
  4.1 Pipeline
  4.2 Segmentation
  4.3 Pattern extraction
  4.4 Re-identification

5 EXPERIMENTS
  5.1 Data
    5.1.1 Zebra Dataset
    5.1.2 Giraffe Dataset
  5.2 Evaluation criteria
  5.3 Description of experiments
    5.3.1 Experiment A
    5.3.2 Experiment B
  5.4 Results
    5.4.1 Segmentation
    5.4.2 Pattern Extraction
    5.4.3 Re-identification

6 DISCUSSION
  6.1 Current study
  6.2 Future work

7 CONCLUSION

REFERENCES


LIST OF ABBREVIATIONS

3D Three Dimensional

AMWR Automatic Minke Whale Recognizer
ArcFace Additive Angular Margin Loss
ASPP Atrous Spatial Pyramid Pooling
CBC Cross-Border Cooperation
CNN Convolutional Neural Network
DBN Deep Belief Network
DCNN Deep Convolutional Neural Network
DNA Deoxyribonucleic Acid
FCN Fully Convolutional Network
FPN Feature Pyramid Network
FV Fisher Vector
GMM Gaussian Mixture Model
GPS Global Positioning System
HMM Hidden Markov Model
ICP Iterative Closest Point
KNN K-Nearest Neighbours
PCA Principal Component Analysis
RANSAC Random Sample Consensus
ResNet Residual Neural Network
SIFT Scale-invariant Feature Transform
TNN Triplet Neural Network
UEF University of Eastern Finland


1 INTRODUCTION

1.1 Background

Animal re-identification plays a large role in the study of population dynamics, which in turn contributes to the monitoring of wildlife populations and the conservation of endangered species. Until recently, biologists have been using manual re-identification techniques. Tagging, banding and Deoxyribonucleic Acid (DNA) analysis of follicles or feces have been used, and are capable of achieving high precision in re-identifying animal individuals [1]. However, re-identification techniques that require physical interaction add stress to the animals and result in behavioral changes. A less intrusive and more affordable way to monitor animals is to collect image data using camera traps, which generate a huge amount of data [1]. Computer vision techniques provide a useful tool to analyse such data.

Relying on the human eye to re-identify each animal from the data collected by the camera traps is not only time-consuming due to the large amount of data, but also prone to human error, since the task requires a highly qualified expert to perform the identification with high accuracy. These limitations can be overcome using machine learning methods, which are known to work better on large datasets unsuitable for manual processing [2].

In certain animal species, the skin, fur or feathers function like a human fingerprint, as these animals have a distinctive pattern that can be used to re-identify them with the aid of computer vision, as shown in Figure 1 [3].

Figure 1. Animal skin, fur and feather patterns from different species. [3]


There are two approaches to implementing animal re-identification with the aid of computer vision [4]. The first approach is pixel-based, where two images are compared pixel by pixel to decide the similarity. This approach is highly sensitive to variation in image quality, orientation and cropping, making it less fit for datasets produced by camera traps because of their inherent low quality.

The second approach is feature-based, where the algorithms focus on finding distinct characteristics of each animal such as spots or stripes. While taking the geometrical aspects into consideration, these characteristics are then used to find the most similar matches among the already known individuals in the database [2].

This thesis aims to generalize the existing method [5] for the re-identification of Saimaa ringed seals to other animal species which have a distinguishing fur, pelage or skin pattern. After detecting the animal in the image, the method extracts the distinguishing pelage pattern, which is then used to calculate the similarity between the animal in the query image and the known individuals in the database using metric learning [5]. The main parts of the framework are shown in Figure 2. The thesis is related to the CoExist project funded by the South-East Finland - Russia Cross-Border Cooperation (CBC) 2014-2020 program [6]. CoExist is a collaborative project between LUT and a number of other research institutes. LUT's lead partner in this project is the Saimaa ringed seal research group from the University of Eastern Finland (UEF) [7].

Figure 2. Re-identification framework: input (image) → preprocessing → segmentation → pattern extraction → patch extraction → computing similarities → matching → output (ID).

1.2 Objectives and delimitations

The goal of this research is to build a fully automated and species agnostic re-identification framework which, given an image, is capable of detecting or segmenting the animal in the image if present, and further utilizing the unique pattern of each animal to match it with the most similar known individual in the dataset.

The objectives of this research can be summarized as follows:

1. To survey and to collect publicly available photo-identification datasets for different animal species.

2. To generalize the existing Saimaa ringed seal identification method to the selected datasets.

3. To build a general framework that applies the method to any animal species with a distinctive fur, pelage or skin pattern.

This thesis is delimited to the re-identification of animals which are known to have a distinguishing pattern, such as zebras, giraffes, tigers and whale sharks, so the method should not be expected to re-identify all animal species.

1.3 Structure of the thesis

Chapter 2 introduces convolutional neural networks and metric learning, including the training of such networks. Chapter 3 focuses on related work on animal re-identification with emphasis on image-based identification methods: publicly available datasets for animal re-identification are introduced, and the methods commonly used in animal segmentation, pattern extraction and individual matching are explained. The chapter also introduces the Saimaa ringed seal re-identification method which this thesis aims to extend. Chapter 4 explains in detail the procedure to solve each step leading to animal re-identification. Chapter 5 describes the datasets used, the details of the experiments, and the results of the three main steps in animal re-identification. Finally, Chapters 6 and 7 give a general discussion of the methods and future work, and the conclusion, respectively.


2 CONVOLUTIONAL NEURAL NETWORKS AND METRIC LEARNING

2.1 Building blocks and architecture

In the past years, deep learning has proved to be a very powerful tool because of its ability to handle large amounts of data. Interest in deep architectures has surpassed traditional techniques, especially in pattern recognition. One of the most popular classes of deep neural networks is the Convolutional Neural Network (CNN).

CNNs are a class of neural networks commonly used to analyze visual data [8]. The CNN architecture consists of the input layer, the output layer, and the hidden layers: a set of interconnected convolutional layers, pooling layers and fully connected layers designed to learn spatial hierarchies of features through backpropagation [9]. The layers are composed of artificial neurons, mathematical functions that calculate the weighted sum of multiple inputs to give an output value, imitating the biological neurons found in the human brain. The hidden layers can be constructed as a connected combination of the following blocks [8]:

1. Convolutional layer, where a kernel is convolved over the input to extract features to be passed to the next layer.

2. Pooling layer, responsible for reducing the spatial size of the convolved feature. The two types of pooling are average pooling and max pooling: in average pooling, a kernel is moved over the input and the average of the pixels in that region is the pooled representative, while in max pooling the maximum value is chosen to represent the region of interest.

3. Fully connected layer, which generates probabilities for each class based on the outputs of the neurons.

The first convolutional layer usually extracts basic features such as horizontal or diagonal edges. This output is passed on to the next layer, which detects more complex features such as corners or combinations of edges. As the input traverses the network, the higher layers can identify even more complex features such as objects and faces. In Figure 3, a CNN used to classify handwritten digits is shown. The first part of the network consists of two convolutional layers, each followed by a max pooling layer, to learn the features; in the second part, two dense layers are used to classify the input [10].

Figure 3. An example of a typical CNN architecture. [10]
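To make the architecture concrete, the following is a minimal PyTorch sketch of a network of this type; the 28×28 single-channel input and the layer widths are illustrative assumptions, not values taken from [10]:

```python
import torch.nn as nn

# Feature-learning part: two convolutional layers, each followed by max pooling,
# then a classification part of two dense (fully connected) layers.
model = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                        # 28x28 -> 14x14
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                        # 14x14 -> 7x7
    nn.Flatten(),
    nn.Linear(64 * 7 * 7, 128), nn.ReLU(),  # first dense layer
    nn.Linear(128, 10),                     # class scores for the ten digits
)
```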

2.2 Metric learning

In animal re-identification tasks, working on a dataset where only a few images represent each animal individual is quite challenging. The available data is in most cases not enough to build a classification model which can accurately distinguish the animal individuals from each other. An alternative is the use of metric learning, where images are mapped to a metric space in which images from the same class tend to be closer while images from different classes are further apart. This follows from the assumption that the maximal intra-class distance is smaller than the minimal inter-class distance under a specific metric space [11].

Metric learning aims to measure the similarity among samples by using an optimal distance metric for the learning task. Metric learning methods which use a linear projection are limited in solving real-world problems demonstrating non-linear characteristics. Kernel approaches are utilized in metric learning to address this problem. In recent years, deep metric learning, which provides a better solution for nonlinear data through activation functions, has attracted researchers' attention in many different areas.

The idea of metric learning is to transform the data into a new metric space, where samples from the same class are closer together and are disjoint from samples of different classes. In Figure 4, metric learning is applied by using a Siamese network to learn the features and the Euclidean distance as the distance metric. To reach these results, several factors need to be considered, such as the choice of the base network, the loss function used to penalise the learning procedure, and the selection of samples [12].


Figure 4. Illustration of class separation using metric learning. [12]

2.3 Training

The network learns its task by comparing the output of the network with the expected output using a loss function. The loss function determines the residual between the output of the network and the expected output. The residual is then utilised by the gradient descent algorithm to determine how the weights are updated to bring the result closer to the expected output [13]. Training a neural network typically consists of two phases:

• A forward phase where the input is passed completely through the network, and

• A backward phase where gradients are backpropagated and the weights are updated.

These steps are done iteratively until the network learns its task. During the forward phase, each layer caches any data it needs for the backward phase, such as inputs and intermediate values. This means that any backward phase must be preceded by a corresponding forward phase. During the backward phase, each layer receives the gradient of the loss with respect to its outputs and returns the gradient of the loss with respect to its inputs.
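The two phases can be sketched in a few lines of PyTorch; this minimal example assumes the model from the previous sketch, a standard DataLoader named loader, and a classification loss (all assumptions for illustration):

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

for inputs, targets in loader:          # assumed torch DataLoader
    optimizer.zero_grad()
    outputs = model(inputs)             # forward phase: input traverses the network
    loss = loss_fn(outputs, targets)    # residual between output and expected output
    loss.backward()                     # backward phase: gradients are backpropagated
    optimizer.step()                    # gradient descent updates the weights
```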


3 ANIMAL RE-IDENTIFICATION

3.1 Re-identification workflow

An animal re-identification model is built with the purpose of finding a distinct identifier for each animal individual based on an image of the animal. The images are captured by different camera traps, meaning that the position and angle from which the images were taken cannot be assumed to be similar all the time. This issue needs to be taken into consideration when building the dataset by collecting images of the animal taken from different positions to cover all the possible patterns that might appear in one image, ensuring more reliable results [1].

In order to re-identify an animal individual based on an image, three main steps need to be followed. The first step is to detect the animal in the image and segment it from the background. The second step is feature extraction, which is done by extracting the pattern from the segmented animal's skin, fur or feathers. The third and final step is the search for a matching individual, where the extracted pattern is compared with the patterns of the known individuals to find the most accurate match [5]. Figure 5 illustrates the re-identification steps applied to a seal individual, where unsupervised segmentation and superpixel classification are used to segment the seal from the background [14].

Figure 5. Re-identification steps of Saimaa ringed seals: input image → unsupervised segmentation → superpixel classification → segmented image → post-processing → identification by matching. [14]


3.2 Existing datasets

Various publicly available datasets for animal re-identification exist to assist researchers in the development of accurate and robust algorithms.

The Whale Shark ID dataset [15] contains 7888 images of whale sharks (Rhincodon typus) taken over several years. Each image is annotated with a bounding box around the visible whale and two labels: one defines the pose of the whale and the other identifies the individual. Based on the unique spot patterning of whale sharks, 543 individuals were identified within the dataset using computer-assisted spot pattern recognition followed by manual review [16]. In Figure 6 [16], examples of the images available in the Whale Shark ID dataset are shown.

Figure 6. Examples of images from the Whale Shark ID dataset. [16]

The Great Zebra and Giraffe Count and ID dataset [17] contains 4,948 images of two species, the plains zebra (Equus quagga) and the Masai giraffe (Giraffa tippelskirchi), with the visible animals surrounded by bounding boxes and individual identifiers assigned. The images were taken by a group of scientists and photographers in 2015 at Nairobi National Park, and the individuals were identified using the HotSpotter algorithm [18]. It is also worth noting that the dataset contains a high number of animals which were spotted only once [17]. In Figure 7 [19], examples of the images available in the Great Zebra and Giraffe Count and ID dataset are shown.

Figure 7. Examples of images from the Great Zebra and Giraffe Count and ID dataset. [19]

The Amur Tiger Re-identification dataset [20] was collected from 10 different zoos in China. The dataset consists of around 8000 videos capturing 92 individual Amur tigers. The animals in the videos are marked with bounding boxes and pose keypoints, but only 40% of the 9500 bounding boxes are linked to an individual tiger ID [21]. In Figure 8 [20], examples of the images available in the Amur Tiger Re-identification dataset are shown.

Figure 8. Examples of images from the Amur Tiger Re-identification dataset. [20]

Timm et al. [22] attempted to tackle the problem of manually annotating the overwhelming amount of images generated by camera traps by automating the labeling procedure. As a result, the JaguarID dataset was built. The dataset contains 176 images of 16 jaguar individuals; the images were collected from various sources — camera traps in the wild, crowdsourcing in zoos, and some images from Flickr — with varying resolution levels, motion blur and two color spaces. In Figure 9 [22], an example of the images available in the dataset is shown.

Figure 9. Examples of images from the JaguarID dataset. [22]

3.3 Animal segmentation

Image segmentation is the process of dividing an image into smaller segments in an attempt to find a segment that contains only the desired object; the rest of the image is considered background and hence dropped. Convolutional neural network approaches have been able to achieve outstanding results in object segmentation, even in images taken by camera traps in the wild, where the environment changes radically and the image quality is inherently low.

Zhelezniakov et al. [23] proposed the use of unsupervised segmentation coupled with superpixel classification in order to segment the seal from the background, together with a simple identification method exploiting texture features. Figure 10 [23] shows an example of the unsupervised segmentation. Although additional preprocessing steps were added in [14] to enhance the segmentation outcome, the re-identification result was still not accurate enough to replace the existing solutions for Saimaa ringed seal re-identification.


Figure 10. Segmentation results (from left to right): the input image, unsupervised segmentation (superpixels), and the segmentation result. [23]

Convolutional Neural Network (CNN) based segmentation approaches have shown substantial improvement compared to previous approaches, which focused only on handcrafted representations.

Nepovinnykh et al. [5] used the state-of-the-art deep learning model DeepLab [24] to segment the Saimaa ringed seals from images. The segmentation is followed by two postprocessing steps: closing the holes in the pattern using a sliding-window convex hull and smoothing the borders using a Gaussian filter with thresholding. This ensures that the pelage pattern is fully covered by the seal segment and hence enhances the identification result. In Figure 11, examples of the segmentation of two Saimaa ringed seals are shown [5].

Figure 11. Examples of segmentation results. [5]

Konovalov et al. [25] used a Fully Convolutional Network (FCN) to detect an individual minke whale and localize the recognized unique features. In Figure 12, an example of the segmentation of a minke whale individual using the AMWR model is shown [25].

Figure 12. Example of AMWR per-pixel prediction for the MW1020 individual. Pixels with prediction heat-map values above 0.99 are illustrated by amplifying the corresponding image pixel intensities by a factor of 1.5. [25]

3.4 Pattern extraction and matching

Biometric re-identification methods rely on the natural discriminating features of each animal without the need to attach an external object to it, making them both effective and non-invasive. Pattern extraction is the step in which the image is down-sampled to features or vectors to be used in a more complex neural network in order to match the patterns.

Some researchers have relied on the full or partial shape of an animal to determine its identity. Yeleshetty et al. [26] developed a re-identification method for cows using the face of the cow as a discriminating feature. After registering the three-dimensional (3D) representation of the face to a specific pose, the cow is identified using the Iterative Closest Point (ICP) method, as shown in Figure 13 [26, 27]. The method is able to successfully identify 99.53% of the animals, although the dataset used to evaluate it was small, consisting of only 32 cows.


Figure 13. An illustration of the ICP based re-identification method. [26]

Kumar et al. [28] proposed a framework for the re-identification of individual cattle. The features used in this framework are extracted from the texture of an animal's muzzle and encoded using a Deep Belief Network (DBN) framework. The model achieved an accuracy of 98.99% on a dataset containing 90 images of 15 individuals. One downfall of this framework is that it requires a clear head shot, which in turn requires restricting the animal's movement. The discriminatory features (beads and ridges) are shown in Figure 14 [28].

Figure 14. The extracted features from the muzzle's texture. [28]

The most common type of features used to differentiate between animal individuals is based on the texture of the animal's skin, fur or feathers. Nepovinnykh et al. [5] used the Sato tubeness filter [29] to extract the pelage pattern of Saimaa ringed seals; a Siamese network then computes the similarities between the extracted pattern and the known individuals to identify the individual. Nipko et al. [2] used the Scale Invariant Feature Transform (SIFT) to identify, extract, and describe unique features in the image. Using SIFT gives the model the ability to analyse images taken in uncontrolled environments, for example images captured by camera traps.


The final step in the re-identification process is the matching of individuals by comparing the extracted patterns with the known patterns in the database. To identify individual jaguars from images, a set of matching rosettes across two images of jaguars is needed. The method proposed by Timm et al. [22] first detects corners, since jaguars tend to bend, resulting in unusual poses; the corners are then matched using SIFT descriptors. In the last step, Random Sample Consensus (RANSAC) is used to find groups of matching rosettes according to an affine transformation, and the number of matching groups represents the similarity score. This approach was able to achieve an accuracy of 91.5% on a dataset containing 176 images of 16 jaguar individuals. In Figure 15 [22], the matching of jaguar individuals across different poses is shown.

Figure 15. Jaguar ID: Matched SIFT descriptors across images using RANSAC to estimate an affine transformation. [22]

Nepovinnykh et al. [5] built an algorithm for the re-identification of Saimaa ringed seals based on patch similarities and topology-preserving projections (see Figure 16). The algorithm can be separated into three main steps: patch-similarity heatmap generation to select candidates of corresponding patches in the gallery image, candidate filtering using topology-preserving projections, and candidate ranking. The model outputs the top-5 most similar individuals from the database of known individuals. To measure the impact of using the extracted pattern instead of the full original image on the matching problem, the triplet network was trained twice, on the original images and on the extracted patterns separately. Upon testing the two trained networks, it was reported that using the extracted pattern increased the matching accuracy by up to 12%. The dataset used in this experiment consisted of around 100,000 well-segmented images of Saimaa ringed seals.


Figure 16. Saimaa ringed seal identification based on pelage pattern patches. [5]

Nipko et al. [2] used Wild-ID for re-identification, utilizing SIFT to overcome differences in scale, illumination and orientation between pictures. To determine the matching individual, a pairwise comparison of the shape and relative geometry of SIFT features between the images is performed and a score is assigned accordingly. Nipko et al. [2] built a second model based on the HotSpotter algorithm [18], where SIFT and the Local Naïve Bayes Nearest Neighbor algorithm [30] are combined to calculate the similarity score. HotSpotter scored better results than Wild-ID: HotSpotter selected a correct match as its top rank for 71–82% of the individuals, whereas the success rate for Wild-ID was 58–73%. The dataset used to compare the performance of the two methods contained 359 images of jaguars and 332 images of ocelots.

Nepovinnykh et al. [31] handled the problem of animal re-identification as a classification task, and transfer learning was used to train a model able to classify Saimaa ringed seal individuals. The model managed to achieve a high accuracy, but this accuracy is constrained by the availability of a large number of images for each individual. Judging from the currently available datasets, this is not an easy requirement to meet, since in most of these datasets only one sighting is available for a high percentage of the individuals.

Konovalov et al. [25] solved the re-identification problem using a Fully Convolutional Network (FCN) based model. The FCN-8s model is capable of detecting an individual minke whale and localizing the recognized unique features. The Automatic Minke Whale Recognizer (AMWR) model achieved a 93% accuracy in recognizing minke whale individuals on a dataset containing 1320 images of 76 whale individuals.

Non-image features have also been used. Clemins et al. [32] proposed a re-identification method based on an animal's vocal characteristics. They developed a model that identifies elephant individuals by voice using a hidden Markov model (HMM). The model achieved an accuracy of 82.5%, which is considerably low given the small number of individuals to be identified (only 7 elephants). The dataset contained 143 separate audio recordings.

3.5 Saimaa ringed seal re-identification

In the re-identification of Saimaa ringed seals, the pelage pattern of the fur has commonly been used as the differentiating factor, since it uniquely identifies each seal individual (see Figure 17). It is worth noting that seals have a non-rigid shape which results in a large number of possible poses; this needs to be taken into account when a matching algorithm for re-identification is constructed.

Figure 17. Examples of Saimaa ringed seals. [5]

Nepovinnykh et al. [5] proposed a full framework for Saimaa ringed seal re-identification. The data used was collected from camera traps, which introduces the problem that images of the same individual are often taken with the same pose, background and illumination; the model may then be distracted from learning the identifying pattern and end up learning, for example, the background of the image. This is how the authors justified the need for the segmentation step.

DeepLab [24] is a CNN based model that provides semantic segmentation of objects in an image by specifying a class for each pixel. The model consists of two parts: an encoder which encodes multi-scale contextual information using atrous convolution at multiple scales, and a decoder that gives accurate segmentation boundaries. In Figure 18 [24], an overview of the DeepLab architecture is shown. The use of Atrous Spatial Pyramid Pooling (ASPP) allows the model to robustly segment objects at multiple scales, and the localization of object boundaries was improved significantly by combining methods from deep CNNs and probabilistic graphical models. Nepovinnykh et al. [5] used the DeepLab model to equip the framework with a robust and precise segmentation of seals of different sizes.

Figure 18. DeepLab model architecture. [24]

Following the segmentation, two postprocessing steps were introduced: closing the holes in the pattern using a sliding-window convex hull and smoothing the borders using a Gaussian filter with thresholding. These two steps ensure that the pelage pattern is fully covered by the seal segment and hence enhance the identification result.

After segmenting the seal, the pelage pattern is extracted in order to reduce the amount of data the model needs to learn and to remove noise which might result from different weather conditions or variation in illumination. The pattern extraction algorithm is based on the Sato tubeness filter [29], which detects the continuous ridges of the pelage pattern. The pattern extraction is carried out in the following steps [5]:

1. Detect the continuous ridges of the pelage pattern using the Sato tubeness filter [29].

2. Sharpen the pattern using an unsharp mask.

3. Remove the segmentation border, which is falsely detected as part of the pattern by the filter.

4. Apply morphological opening using a disk to remove small artifacts from the image.

5. Brighten the image using adaptive histogram normalization.

6. Highlight the pattern using Otsu's thresholding, zeroing the pixels below the threshold.

7. Apply morphological opening again using a disk to remove small artifacts which might have resulted from the thresholding.

8. Apply a weaker unsharp mask to keep the pattern well defined.
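A minimal sketch of steps 1–2 and 4–8 using scikit-image is given below; the filter radii and disk sizes are illustrative assumptions, and step 3 is omitted because it depends on how the segmentation mask is represented:

```python
from skimage import exposure, filters, morphology

def extract_pattern(segmented_gray):
    """Approximate sketch of the Sato-filter pattern extraction pipeline of [5]."""
    ridges = filters.sato(segmented_gray)                     # 1: continuous ridges
    sharp = filters.unsharp_mask(ridges, radius=5, amount=2)  # 2: sharpen the pattern
    opened = morphology.opening(sharp, morphology.disk(2))    # 4: remove small artifacts
    opened = exposure.rescale_intensity(opened, out_range=(0.0, 1.0))
    bright = exposure.equalize_adapthist(opened)              # 5: adaptive histogram
    t = filters.threshold_otsu(bright)                        # 6: Otsu's threshold
    bright[bright < t] = 0                                    #    zero pixels below it
    opened = morphology.opening(bright, morphology.disk(2))   # 7: opening again
    return filters.unsharp_mask(opened, radius=5, amount=1)   # 8: weaker unsharp mask
```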

Figure 19 shows the pattern extraction steps, starting from the segmented seal as input and ending with a grayscale image containing the extracted pattern.

Figure 19. Visualization of pattern extraction result. First row: Steps 1–4 of the algorithm (from the left to the right). Second row: Steps 5–9 of the algorithm (from the left to the right). Third row: the source image (left) and the end result of pattern extraction (right). [5]

The Triplet Neural Network (TNN) [33] is an extension of the Siamese neural network, used to learn a similarity metric through a triplet loss function. The TNN architecture consists of two parts: a convolutional part which learns the features of the sample and a fully-connected part which encodes the feature vector. A TNN takes three samples as input: anchor, positive and negative. The anchor is the base sample, the positive is a sample of the same individual as the anchor, and the negative is a sample of a different individual. The goal of the network is to learn an embedding such that the distance between the encoding vectors of samples from the same individual is shorter than the distance between samples of different individuals. Assuming the first encoding is the worst case, where the encoding vector of the negative sample is closer to the anchor than that of the positive sample, the network keeps iterating until it learns to encode in such a way that the positive encoding is closer, as shown in Figure 20.

Figure 20. Triplet Neural Network iterations. [34]

Mathematically, the TNN reaches the right encoding by minimizing the triplet loss function, which calculates the distance between each pair of encoding vectors using the L2 metric. The model converges when the distance from the negative to the anchor exceeds the distance from the positive to the anchor by a margin m, as follows:

$$L_{\text{triplet}}(x_a, x_p, x_n) = \max\left(0,\; m + \lVert f(x_a) - f(x_p) \rVert_2^2 - \lVert f(x_a) - f(x_n) \rVert_2^2\right) \tag{1}$$
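Equation (1) translates directly into a few lines of PyTorch; a minimal sketch, where the margin value is an arbitrary illustration:

```python
import torch
import torch.nn.functional as F

def triplet_loss(f_a, f_p, f_n, m=0.5):
    """Eq. (1): squared L2 distances between anchor/positive/negative embeddings."""
    d_pos = (f_a - f_p).pow(2).sum(dim=1)  # ||f(x_a) - f(x_p)||_2^2
    d_neg = (f_a - f_n).pow(2).sum(dim=1)  # ||f(x_a) - f(x_n)||_2^2
    return F.relu(m + d_pos - d_neg).mean()
```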

Nepovinnykh et al. [5] use a TNN to find the corresponding matches of pattern patches from the dataset of known individuals. After extracting the pattern, the image is cropped and rescaled to a common size so that the scale is relatively similar across seals. The pattern is then divided into multiple overlapping patches with a fixed size of 160×160 pixels.

To train the TNN, three patches are passed to the network: an anchor as a base, a positive patch from the same individual as the anchor, and a negative patch from a different individual. To account for the variation resulting from the large number of possible poses the seal might take as the image is captured, seven rotated versions of each patch are passed to the convolutional part and the results are summed before being sent to the fully-connected part. The network encodes the features of each patch into a vector of fixed size 512, giving three encoding vectors corresponding to the three inputs. The distances between the pairs are calculated and the network keeps iterating


until the distance between the anchor and the positive is shorter by a pre-defined margin than the distance between the anchor and the negative [5]. The process is illustrated in Figure 21.

Figure 21. Triplet Neural Network training: the anchor, positive and negative patches each pass through the convolutional and fully-connected parts to produce 512-dimensional encoding vectors, from which the anchor–positive and anchor–negative distances are computed. [5]

The final step is the re-identification of the seal individual. Two methods based on patch comparison were used. In the first method, a simple K-Nearest Neighbour (KNN) classifier was used to identify the seal individual: each patch votes for the individual it belongs to, and the weighted sum of the votes decides the identity. This method was able to correctly identify 82.5% of individuals when providing the top-5 matches.

The second method is a heatmap-based, topologically-aware patch matching algorithm [5]. The method can be divided into three steps as follows:

1. Patch-similarity heatmap generation to select candidates of corresponding patches from the gallery image.

2. Filtering candidates using an angle-based method to preserve topological consistency.

3. Individual ranking.

In order to generate the heatmaps, the trained network is used to compute an encoding vector for each patch, and the L2 metric is used to find the distance between the patch in the query image and patches from the dataset. Local minima of the heatmap, which indicate patches with high similarity, are identified and marked as candidates [5]. In Figure 22, examples of the generated heatmaps are shown. Upon identifying the candidate patches, the variation from the expected location is quantified by measuring the angle, and the patch with the smallest variation is chosen as the corresponding patch. By calculating the variations of all the patches in an individual's pattern, the rank of this individual can be computed by averaging the variation of the filtered patches; individuals with the smallest average variation are ranked higher.

Figure 22. Examples of the region similarity heatmaps: the query image (left) and gallery image (right). Heatmaps for a query image highlight a single region that is being compared to the entire gallery image. Heatmaps for the gallery image show regions which are most similar to the highlighted region from the query image. [5]

The topology-aware algorithm was able to correctly identify 88.6% of individuals when providing the top-5 matches. In Figure 23, a visualization of the full framework is shown. While this level of accuracy is not enough for the framework to be used alone for reliable re-identification, the authors believe that providing the top-5 candidates will help the experts reach an accurate decision about the seal identity faster.

Chelak et al. [35] proposed a novel feature pooling approach that creates an embedding vector of the image by aggregating the local pattern features while taking into account their spatial distribution. The generated embeddings can then be used to re-identify Saimaa ringed seal individuals. The proposed method was able to achieve an accuracy of up to 86.54% when matching pattern patches.


Figure 23. Saimaa ringed seal re-identification framework: original image → segmentation → bounding box and grayscale → pattern extraction → patch extraction → triplet CNN → identification results. [5]


4 SPECIES AGNOSTIC ANIMAL RE-IDENTIFICATION FRAMEWORK

4.1 Pipeline

As discussed in Section 3.1, the re-identification model aims to find a distinct identifier for each individual based on an image of the animal, and must cope with images captured by different camera traps from varying positions and angles [1]. The proposed framework follows the same three main steps: the animal is first detected and segmented from the background, the pattern is then extracted from the segmented animal's skin, fur or feathers, and finally the extracted pattern is compared with the patterns of the known individuals to find the most accurate match [5]. In Figure 24, an illustration of the re-identification steps is shown.

Figure 24. Re-identification framework: input → segmentation → pattern extraction → patch extraction → feature extraction → image encoding → cosine distance computation → finding possible matches → output (IDs).


4.2 Segmentation

Since animals tend to live and travel in herds, images captured by camera traps can be expected to include multiple animals. As previously mentioned, any additional information in an image besides the animal individual to be identified might result in a model that learns something different than expected. This is why the first step of the re-identification task is to remove not only the background but also any other animal individuals that appear in the image.

Detectron2 [36], a library providing state-of-the-art models for object detection and instance/semantic segmentation, can be utilized to accurately segment the objects of interest. The available models have been pre-trained on a number of animal classes, making the library a suitable solution for the first step of this re-identification framework. The Mask R-CNN model is chosen to perform animal segmentation [37]. In Figure 25, the result of using a pretrained Mask R-CNN R50-FPN model on a scene is shown. Mask R-CNN is a baseline model which has proven to be both fast and accurate. The model is built on a ResNet network with 50 layers and uses a Feature Pyramid Network (FPN) to extract features, making the network robust to changes in scale. This combination gives the network an excellent gain in both accuracy and speed [37].

Figure 25. Example of results of instance segmentation using Mask R-CNN.
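A minimal sketch of this segmentation step with the Detectron2 API, assuming a COCO-pretrained Mask R-CNN R50-FPN from the model zoo and a placeholder image path:

```python
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5    # keep confident detections only

predictor = DefaultPredictor(cfg)
outputs = predictor(cv2.imread("camera_trap_image.jpg"))  # Detectron2 expects a BGR array
masks = outputs["instances"].pred_masks        # one boolean mask per detected animal
classes = outputs["instances"].pred_classes    # COCO class indices (e.g. zebra, giraffe)
```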


4.3 Pattern extraction

The method for extracting the pattern on an animal's fur differs slightly depending on the animal under study. For example, although Saimaa ringed seals are known for their low mobility, the large variability of poses and the non-rigid nature of their bodies require the use of the Sato tubeness filter [29] to extract a clear pattern. This is not always necessary: for animals with a rigid posture such as zebras, tigers or giraffes, the pattern extraction can be further simplified. For instance, the zebra pattern can be extracted by simple thresholding. Figure 26 shows an example of the quality of zebra pattern extraction using global thresholding [19].

Figure 26. Examples of animal pattern extraction.

For giraffes, a simple global thresholding method might not be enough to extract a high-quality pattern, as the fur color makes the pattern more sensitive to variation in illumination. Instead of a global threshold, an adaptive thresholding method followed by several postprocessing steps is proposed to overcome the illumination problem and give a final output with a clear pattern. The pattern extraction steps can be summarized as follows:

1. Adaptive thresholding: instead of a global threshold, the threshold value is calculated for small regions of the image as the weighted sum of neighbourhood values, where the weights form a Gaussian window; thresholding yields a binary image.

2. Dilation using a disk-shaped structuring element.

3. Erosion using a disk-shaped structuring element.


4. Edge detection using Roberts' algorithm [38] to obtain smooth edges.

5. Removal of segmentation borders, to later allow the model to focus on the actual pattern of the fur rather than the outline of the animal.

The result of the pattern extraction step is a grayscale image with a clear representation of the pattern; the pattern extraction steps are demonstrated visually in Figure 27 [19].


Figure 27. Pattern extraction steps applied to a giraffe: (a) input in RGB format; (b) adaptive thresholding; (c) dilation; (d) erosion; (e) edge detection using Roberts' algorithm; (f) removal of segmentation borders.
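A minimal OpenCV sketch of steps 1–4 above, assuming a segmented grayscale giraffe image at a placeholder path; the block size, offset and kernel size are illustrative assumptions:

```python
import cv2
import numpy as np

gray = cv2.imread("giraffe_segmented.png", cv2.IMREAD_GRAYSCALE)  # placeholder path

# 1: adaptive threshold with a Gaussian-weighted neighbourhood
binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                               cv2.THRESH_BINARY, blockSize=31, C=5)
disk = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
dilated = cv2.dilate(binary, disk)   # 2: dilation with a disk-shaped element
eroded = cv2.erode(dilated, disk)    # 3: erosion with the same element

# 4: Roberts cross kernels for edge detection
rx = np.array([[1, 0], [0, -1]], dtype=np.float32)
ry = np.array([[0, 1], [-1, 0]], dtype=np.float32)
f = eroded.astype(np.float32)
edges = np.abs(cv2.filter2D(f, -1, rx)) + np.abs(cv2.filter2D(f, -1, ry))
```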

4.4 Re-identification

Adopting a metric-based method for animal re-identification allows the re-identification step to be split into two phases. The first phase is the learning of discriminative features using a proper loss function. The second phase is the choice of a distance metric to match the individuals based on the distances between the features. In many recognition models, the spread of features produced by the loss function is quite large, causing the boundaries between classes to be blurred and making it difficult for the model to correctly classify images. Thus, finding a loss function that reduces this spread by pulling together images of the same individual while simultaneously pushing away images belonging to others is a crucial requirement for solving the re-identification task.

Softmax is the most commonly used loss function for recognition [39]. It takes a vector of logits and normalizes it into a probability distribution over the classes; the loss then measures the distance between the predicted distribution and the true distribution. Softmax does not explicitly optimise the feature embedding to enforce higher similarity for intra-class samples and diversity for inter-class samples, which might result in a performance gap when working with datasets with large intra-class appearance variation, for example pose variation or age gaps: the classes are clustered closer together, which hinders the performance of the model. Using a triplet loss to supervise the learning procedure can solve this problem, but it suffers from a combinatorial explosion in the number of triplets, especially on large-scale datasets, and it requires a carefully designed triplet mining procedure, which can be both time-consuming and performance-sensitive [11]. A prominent alternative is angular-based losses.

An alternative to the standard softmax is the angular margin loss proposed in [11]. The authors assume that the linear transformation matrix in the last fully connected layer can be used as a representation of the class centres in the angular space, and therefore penalise the angles between deep features and their corresponding weights in a multiplicative way. Unlike the softmax loss, the angular margin loss has a clear geometric interpretation: supervised by the A-Softmax loss, the learned features construct a discriminative angular distance metric that is equivalent to the geodesic distance on a hypersphere manifold [11], enabling CNNs to learn angularly distributed features. As can be seen in Figure 28, compared to the original softmax loss, the features learned by the modified softmax loss are angularly distributed but not necessarily more discriminative. By optimizing the A-Softmax loss, on the other hand, the decision regions become more separated, simultaneously enlarging the inter-class margin and compressing the intra-class angular distribution. Therefore, A-Softmax is more suitable for re-identification than the standard softmax.

Figure 28. Comparison among the softmax loss, the modified softmax loss and the A-Softmax loss. [11]


Another interesting loss function is the Additive Angular Margin Loss (ArcFace) [40], which has a clear geometric interpretation due to its exact correspondence to the geodesic distance on a hypersphere. ArcFace uses an additive angular margin to enforce intra-class compactness and inter-class diversity by penalising the target logit. It starts by calculating the angle between the current feature and the target weight using the arc-cosine function. It then adds an additive angular margin to the target angle, after which the target logit is obtained again using the cosine function. All logits are then re-scaled by a fixed feature norm, and in the last step the softmax function is applied to the resulting logits to get the categorical distribution. In Figure 29, the training of a Deep Convolutional Neural Network (DCNN) using the ArcFace loss is shown.

Figure 29. Training a DCNN for face recognition supervised by the ArcFace loss. [40]
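The margin computation can be sketched in a few lines of PyTorch; the scale s and margin m below are illustrative values, not the ones used in this thesis:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceHead(nn.Module):
    def __init__(self, feat_dim, num_classes, s=64.0, m=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.s, self.m = s, m

    def forward(self, features, labels):
        # cosine of the angle between normalised features and class centres
        cosine = F.linear(F.normalize(features), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # add the angular margin to the target-class angle only
        one_hot = F.one_hot(labels, num_classes=self.weight.shape[0]).float()
        logits = self.s * torch.cos(theta + self.m * one_hot)
        return F.cross_entropy(logits, labels)
```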

The advantages of using ArcFace loss can be summarised as follows [40]:

• ArcFace directly optimises the geodesic distance margin by virtue of the exact cor- respondence between the angle and arc in the normalised hypersphere.

• It is capable of achieving state-of-the-art performance on ten face recognition benchmarks including large-scale image and video datasets. It does not need to be combined with other loss functions in order to have stable performance, and it can easily converge on any training dataset.

• It only adds negligible computational complexity during training.

The second phase to be considered is the choice of a compatible distance function. Even though the Euclidean distance has been widely used to define margins, features learned by softmax-based losses have an intrinsic angular distribution; using the Euclidean distance is therefore not reasonable [11]. Instead, the cosine distance can be used to measure the distance between feature vectors computed using softmax-based losses. Algorithm 1 describes the steps of re-identification of animal individuals using a CNN and global pooling.

Algorithm 1: Re-identification using the full pattern
Input: Gray-level image I
Output: IDs of the top-n matches

1. Use the trained CNN to get a feature vector of the query image I.

2. Calculate the cosine distances between the feature vector of the query image and the feature vectors of the images in the gallery of known individuals.

3. Based on the calculated distances, the IDs of the top-n matches are presented.
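A minimal NumPy sketch of the matching steps 2–3 of Algorithm 1, assuming the gallery feature vectors and their IDs are already computed; ranking each individual by its best-matching gallery image is one reasonable reading of the ranking step:

```python
import numpy as np

def top_n_matches(query_vec, gallery_vecs, gallery_ids, n=5):
    """Rank gallery individuals by cosine distance to the query feature vector."""
    q = query_vec / np.linalg.norm(query_vec)
    g = gallery_vecs / np.linalg.norm(gallery_vecs, axis=1, keepdims=True)
    dist = 1.0 - g @ q                 # cosine distance to every gallery image
    ids = []
    for i in np.argsort(dist):         # smallest distance first
        if gallery_ids[i] not in ids:  # keep only the first hit per individual
            ids.append(gallery_ids[i])
        if len(ids) == n:
            break
    return ids
```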

Since CNNs are to some extent limited when dealing with images that have large variations in size and clutter, a Fisher Vector (FV) can be used instead to generate a more reliable encoding of the image [41]. The FV generates the encoding vector by aggregating local descriptors with a universal generative Gaussian Mixture Model (GMM) [41]. Algorithm 2 describes the steps of re-identification of animal individuals using a CNN and FV.

Algorithm 2: Re-identification using patches from the pattern
Input: Gray-level image I
Output: IDs of the top-n matches

1. Divide the query image I into overlapping patches.

2. For each patch get the corresponding encoding vector.

3. Estimate the GMM cluster parameters from the encoding vectors of the patches.

4. Compute the Fisher Vector representation of the query image I.

5. Calculate the cosine distances between the query image and each image in the gallery of known individuals.

6. Based on the calculated distances, the IDs of the top-n matches are presented.
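A simplified scikit-learn sketch of steps 3–4, using only the gradients with respect to the GMM means (a full Fisher Vector also includes gradients with respect to the covariances); the patch_encodings array and the number of components are assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(patch_features, gmm):
    """Simplified Fisher Vector: gradient w.r.t. the GMM means only."""
    q = gmm.predict_proba(patch_features)                # soft assignments (P, K)
    diff = (patch_features[:, None, :] - gmm.means_[None]) / np.sqrt(gmm.covariances_[None])
    fv = (q[..., None] * diff).sum(axis=0)               # accumulate per component
    fv /= patch_features.shape[0] * np.sqrt(gmm.weights_)[:, None]
    return fv.ravel()

# step 3: estimate the cluster parameters from the patch encodings
gmm = GaussianMixture(n_components=16, covariance_type="diag").fit(patch_encodings)
fv = fisher_vector(patch_encodings, gmm)                 # step 4: encode the image
```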


5 EXPERIMENTS

5.1 Data

5.1.1 Zebra Dataset

The Zebra dataset was constructed from the Great Zebra and Giraffe Count and ID dataset [19] to contain only images of the plains zebra (Equus quagga). It is worth noting that the original dataset contains a large number of animals which were spotted only once; therefore, individuals with a small number of sightings were dropped when constructing this dataset. The constructed dataset consisted of around 400 images of 94 zebra individuals.

54 individuals were used to train the model and 40 for testing. The testing data was split into two sets: the gallery set, which contains the pool of known individuals, and the query set, which contains images of animals to be re-identified by the model as one of the individuals in the gallery set.

5.1.2 Giraffe Dataset

The Giraffe dataset contains the second species found in the Great Zebra and Giraffe Count and ID dataset [19], the Masai giraffe (Giraffa tippelskirchi). Following the obvious requirement that the individuals included in a re-identification dataset have more than one image from the same view, the constructed Giraffe dataset consisted of around 300 images of 70 giraffe individuals.

5.2 Evaluation criteria

To evaluate the proposed method quantitatively, the accuracy metric is used. The accuracy is defined as the percentage of correct predictions on the test data. In evaluating a model which predicts the top-n matches, the prediction for a query image is considered correct if the correct ID appears at least once in the top-n matches presented by the model, and hence the accuracy can be calculated as

$$\text{Top-}n\ \text{accuracy} = \frac{1}{N} \sum_{i=1}^{N} \min\left(1,\ \sum_{j=1}^{n} p_{ij}\right) \tag{2}$$

$$p_{ij} = \begin{cases} 1 & \text{if } \mathrm{ID}_{ij} = q_i \\ 0 & \text{otherwise} \end{cases} \tag{3}$$

where N is the number of query images, ID_ij is the ID of the jth match for query image i, and q_i is the ID of query image i.
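Equations (2) and (3) amount to checking whether the correct ID appears among the first n ranked matches; a minimal sketch:

```python
def top_n_accuracy(ranked_ids, query_ids, n=5):
    """Eqs. (2)-(3): ranked_ids[i] is the ranked list of match IDs for query image i."""
    hits = sum(1 for preds, q in zip(ranked_ids, query_ids) if q in preds[:n])
    return hits / len(query_ids)
```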

5.3 Description of experiments

5.3.1 Experiment A

The main aim of the experiment was to learn discriminative features from the full pattern of an animal using the angular softmax loss to ultimately solve the problem of animal individual re-identification. The experiment was done on the Zebra dataset and the Giraffe dataset.

The first step was to prepare the dataset for training as follows:

• The Mask R-CNN model was applied to the raw images to segment the animals from the background.

• The bounding box of the non-zero regions in the image was used to resize the image and keep only the part showing the animal. This step was done to reduce the amount of data.

• Global thresholding and adaptive thresholding were used on images from the Zebra and Giraffe datasets, respectively, to extract the patterns.

• For each individual, the different sightings of that individual were stored in the same folder.

The second step in this experiment was to train a residual neural network with 20 layers to differentiate between animal individuals using the angular softmax loss. A network wrapper was added to apply global pooling to downsample the feature maps; the pooling layer ensures that the resulting vector is robust to changes in the position of the features in the input image. The input images were normalized with µ = 0.5 and σ = 1.

The third step was to test the model: the network takes the query image as input and outputs a feature vector of fixed size 512. This vector is used to find the identity of the individual by calculating the cosine distance between the query feature vector and each vector corresponding to an image in the gallery.

The top 5 individuals with the shortest distance from the query vector were then proposed as possible matches.

5.3.2 Experiment B

To solve the re-identification problem, instead of using the full pattern, patches representing the key features of the pattern were used to train the network. The network was trained to learn the discriminative features using the Additive Angular Margin Loss. The experiment was done on the Zebra dataset (Section 5.1.1).

The first step was to prepare the dataset for training as follows:

• The Mask R-CNN model was applied to the raw images to segment the animals from the background.

• The bounding box of the non-zero regions in the image was used to resize the image and keep only the part showing the animal. This step was done to reduce the amount of data.

• Global thresholding on each image was used to extract the patterns.

• For each individual, the different sightings of that individual were stored in the same folder.

The second step was key-point extraction. All the images in the training set were manually annotated to identify four key-points. Patches of different sizes were extracted around specific key-points: the size of the square patch was not constant, but was instead defined as a percentage of the overall area of each image.


The third step was the network training. To ensure fairness in comparing the two methods, the backbone network architecture is similar to the one used in Experiment A. The first difference is in the loss function: the Additive Angular Margin Loss is used instead of the angular softmax loss. The second difference is the use of a Fisher Vector to encode the image instead of the additional pooling layer.

The fourth step was to test the model; the steps go as follows:

• The query image was divided into overlapping patches with 85% overlap.

• The network was used to extract the features from each patch.

• A Gaussian mixture model was used to estimate the cluster parameters.

• The parameters are then used to compute the Fisher Vector encoding of the image.

• This vector is used to find the identity of the individual by calculating the cosine distance between the query feature vector and each vector corresponding to images in the gallery.

• The top-n individuals with the shortest distances from the query image are then proposed as possible matches.

The effects of two other parameters were tested in this experiment: first, reducing the dimensionality of the data using PCA; second, varying the percentage of overlap between the extracted patches.

5.4 Results

5.4.1 Segmentation

Example results of instance segmentation using Mask R-CNN model on the data images are shown in Figure 30. The model was able to get accurate segmentation masks on both zebra and giraffe datasets.


Figure 30. Example results of instance segmentation using Mask R-CNN implemented in the Detectron2 library.

5.4.2 Pattern Extraction

Using thresholding to extract the patterns resulted in extracted patterns of good quality. Global thresholding and adaptive thresholding were used on the animals in the zebra and giraffe datasets, respectively (see Figures 31 and 32).

Figure 31. Example result of pattern extraction on the zebra dataset. The first column shows the input image and the second column shows the extracted patterns.


Figure 32. Example result of pattern extraction on the giraffe dataset. The first column shows the input image and the second column shows the extracted patterns.

5.4.3 Re-identification

Both Experiment A and Experiment B were evaluated using the same evaluation criteria described in Section 5.2. The giraffe dataset was tested with Experiment A, where the angular softmax loss was used; the method achieved a Top-5 accuracy of 82.1%. The zebra dataset was tested with both methods. In Experiment A, the method achieved a Top-5 accuracy of 74.6%.

In search of a better result, the zebra dataset was also tested using Experiment B, whose settings produced a clear jump in accuracy. Although setting the patch overlap to 85% instead of 50% increases the computational cost, the corresponding increase in accuracy justifies it. Table 1 summarizes the results obtained from Experiments A and B on the two datasets.

The best results were obtained in Experiment B, using the ArcFace loss to learn the deep features. While PCA had only a minor effect on the overall accuracy, increasing the patch overlap percentage had a strong effect on the re-identification accuracy, raising it by 11%.


Table 1. Re-identification results using different methods and varying parameters. The overlap parameter specifies the percentage of overlap between consecutive patches.

Dataset  Loss                     PCA  Overlap  Top-1  Top-2  Top-3  Top-4  Top-5
Giraffe  Experiment A: A-softmax  -    -        66.0%  67.0%  71.4%  73.2%  82.1%
Zebra    Experiment A: A-softmax  -    -        60.2%  68.7%  69.8%  71.0%  74.6%
Zebra    Experiment B: ArcFace    Yes  50%      69.8%  74.7%  77.1%  79.5%  81.9%
Zebra    Experiment B: ArcFace    No   85%      75.9%  70.5%  83.1%  85.5%  87.9%
Zebra    Experiment B: ArcFace    Yes  85%      75.9%  80.7%  85.5%  87.9%  89.1%


Figure 33. Examples of re-identification results on the zebra dataset: query images (left) and top-5 matches (right). Correct matches (same individual) are highlighted in green and incorrect ones in red. The cases shown: (a) clear pattern; (b) a small portion of the pattern is lost; (c) animal partly occluded; (d) slight change in pose; (e) a large portion of the pattern is lost.


It is clear that the model learned both the pattern and the pose of the animal to differentiate between the individuals; this is a consequence of using a small dataset. A way around this is to perform the re-identification using body parts that are less affected by pose, for example the torso. The model was found to be robust to changes in the animal's pose and to partial occlusions; it fails mostly in cases where a large portion of the pattern is lost. For examples of the re-identification results on the zebra and giraffe datasets, see Figures 33 and 34.

Figure 34. Examples of re-identification results on the giraffe dataset: query images (left) and top-5 matches (right). Correct matches (same individual) are highlighted in green and incorrect ones in red.


6 DISCUSSION

6.1 Current study

Nowadays, a large number of animal species are in danger of extinction as a result of human actions. This calls for a counter-plan to help save and monitor animal populations in the wild without causing any change in their environment. Fortunately, with the rapid advancements in computer vision, a species agnostic framework for monitoring animals remotely can be built. The key to its success is the re-identification model.

In this work the framework consists of three main steps. The first step is segmenting the animal using the pre-trained Mask R-CNN model; the segmentation results are near perfect on both the Zebra and Giraffe datasets. The second step is pattern extraction using thresholding. For the Zebra dataset, global thresholding was used to obtain the pattern. For the Giraffe dataset, additional postprocessing steps were needed to find the final pattern: although adaptive thresholding was used to account for the varying illumination, the method was not fully successful, and some images were discarded because little or no pattern was visible at the end of the procedure. The third and final step is the re-identification of animal individuals. Two state-of-the-art losses, the angular softmax loss and the ArcFace loss, were used to learn a similarity metric that separates the different individuals. It is safe to say that both experimented methods gave promising results. The ArcFace-based method gave the best result, up to 89.1% Top-5 accuracy on the Zebra dataset, compared with the angular softmax based method, which achieved 74.6% Top-5 accuracy on the same dataset.

So far the framework has been tested on animals with rigid and non-rigid [35] forms, achieving accuracies in the range of 74–89%. Currently, the task of animal re-identification is performed manually by experts, which takes a considerable amount of time and effort.

Although one cannot yet say that this framework is ready to take over the role of the expert, it can certainly make the task much simpler and faster to complete by providing the expert with the Top-n possible matches for the individual in the query image.


6.2 Future work

From the obtained results it was clear that the method performed well when the trained model was matching images with a clear pattern. This motivates further work on the pattern extraction step in order to increase the overall accuracy.

Instead of using a simple thresholding method to obtain the fur or pelage pattern, a CNN-based method could be used to extract a higher-quality pattern [42]. In theory this should increase the accuracy significantly, since most of the cases where the model failed to recognize the individuals involved query images with a high percentage of the apparent pattern lost. First, however, the dataset needs to be annotated in order to train such a CNN.


7 CONCLUSION

The aim of the thesis was to generalize the re-identification framework and to test it on publicly available datasets. The proposed re-identification step was based on metric learning because the limited number of images available per individual prevents the use of classification methods. Two similar models were introduced, differing only in the loss function used to train the network. The methods were tested on the Zebra and Giraffe datasets.

In conclusion, the methods achieved good accuracy in the re-identification task on multiple datasets, in the range of 74–89%. Currently, experts have to perform this re-identification task manually, which requires a lot of time and effort. Although one cannot yet claim that this framework can replace the role of an expert, it can certainly make the job much simpler and faster to finish by providing the Top-n possible matches for the individual in the query image.


REFERENCES

[1] Stefan Schneider, Graham Taylor, Stefan Linquist, and Stefan Kremer. Past, present and future approaches using computer vision for animal re-identification from camera trap data. Methods in Ecology and Evolution, 10(4):461–470, 2019.

[2] Robert Nipko, Brogan Holcombe, and Marcella Kelly. Identifying individual jaguars and ocelots via pattern-recognition software: Comparing HotSpotter and Wild-ID. Wildlife Society Bulletin, 44, 2020.

[3] Animal Skin Cliparts. http://clipart-library.com/clipart/1892553.htm, 2020. [Online; accessed December 28, 2020].

[4] Maximilian Matthé, Marco Sannolo, Kristopher Winiarski, Annemarieke Spitzen-van der Sluijs, Daniel Goedbloed, Sebastian Steinfartz, and Ulrich Stachow. Comparison of photo-matching algorithms commonly used for photographic capture–recapture studies. Ecology and Evolution, 7(15):5861–5872, 2017.

[5] Ekaterina Nepovinnykh, Tuomas Eerola, and Heikki Kälviäinen. Siamese network based pelage pattern matching for ringed seal re-identification. In IEEE Winter Conference on Applications of Computer Vision (WACV) Workshops, March 2020.

[6] CoExist - Towards sustainable coexistence of seals and humans. http://www2.it.lut.fi/project/coexist/index.shtml. [Online; accessed January 31, 2021].

[7] Ringed Seal Research. https://sites.uef.fi/norppa/. [Online; accessed May 28, 2021].

[8] Maria Valueva, Nikolay Nagornov, Pavel Lyakhov, Georgii Valuev, and Nikolay Chervyakov. Application of the residue number system to reduce hardware costs of the convolutional neural network implementation. Mathematics and Computers in Simulation, 177:232–243, 2020.

[9] Vishwanath A. Sindagi and Vishal M. Patel. A survey of recent advances in CNN-based single image crowd counting and density estimation. Pattern Recognition Letters, 107:3–16, 2018.

[10] Akinori Hidaka and Takio Kurita. Consecutive dimensionality reduction by canonical correlation analysis for visualization of convolutional neural networks. In ISCIE International Symposium on Stochastic Systems Theory and Its Applications, volume 2017, pages 160–167, 2017.


[11] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. SphereFace: Deep hypersphere embedding for face recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 212–220, 2017.

[12] Mahmut Kaya and H. Bilge. Deep metric learning: A survey. Symmetry, 11:1066, 2019.

[13] Léon Bottou and Olivier Bousquet. The tradeoffs of large-scale learning. Optimization for Machine Learning, page 351, 2011.

[14] Tina Chehrsimin, Tuomas Eerola, Meeri Koivuniemi, Miina Auttila, Riikka Levänen, Marja Niemi, Mervi Kunnasranta, and Heikki Kälviäinen. Automatic individual identification of Saimaa ringed seals. IET Computer Vision, 12(2):146–152, 2018.

[15] Jason Holmberg, Brad Norman, and Zaven Arzoumanian. Estimating population size, structure, and residency time for whale sharks Rhincodon typus through collaborative photo-identification. Endangered Species Research, 7:39–53, 2009.

[16] Labeled Information Library of Alexandria: Biology and Conservation - Whale Shark ID. http://lila.science/datasets/whale-shark-id. [Online; accessed December 27, 2020].

[17] Jason Parham, Jonathan Crall, Charles Stewart, Tanya Berger-Wolf, and Daniel I. Rubenstein. Animal population censusing at scale with citizen science and photographic identification. In AAAI Spring Symposium - Technical Report, 2017.

[18] Jonathan Crall, Charles Stewart, Tanya Berger-Wolf, Daniel Rubenstein, and Siva Sundaresan. HotSpotter - patterned species instance recognition. In IEEE Workshop on Applications of Computer Vision (WACV), pages 230–237, 2013.

[19] Labeled Information Library of Alexandria: Biology and Conservation - Great Zebra and Giraffe Count and ID. http://lila.science/datasets/great-zebra-giraffe-id. [Online; accessed December 27, 2020].

[20] Shuyuan Li, Jianguo Li, Weiyao Lin, and Hanlin Tang. Amur tiger re-identification in the wild. arXiv preprint arXiv:1906.05586, 2019.

[21] Labeled Information Library of Alexandria: Biology and Conservation - Amur Tiger Re-identification. http://lila.science/datasets/atrw. [Online; accessed December 28, 2020].

[22] M. Timm, S. Maji, and T. Fuller. Large-scale ecological analyses of animals in the wild using computer vision. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1977–19772, 2018.
