CNN-based ringed seal pelage pattern extraction

(1)

Computational Engineering and Technical Physics Computer Vision and Pattern Recognition

Denis Zavialkin

CNN-BASED RINGED SEAL PELAGE PATTERN EXTRACTION

Master’s Thesis

Examiners: Professor Heikki Kälviäinen

Associate Professor Konstantin Nadolin Supervisors: M.Sc Ekaterina Nepovinnykh

D.Sc Tuomas Eerola

Professor Heikki Kälviäinen

(2)

Lappeenranta-Lahti University of Technology LUT School of Engineering Science

Computational Engineering and Technical Physics Computer Vision and Pattern Recognition

Denis Zavialkin

CNN-based ringed seal pelage pattern extraction

Master’s Thesis 2020

45 pages, 33 figures, 10 tables.

Examiners: Professor Heikki Kälviäinen

Associate Professor Konstantin Nadolin

Keywords: computer vision, image processing, pattern recognition, animal biometrics, pattern extraction, fur pattern, convolutional neural networks, Saimaa ringed seals

The topic of this thesis is inspired by the conservation efforts of Saimaa ringed seals, which are in danger of becoming extinct with no appropriate actions. The work aims to develop a fur pattern extraction framework to identify ringed seal individuals. This pattern is unique to each individual and as a result, it can be used for identification. In turn, the pelage pattern extraction algorithm is the key part of the identification and it enables, for example, seal counting and monitoring. The proposed seal pelage pattern extraction is based on UNet Convolutional Neural Network (CNN), which brings efficiency improvement to the developed pattern extractor compared to non-neural approaches. Also, the pipeline includes preprocessing stages such as tone mapping and cropping. More- over, this thesis contains a comparison of UNet to the other CNN-based method called DeepLab in terms of the stated challenge and an overview of sliding windows processing.

The proposed method showed Sørensen–Dice coefficient accuracy equals 0.55 vs 0.23 for the Sato filter based solution and 37% faster computation in the conducted test compared to the previously used non-neural solution in the re-identification pipeline.

(3)

I reckon this is a good place to remind yourself that you are you. This chapter is a possibility to go beyond the scope of a scientific language, style, and topic while being right inside it. I am happy to express acknowledgment to everybody who has been being with me this year, even at a distance. The energy of all my people directed towards me made it possible to realize the thing that is being finished by this thesis. Over the past eight months, I met the fifth season.

Ìíå êàæåòñÿ, ÷òî ýòî õîðîøåå ìåñòî, ÷òîáû íàïîìíèòü ñåáå, ÷òî òû ýòî òû. Ýòà ãëàâà - âîçìîæíîñòü âûéòè çà ðàìêè íàó÷íîãî ÿçûêà, ñòèëÿ è òåìû, íàõîäÿñü ïðÿìî âíóòðè íå¼. ß ñ÷àñòëèâ âûðàçèòü ñëîâà ïðèçíàòåëüíîñòè âñåì òåì, êòî áûë ñî ìíîé â òå÷åíèå ãîäà, äàæå íà ðàññòîÿíèè. Ýíåðãèÿ âñåõ ìîèõ ëþäåé, íàïðàâëåííàÿ â ìåíÿ, ñäåëàëà âîçìîæíûì âîïëîòèòü â æèçíü òî, ÷òî ñåé÷àñ çàâåðøàåò äàííàÿ ðàáîòà. Çà ïîñëåäíèå 8 ìåñÿöåâ ÿ âñòðåòèë ïÿòîå âðåìÿ ãîäà.

Minusta tuntuu, että nyt on hyvä hetki muistuttaa itseäsi siitä, että sinä olet sinä. Tämä kappale on mahdollisuus päästä pois tieteellisen kirjoittamisen kielestä, tyylistä ja ai- heesta, samanaikaisesti käyttäen niitä. Haluaisin ilmoittaa kiitollisuuteni kaikille niille, jotka ovat olleet kanssani tämän vuoden aikana, jopa etänä. Kaikkien teidän ihmisten energia on mahdollistanut tämän opinnäytetyön valmistumisen nyt. Viimeisen kahdeksan kuukauden aikana olen löytänyt viidennen vuodenajan.

Lappeenranta, May 25, 2020

Denis Zavialkin

(4)

LIST OF ABBREVIATIONS

CLAHE Contrast Limited Adaptive Histogram Equalization CNN Convolutional Neural Network

CRF Conditional Random Fields

CV Computer Vision

DCNN Deep Convolutional Neural Network DenseNet Densely Connected Convolutional Network FCNN Fully-Convolutional Neural Network PR Pattern Recognition

RBF Radial Basis Function SDC Sørensen–Dice Coefficien SPP Spatial Pyramid Pooling

SFTA Segmentation-Based Fractal Texture Analysis SVM Support Vector Machine

(6)

1 INTRODUCTION

1.1 Background

In contemporary history, there was a big amount of animal and plant species that have become extinct. The current population of Saimaa ringed seals is estimated a bit more than 400 [1] which means they are endangered according to International Union for Con- servation of Nature [2] having a high risk of extinction in the wild. The number of living seals is an indicator of the population state and its dynamics. Thus, scientists need a tool for seals re-identification to look after the population number tendency and conduct appropriate conservation actions.

It has been shown on the gray seal example that each seal has its non-repeating fur pattern [3]. This fact suggests that comparing pelage structure allows identifying already known individuals or distinguish different seals. Generally, seals are captured with the help of camera traps, which make images while they are lying on the coast. A common example of a seal image captured by a camera trap, as well as, a seal fur pattern is presented in Figure 1. Hoverer, such camera imaging produces the following connected challenges:

• Either weak or excessive illuminance.

• Low image quality that can be caused by a camera itself or an extremely high distance between a camera and a seal.

• Pattern might be hard to see because of dirt on a seal.

• Various positioning with respect to a camera may involve geometrical shape distortions.

Besides, scientists have to identify each seal manually, which is a time-consuming and difficult task. They label patterns with the help of graphical tools and compare them to each other just looking at the images. Such an approach can involve mistakes as hand labeling sometimes is not accurate and it is difficult to match each part of the pattern for every image to identify a seal. Thus, an approach that could help to overcome stated difficulties is needed. For example, the task of identification and pattern labeling might be automated with the application of Computer Vision (CV) methods.

(7)

This thesis focuses on automatic extraction of pattern, which should be superior in the following:

• Better time efficiency compared to manual labeling approach.

• More accurate results in pattern extraction and consequently in identification.

Figure 1.An example of a seal image captured by a camera trap.

The main concept of the developed algorithm is the application of Convolutional Neural Networks (CNN). This image processing type of networks allows analyzing the whole image and not separate pixels. That fact makes it stable for rotations or distortions described above while seeking the necessary objects. It is achieved by the application of the group of linear operators that consistently transform an image into a matrix where each element represent the class or the meaning of the corresponding pixel.

1.2 Objectives and delimitations

The goal of the pattern extraction is to receive a black and white binary image with the seal’s white spot contours separately from the other information. The example of the input and the output is shown in Figure 2 as well as the major stages of the suggested pipeline.

The thesis is a part of the CoExist [4] project that started with the purposes for seals identification. This identification framework is supposed to detect and to separate seals from the background, and then to extract their unique pattern and to identify them with

(8)

the help of the database of known seals. However, this thesis focuses on the extraction phase including the following objectives:

1. To collect the dataset of seal images and manually annotate it to train and evaluate the proposed method.

2. To develop an image preprocessing method to increase the accuracy of pelage pattern extraction.

Figure 2.Proposed method workflow.

Delimitations of the work are as follows:

• Seal detection, segmentation, and re-identification are not considered.

• The approach specializes in the Saimaa ringed seals only.

1.3 Structure of the thesis

In Chapter 2 of this thesis, general concepts of computer vision with application to animals are described. Works on the topic of animal biometrics for the past years are introduced in the same section. Next, Chapter 3 presents more specific approaches with neural network integration. Also, this part of the work is devoted more to the implementation and the inner structure of the similar segmentation pipelines and other connected frameworks suggested by other researchers. In its turn, Chapter 4 describes the developed pattern extraction technique with its details and exact way of implementation. Chapter 5 shows the results of the application and testing of the pattern extractor suggested in the thesis. The main results, answers to the stated research questions, and future plans are discussed in Chapter 6. Finally, Chapter 7 summarizes the main outcomes from the evaluation of the created fur pattern extraction algorithm.

(9)

2 ANIMAL BIOMETRICS

2.1 Automatic animal re-identification

The idea of animal identification is not novel. Human pursuit in certainty finds its incar- nation in aspiration to know everything exact. The exact number and animal identity are not an exception. Re-identification is no more than identification with the following comparison to the known list of the given instances. Depending on the species, the procedure may vary in its implementation.

For instance, Clemins et al. created an algorithm of elephant classification from the trum- peting sound record [5]. As it is mentioned, they said their method of audio analysis can allow specifying psychological or physiological state. An application of the hidden Markov chain helped them to reach an accuracy of 82.5% in identification which is quite unsatisfying. Besides, the proposed model requires extracted audio segments with the voice, which creates difficulties in the case of seal identification, as it is one extra stage for preparing the data compared to automatic photo traps.

Another method suggested by Mori et al. supposes using an improved technique of marking hedgehogs [6]. The results show that it allows receiving a radio signal successfully in a period of up to 9 months. This advanced tagging assumes increasing of layers for the colored tape, using certain spine areas, and reapplying the method for each time an animal was captured. Although this method is cheap and relatively easy to apply, it is the opposite of such big underwater animals like seals. Moreover, marking them might be considered as a not fully ethical way of identification.

It has been shown in [7] that a technique of pattern extraction may be successfully applied for identification. In this paper, Andrew et al. described a highly accurate approach for cow identification. The authors trained a Support Vector Machine (SVM) using radial basis function (RBF) kernels in order to teach the algorithm to remove extracted points with no information or those representing the tail or the head. The resulting matching process of meaningful features is shown in Figure 3 [7].

Bergamini et al. implemented a deep CNN (DCNN) for cattle re-identification [8] that uses two images: "frontal and one of the two sides". Performing the k-Nearest Neigh- bours (k-NN) made the classification easily tuned with the low number of parameters and unsupervised learning and highly scalable. Although it achieved accuracy in more than

(10)

80% there is a major drawback. The requirement of two images makes the approach weak in the case of the lack of data pluming overall performance that is shown in Figure 4.

Figure 3. Cow pattern feature matching [7].

Figure 4.Cattle identification: green is correct and red is misclassification. Modified from [8].

A paper by Brust et. al is devoted to the automatic identification of gorillas by the face [9].

An implementation of CNNs helped the authors to achieve 80.3% accuracy in the Top 5 set. They realized a pipeline that uses the YOLO model to find a facial region and AlexNet to extract its features. Finally, the features are classified by the SVM with the purpose of estimating the individual. The given algorithm is presented in Figure 5. The main outcomes from this work are that deep learning solutions might be highly efficient in biometrics and already pretrained models may sometimes be used with similar animal species.

(11)

Figure 5. Gorillas individual recognition pipeline [9].

2.2 Pelage pattern extraction

Taking into account that the human most developed information receiver is the eye, it is not surprising that we succeed in the image capturing and processing and information extracting. Now, it is one of the most reliable ways for data transferring and identification.

As an individual has its fingerprint, so other existing animals own specific features that differentiate each of them from the rest.

In their turn, Paterson et al. showed the extensive possibility of recognition of gray seal females with its pelage [3]. The authors declare that photo-based identity recognition is a suitable means for seal re-identification.

An example of pattern extraction is in Figure 6 [10]. The algorithm for cattle identification based on muzzle color K-means clustering was introduced. The authors describe pre-processing steps that include denoising and enhancement with the Contrast Limited Adaptive Histogram Equalization (CLAHE) algorithm. Similar to seals, it contains a unique structure that helps to identify an individual exactly.

Figure 6.Muzzle pattern extraction [10].

(12)

2.3 Saimaa ringed seal re-identification

The very first automated Saimaa ringed seal identification method was described in [11].

In this work, seals were segmented by application of unsupervised segmentation followed by the superpixel classification. Then an extracted seal image was decomposed on several binary images to calculate segmentation-based fractal texture analysis (SFTA) which is then utilized in a naive Bayesian classifier to match an individual to the known one.

Later, a more developed algorithm was introduced in [12]. Although the same segmentation workflow was implemented as in the previous paper, it was improved by the postprocessing and more advanced identification. For the identification phase, there were chosen and surveyed two existing methods: Wild-ID and HotSpotter. The results of the paper show that the HotSpotter outperforms the Wild-ID. The proposed workflow can be seen in Figure 7.

Figure 7.The method proposed in [12].

Then, Nepovinnykh et al. described two various approaches for seal identification which are based on CNN [13]. These pipelines were constructed with powerful well-known tools as the AlexNet CNN model and SVM in the role of a classifier. However, it still contains weak points. In spite of the fact that the best-achieved accuracy was declared to be around 91% it still can be improved. Generally, the stated accuracy was shown on the individuals that were represented by a bigger number of images.

Although segmentation played an important role in this paper, there is a big piece of un- necessary information that is still represented in images since the algorithm compares the

(13)

whole seals. For instance, the head and the tail are not desired to be present, as it gives no clue for identification. Besides, even parts of accidentally not segmented objects may appear in the image. Such examples are depicted in Figure 8 [13]. A consequent conclusion might be to additionally extract features that may help to identify seals carefully and at the same time to drop down the size of those features.

Figure 8. Extra information after segmentation: the left image contains the head and the tail, the right image contains the part of the background [13].

Later the same research group created a framework [14] which contains a separate pattern extraction stage compared to the solution in [13]. The main stages of the presented framework are shown in Figure 11. However, this thesis is aimed only at the pattern extraction phase in the given pipeline

Figure 9.Saimaa ringed seal re-identification algorithm [14].

(14)

The extraction is based on the Sato tubeness filter, which is suitable for continuous edges.

After the filter is applied, a sharping step is performed. Then, it was important to remove the borders left by the filter. This step in some cases may remove a part of a pattern too because if it is connected to the border, it is considered the same object. The example of faulty border removing is shown in Figure 10, there the upper part, which is bounded in red of the pattern, is missed. After removing the border, there is a morphologic opening with the noise removing purposes and one more Sato filter. Next, there are also applied histogram normalization, thresholding, an extra morphologic opening, and sharpening to highlight the obtained pattern. The workflow of the described extraction algorithm is in Figure 11.

Figure 10.Incorrect removing a part of the pattern that was connected to the border.

Figure 11. Saimaa ringed seal pattern extraction algorithm [14]. Two upper rows show the sequence from left to write as follows: original, Sato filter, sharpening, border removing, morphological opening and, Sato filter, histogram normalization, thresholding, morphologic opening, sharpening. The third row shows the original image and the result.

(15)

Furthermore, in this paper, the authors named the three most challenging tasks to be solved in Saimaa ringed seals pattern extraction. The fur pattern is subjected to a variety of conditions affecting the possibility of its successful extraction:

1. As the pattern is not uniform, it produces difficulties to extract all the parts that have different thicknesses and brightness.

2. The contrast between the pattern and other fur varies a lot depending on the individual and on the fact whether the pelage is dry or wet.

3. Camera traps are prone to have a low quality, which entails less information capturing.

Sometimes there were exceptional cases that might interrupt or totally confuse the classifier. Many of them occurred when algorithm segmented not the pattern but the whole fur structure as it is given below for original input and its processed output in Figure 12.

It is worth mentioning that the images were cropped to exclude meaningless parts of the information, which did not affect the result anyway.

Figure 12.Extreme missegmentation case. On the left is the original image and on the right is the automatic pattern extraction result.

With the aim to improve the pattern extraction in stated above weak points it might be a solution to introduce an algorithm that is based on neural networks. This approach showed a great performance in a variety of image processing tasks and challenges. Therefore, it might work with the Saimaa ringed seal fur pattern extraction.

(16)

3 CONVOLUTIONAL NEURAL NETWORKS FOR IM- AGE SEGMENTATION

3.1 Image segmentation

Image segmentation is a process where the goal is to separate an image on several regions by the given criteria. For instance, it could be a task of obtaining an image containing only the desired object neglecting the rest or the task of searching the boundaries of separate objects. This procedure might be done in automatic or manual ways. Manual implementation means labeling all the data by human and it takes a great amount of time to do such work. As the amount of information in the world increases exponentially, the second option, processing it by hands, becomes inefficient or even impossible. This is the reason for the development of an automatic image segmentation method.

Anjna and Kaur [15] defined the image segmentation as a tool for grouping image pixels into the objects with the same meaning and it serves for a big variety of real-life applications. By their claim, segmentation itself is performed by matching the attributes or features. The authors distinguish seven major image segmentation methods that are shown in Figure 13 [15].

Figure 13.Image segmentation methods [15].

(17)

Having numerous variations of approaches for image segmentation makes it difficult to choose the one. In order to decide about the method, it is useful to reduce the area of available options. In [16], the authors point out: "Semantic image segmentation is a key application in image processing and computer vision domain". Semantic segmentation is an automatic classification approach based on the meaning of the objects. That means they are separated corresponding to their semantic group. The example of semantic segmentation is demonstrated in Figure 14 [16].

Figure 14.Semantic image segmentation [16].

3.2 Convolutional neural networks

During the past years, neural networks have made a big step towards invading all areas of applied science. As it is the case with image segmentation. A variety of different methods for CNN-based image segmentation are presented in [17]. From this paper, it is clear that competition and development in this field are just about to start and each new approach becomes better than previous in accuracy, speed, or both. In conclusion, the authors infer that observed CNN-based methods are better in their performance compared to traditional approaches.

CNNs are a type of network appropriate for image analysis. In fact, CNN is a set of convolutional filters processing the image layer by layer where the final layer is the output with the "solution". The most commonly used layers are as follows:

1. Convolutional layers that transform each pixel by applying the given matrix (kernel) to it and its neighbors with the help of dot products. As a result, a new matrix is

(18)

created.

2. Activation layers are represented by non-linear functions that modify the response from a convolutional layer to help in feature selection and to decrease time spent on learning.

3. Pooling blocks that make the data smaller as it removes more general responses from the image that were learned by the previous layer. In other words, it decreases the array dimension by transforming a group of neuron responses into one, which is more representative.

4. A fully connected or a dense layer is the one that serves for connecting neurons that transforms a set of resulting matrices to some feature vector helping to make a decision. In a simple classification example, there is one fully connected layer at the end of the net, which creates a vector with probabilities of belonging to a certain class for each data sample.

Figure 15 shows the typical structure of a deep fully-convolutional neural network (FCNN) [17] which is a subtype of CNN. The FCNN implementation is a classical CNN that does not include any dense layers. That helps it to remove restrictions and solid requirements on the input image size. The output is a heatmap with the car density in the image.

Figure 15.Deep FCNN architecture modified from [17].

One of the remarkable CNN architectures for image segmentation is an encoder-decoder methodology. A network with the use of the encoder-decoder paradigm is presented in [18]. Its valuable concept is that for the challenging segmentation one needs first to collapse an image making it less and less. It is called downsampling. Such a procedure produces a small array that is helpful in classification purposes but not localization.

Therefore, when the decision is made then the recreating with the coming back to original size starts. During this phase, the information from the downsampling layer with the corresponding size is used, which is useful for accurate estimation of classes’ position

(19)

during the upsampling. It is achieved with bigger layers on each step. Figure 16 depicts the encoder-decoder pipeline using SegNet as an example.

Figure 16.SegNet architecture [18].

3.3 Semantic image segmentation

In 2017 a DCNN-based semantic segmentation architecture called DeepLab [19] was presented. As reported, this model mixes Spatial Pyramid Pooling (SPP) for enlarging

"the field-of-view of filters" and Fully-Connected Conditional Random Fields (CRF) to predict boundaries closer to the ground-truth. The SPP supposes parallel processing of variously scaled images on the same network structure. Such an approach allows avoiding cases when a network is well trained to find an object in one size but fails when it differs.

Fully-Connected CRF is an improvement of CRF that penalizes the node with the wrongly predicted labels and it takes into account color and position of pixels to set the penalty value. Example segmentation results are shown in Figure 17 [19].

Figure 17.DeepLab results [19].

(20)

Another model for semantic segmentation called UNet is described in [20]. It consists of a so-called contraction and expansion path that reminds the idea of the encoder-decoder architecture. The contraction path contains four stages of double 3x3 Conv and one 2x2 max pooling blocks each. The expansion path contains four steps of 2x2 Up-conv and two times of 3x3 Conv layers and one more with 1x1 conv instead of 2x2 Up-conv. The UNet incorporates various useful features that help it be faster and less demanding on the number of training samples.

The basic concepts applied in the UNet were augmentation and concatenation. The objec- tive of concatenation implementation is that it concatenates outputs and inputs of blocks in downsampling and to add it to the corresponding input in upscaling stages blocks. This approach gives more information for the localization, which is lost during the up convolution. Meanwhile, augmentation may serve to scale, rotate, or even corrupt training images with noise to generate more unique and difficult data. A common example of data augmentation is presented in Figure 18.

Figure 18.Augmentation example [21].

In [22] Jegou et al. claim that the majority of state-of-the-art semantic image segmentation methods are constructed of the same blocks. However, the recent development of Densely Connected Convolutional Networks (DenseNet) [22] brings new improvement to

(21)

this area. The main idea is to concatenate outputs and inputs of blocks in connect preced- ing layers to each layer inside the so-called Dense Block that are shown in Figure 19 [22]

as green rectangles. The sizes of the feature maps are equal within the same dense block so they can be concatenated with no effort. This causes the gathering of more and more channels at each layer inside a block. However, after each dense block transition blocks are placed. They serve to decrease the number of feature maps and change the size of those maps depending on whether it is transition up or down.

Figure 19.DenseNet architecture [22].

3.4 Segmentation of thin structures

In fact, thinking of the topic of this thesis as a task of animal or even just seal pattern extraction would be narrow. Compared to the size of seal, its circular fur inclusions might be considered as thin structures that are more native for biomedical machine vision. Thus, the idea of implementation might find its origin in other fields.

The software architecture for eye blood vessel segmentation has been described in [23].

The approach from this report repeats DenseNet and UNet ideas by combining the images

(22)

from input and output. In this case, it is made as bottom-top connections to give information for cleaning results in high-level side-outputs and top-bottom connection bring information for decreasing noise in low-level side-outputs. The suggested method is presented in Figure 20 [23] where the superposition of results from different layers corresponding to their weights is applied. Such a structure gave leadership to this technique in many data sets comparing F1-score and Matthews correlation coefficient for other state-of-the-art approaches.

Figure 20.BTS-DSN structure [23].

(23)

4 CNN FOR SEAL PATTERN EXTRACTION

This chapter proposes two ringed seal pattern extraction methods based on the concepts and architectures of DeepLab and UNet. These two referenced networks were chosen, as they seem to fit the pattern extraction task most. Both of the implementations use Python and its libraries: TensorFlow and Keras.

4.1 Pattern extraction pipeline

The proposed algorithm is typical for many neural network solutions. The workflow of the stated phases is shown in Figure 21. It starts with an input image and finished by the resulting binary image of the predicted pattern.

Figure 21.The main stages of the proposed pipeline.

(24)

Namely, it can be separated into three global steps that are responsible for certain pro- cesses:

• Preprocessing aims to make input seal images suitable for the network and tunes images to highlight features that are crucial for the successful task completion.

• Neural network-based feature selection performs the main part of the algorithm which is the extraction of the fur pattern.

• Postprocessing is responsible for adopting and modification of the network output to help in different future tasks such as visual evaluation or processing by the later stages of the re-identification.

4.2 Preprocessing

Special stages are required to feed the data to the network depending on their type. Those requirements include input size, different classes labeling, or image formats. Neverthe- less, many of the required changes do not affect the result a lot. However, tone mapping [24] has been shown to affect the segmentation result. There was a group of images where the brightness was shifted and as a result, the pattern color and tone were different from what the network expects. Application of the perceptual framework proposed by Mantiuk et al. corrected contrast in faulty images depicted in Figure 22. Finally, it al- lowed the network to see more details in the pattern even when it was learned on non-tone mapped images.

Figure 22. Tone mapping example: the original image (left) and the one corrected by the framework (right).

Besides, suggested preprocessing includes image cropping. The idea arose, as the majority of the image area contain no seal but the black background. The procedure lies in leaving the minimum possible rectangle containing a seal, its example can be seen in

(25)

Figure 23. This solution should potentially make the algorithm faster and less demanding in the sense of computational resources.

Figure 23.Cropping example: the original image (left) and the one cropped to a square (right).

Networks require images in certain sizes so they are to be scaled. There are two possible options:

• Sliding window recognition, which is the procedure of splitting an image into subimages, processing them separately, and fusing them back.

• Slight resizing preserving the initial width to height ratio, which is practically adding an extra background to achieve the desired size ratio and then resizing.

Although this approach seems less reliable, it should be faster compared to the previous one since it has to process fewer images.

4.3 DeepLab based approach

As was discussed before, the main feature of the DeepLab is atrous spatial pyramid pooling, which greatly helps in object separating. However, for the pattern extraction application, DeepLabv3+ [25] was used. In addition to the pyramid structure, the encoder- decoder paradigm was implemented in this version which is shown in Figure 24. The implementation requires images with the same width so any sample might be rescaled evenly and be used with the original size ratio. Also, it takes in RGB images which

(26)

is useful for more efficient undesired objects detection and exclusion such as grass and rocks.

Figure 24.DeepLabv3+ model. Modified from [25].

4.4 UNet based approach

One more option was to bring the UNet structure to the project. Previously it was shown that this network structure significantly outperforms other network types in thin objects’

segmentation. The network merges the information about the location from the downsampling path with the contextual information in the upsampling steps as shown in Figure 25 [20] to obtain general information, which is necessary to predict a good segmentation map.

(27)

Figure 25.UNet model [20].

Besides, such an effect has been achieved due to augmentation. Augmentation is a process when the input and desired example output are changed simultaneously making a new pair of data to learn. That specific process modification includes a positive outcome:

this architecture does not require a huge number of samples for learning. That fact is extremely convenient in terms of this thesis work as all the learning data is manually annotated and is not numerous. In addition, it is helpful not only in the case of lack of data but to confuse the network and make it watch different parameters. It can be explained as, for instance, rotation or compression of the image change it drastically so the forms or regularities learned before are no longer valid and the network should search for more common and general features.

4.5 Postprocessing

Since the input was modified during the preprocessing and extraction stages, it should be restored to have the same properties as it had before. Moreover, sometimes it is convenient to have images in a representation that can be visually examined and processed later with no matter by human or machine.

For these purposes, the network output images are inverted in the grayscale color space in the case of the UNet. In addition, what concerns sliding windows data, they were restored

(28)

using a morphological union, as the background is black, meaning zero values. In that case adding will result in the pattern to be presented if one of the windows contained it.

One more small but valuable postprocessing step is cropping. The crop is a cutting of an image to the rectangle containing the pattern. That is the way how image sizes both geometrical and disk allocated space are decreased. Such a simple solution brings faster working velocity and fewer resources consuming to the further re-identification stages.

(29)

5 EXPERIMENTS

5.1 Data

For this master thesis, a Saimaa ringed seal RGB image database subset, which consists of 520 images split by individuals and contains already segmented seals from the background, was used. The set was processed by the Sato filter based method described in [14]

which performs seal segmentation from the background. Then those images were divided into two sets for testing and training purposes, depending on how good the result of the certain image from the script was. The images with unsatisfactory extraction made by the Sato filter based method were taken to be testing samples whereas others represented a database for training. The sizes of the sets were 271 and 249 images for testing and training stages respectively.

Next, one of the most time-consuming parts comes. All training images were manually annotated by the author of this thesis in order to achieve better training results and quanti- tative comparison of the results. An example of an image from the dataset and its manual labeling is depicted in Figure 26. Annotation was made with the reference to original images using the graphical image manipulation tools. In other words, script-segmented pattern images from the training subset were corrected to be as much close to the pattern it should have been as possible. Also, ten random images from the test set were fully manually annotated to make a ground truth basis for calculating and comparing the performance in various experiments.

Figure 26. Example of the data used in the thesis: original image from the dataset (left) and manual annotation (right).

(30)

5.2 Evaluation criteria

In order to estimate the accuracy of the created pipeline two criteria were used. The first was the Jaccard index, which can be calculated as

J = |X∩Y|

|X∪Y| = |X∩Y|

|X|+|Y| − |X∩Y| (1)

where X and Y are the regions to be compared. The Jaccard index is a common approach in computer vision and data analysis applications for numerically measuring how similar two regions are. In terms of this thesis, notation|X|means the number of white pixels in X. The example of regions used in the evaluation shown in Figure 27.

(a) (b)

(c) (d)

Figure 27. Example of regions used in evaluation. (a) Predicted patternX; (b) Ground truthY; (c)X∩Y; (d)X∪Y.

Next was the F1 score, which is widely used for performance evaluation in computer

(31)

vision challenges. In this work, the Sørensen–Dice coefficient (SDC) variation of F1 score was applied that is defined as

SDC = 2|X∩Y|

|X|+|Y| (2)

where X and Y represent the areas, one checking for similarity. It differs from the Jaccard index, which takes into account true positive values only once in both the numerator and the denominator. SDC is a similarity coefficient that takes values from 0 to 1, which can be simply converted into percent values if it is needed.

5.3 Description of experiments

The main idea of the experiments was to take the set of ground truth images that are made manually and check the output corresponding to the evaluation criteria (Equations 1 and 2). Experiments differ in applied CNN architectures, their tune, and input used for training and testing. For each experiment in Table 1 a new training was conducted separately, which means learning results from different experiments do not affect each other.

Experiments were run on two systems: a remote server with NVIDIA Tesla V100 and a laptop with NVIDIA GeForce GTX 1050 Ti. Computational environment affects time spent for several stages only, that is why the results and their discussion are not focused on the equipment used in experiments.

Table 1.Experiments conducted during the Master’s thesis.

Net type Image size Main idea

1 DeepLab 512 Compare performances of the network to Sato method 2 DeepLab 256 Invert object of interest to look for a background 3 UNet 256 Compare performances of the network to Sato method 4 UNet 512 Compare performances of the network to Sato method 5 UNet 512 Using windows to process image parts separately 6 UNet 1024 Check if a bigger image size gives better results

7 UNet 512 Application of overlapping windows

8 UNet 512 Segmentation of windows scaled from 1024 to 512 square images 9 UNet 512 Comparison of tone tuned images to basic ones

10 UNet 256 & 512 Execution time measurement

(32)

5.4 Results

5.4.1 DeepLab

Although the DeepLab architecture was quite convenient in sense of input data requirements and output data form, it did not produce the results it was expected to. In Ex- periments 1 it riched a less average accuracy compared to the Sato filter based method (Table 2). The Jaccard index was approximately 60% less which is hardly satisfactory. A common example of the output from DeepLab based solution is presented in Figure 28.

Table 2.Accuracies comparison of images produced by DeepLab in Experiment 1.

Jaccard SDC

Image Sato DeepLab Sato DeepLab

Test image 1 0.374923 0.097826 0.545373 0.178218 Test image 2 0.538562 0.413014 0.700085 0.584585 Test image 3 0.278094 0.196883 0.435170 0.328993 Test image 4 0.333141 0.150223 0.499784 0.261207 Test image 5 0.256488 0.154484 0.408262 0.267624 Test image 6 0.236475 0.072657 0.382498 0.135471 Test image 7 0.391937 0.201268 0.563153 0.335093 Test image 8 0.361411 0.277803 0.530936 0.434813 Test image 9 0.244581 0.329778 0.393034 0.495990 Test image 10 0.225067 0.110158 0.367436 0.198455 Average 0.3240679 0.2004094 0.4825731 0.3220449

Figure 28.DeepLab results in Experiment 1: the input (left) and the separated pattern (right).

(33)

Experiment 2 was conducted because the rings in patterns found by the network were filled even if they should not be. Therefore, the decision was to try to segment not the pattern but the background. Results of this experiment are presented in Table 3 and Figure 29.

Table 3.Accuracies comparison of images produced by DeepLab in Experiment 2.

Jaccard SDC

Image Sato DeepLab Sato DeepLab

Figure 29.DeepLab results in Experiment 2: the input (left) and the separated background (right).

5.4.2 Unet

The method that was based on the UNet performed more accurately compared to the DeepLab based approach. It is clear from Table 4 and Table 5 which contain the exact

(34)

values of the accuracies calculated with respect to the ground truth images in Experiments 3 and 4 respectively. Figure 30 depicts one of the comparison sequence samples from the evaluation set where 512x512 image size was used.

Table 4.Accuracies comparison of images produced by UNet in Experiment 3.

Jaccard SDC

Image Sato UNet Sato UNet

Jaccard SDC

(35)

Figure 30.From left to right: ground truth, Sato-based method, proposed UNet-based method.

From the average accuracies in the last table, it was calculated that the accuracy gain was 47% for the Jaccard and 32% for the Sørensen–Dice criteria vs 36% and 30% growth in the experiment with 256-sized images. The values from Experiment 4 are counted reference ones for further experiments as the accuracy values for the Sato based method vary depending on the required image property transformation in the experiments.

Next, there are results obtained with an application of approach where each image was divided into windows and then restored to full image size (Experiment 5). Accordingly to Table 6, in this experiment, the increase in accuracy was 30% in Jaccard and 23% in SDC metrics compared to the Sato method. In addition to accuracy declining, the overall execution time rose because the number of images became bigger.

Jaccard SDC

(36)

Experiment 6 resulted in increasing the time and resources required for the learning stage, however, it mostly did not affect the accuracy.

Figures 31 and 32 present the example from Experiment 7 where overlapping windows had common areas to process. This approach was tested since it was noticed that sometimes a simple windows approach might lose patterns near the boundaries of the windows.

As Table 7 shows, such a way to solve the stated problem did not work as assumed. It shows the same accuracy percentage but with a higher load as the amount of image to process grew from 160 to 640.

Jaccard SDC

Figure 31.Splitting test image into windows.

(37)

Figure 32.Window processing result of the test image.

Experiments 8 supposes selecting bigger windows with further scaling to smaller ones.

The consequences of such an approach include lowering training time due to fewer images number compared to previous windows approaches. However, as it follows from Table 8, it produces less accurate results.

Experiments with windows usage give a reason to deny the idea about separating an image on subimages and processing it independently. The majority of the test images in these experiments lacked pattern intensity or even pattern itself. Such behavior might be explained by the fact that the network learns precise and specific features but not the general ones. As a result, if pattern size or scale changes, for instance, due to distance from the camera to a seal, the network does not recognize that fur pattern.

Jaccard SDC

(38)

Finally, Table 9 represents the accuracy achieved by the proposed pipeline (Experiment 9).

Comparison example images for Test image 5 is shown in Figure 33. This configuration showed 183% more accuracy in the Jaccard metric and an extra 140% in SDC.

Figure 33.From left to right: ground truth, Sato based method, proposed UNet-based method.

Table 9.Accuracies comparison of tone mapped images in Experiment 9.

Jaccard SDC

As a conclusion, the time measurement for the main experiments was conducted. During this experiment were used the ten test images already preprocessed for the extraction phase. Its outcome is presented in Table 10.

(39)

Table 10.Execution time spent for pattern extraction in seconds.

Method Sato UNet (256) UNet (512) UNet (512 windows) UNet (512 overlapping windows)

Number of images 10 10 10 160 640

Time 12.7 4.1 8.0 46.0 157.6

(40)

6 DISCUSSION

6.1 Current study

The thesis presents the pipeline aimed at Saimaa ringed seal pattern extraction. It utilizes the UNet segmentation model and includes preprocessing stages to improve accuracy.

Tone mapping demonstrated its investment to overall accuracy. In addition, the results of the thesis show the contribution of CNNs in pattern segmentation tasks. The achieved average SDC accuracy in 0.55 compared to the Sato filter based solution that gave 0.23 is an indicator of the reliability of the developed method that will be included in the Saimaa ringed seal re-identification algorithm. That is, the accuracy increase is0.55/0.23 = 2.39 times, which is 139% more than before for the SDC-based evaluation and 0.38/0.13 = 2.92times equal to the growth by 192% for the Jaccard metrics.

Although the DeepLab has shown drastically good results in semantic image segmentation described in [19], it did not work well in terms of the current task. The reason it fails the fur extraction challenge is that the network was constructed and pre-trained to find out simply-connected domains. In contrast, the pelage fur pattern is usually a multiply connected region. This fact might be proven by the recognition samples in Figure 28 where small toruses are filled inside as they have been circles.

In contrast, the UNet demonstrated good results even when it had been learning for a small number of epochs. Better preparation represented by preprocessing that consists of cropping and tone mapping helped to improve already robust performance. Surprisingly, but windows splitting did not produce expected accuracy as it was assumed. It seems, that such enlarging of a pattern does not help because a pattern is a continuous structure that should not be broken as in that way the network loses pixels’ context. Also, for 512 sized images, UNet based method showed that it has

t_Sato−t_{U N et} t_Sato

∗100% =

12.7−8

12

∗100% = 37% (3) less execution time which is crucial when working with big amounts of data.

(41)

6.2 Future work

As an aim for the future in order to improve performance more might be considered processing of bigger amount of images for learning. Such action may be extremely helpful if these extra images will be difficult to segment, as an example, images with grass or those where a seal was segmented not accurately. This aim is achievable because the Saimaa ringed seal dataset contains more images than what was used for the current work.

The main drawback is that this approach demands a lot of manual work, which is time- consuming.

Besides, this thesis does not incorporate any investigation on modification CNN layers.

Undoubtedly, better adoption of the network structure to the current need may have a positive effect on the overall accuracy. Possibly, there are or will be soon introduced more suitable state-of-the-art algorithms and solutions for image processing, which could help to improve the result achieved described in this work.

One more aspect to be solved in the future is an integration of the proposed solution into the re-identification framework. It is expected to bring better accuracy in matching individuals as well as cutting down the time spent on computations performing. In addition, as Saimaa ringed seals are not the only animals with such fur structure it seems reasonable to try to apply the proposed method onto other species. It first can be tested on some other kinds of seal and if it works in the same way then it probably might work with something like leopards or giraffes.

(42)

7 CONCLUSION

In this thesis, the fur pattern extraction task in the area of animal biometrics was considered. Also, the review of connected topics and approaches was made to find the ideas and direction for the development of the pattern extractor. Comparison, combination, and improvement of state-of-the-art approaches such as UNet and DeepLab were made in order to perform the task.

The proposed algorithm includes three stages: preprocessing, segmentation, postprocessing. Preprocessing incorporates two main ideas: removing of undesired regions and high- lighting the target object. These ideas were realized by the application of scaling and rearranging of intensity. Segmentation is served by the CNN based on the UNet architecture. It takes in an image of a seal segmented from the background and produces a binary mask of the pattern is represented in black and the rest is white. Finally, color inversion and one more cropping stage represent a postprocessing stage.

It has been shown that the proposed method outperforms previously used the Sato filter- based method. The accuracy that was estimated by the Sørensen–Dice coefficient in- creased by 139% and by 192% for the Jaccard index. Estimation was held on ten difficult to segment images that contain either grass, complicated luminance distribution, or low quality of the image. Finally, this method speeds up the extraction phase by 37% due to the neural network-based architecture.

(43)

REFERENCES

[1] Saimaa ringed seal. https://wwf.fi/en/saimaa-ringed-seal/. Ac- cessed: 2020-01-27.

[2] Tero Sipilä. Pusa hispida ssp. saimensis, saimaa seal, 2015.

[3] William Paterson, Paula Redman, Lex Hiby, Simon Moss, Ailsa Hall, and Patrick Pomeroy. Pup to adult photo-id: Evidence of pelage stability in gray seals. Marine Mammal Science, 29:537–541, 2013.

[4] Coexist project. http://www2.it.lut.fi/project/coexist/index.

shtml. Accessed: 2020-05-08.

[5] Patrick Clemins, Michael Johnson, Kirsten Leong, and Anne Savage. Automatic classification and speaker idenfication of african elephant (loxodonta africana) vo- calizations. The Journal of the Acoustical Society of America, 117:956–63, 2005.

[6] Emiliano Mori, Mattia Menchetti, Sandro Bertolino, Giuseppe Mazza, and Leonardo Ancillotto. Reappraisal of an old cheap method for marking the european hedgehog. Mammal Research, 2015.

[7] Andrew William, Hannuna Sion, Campbell Neill, and Burghardt Tilo. Automatic individual holstein friesian cattle identification via selective local coat pattern matching in rgb-d imagery. InIEEE International Conference on Image Processing (ICIP), pages 484–488, 2016.

[8] Luca Bergamini, Angelo Porrello, Andrea Dondona, Ercole Negro, Mauro Matti- oli, Nicola D’Alterio, and Simone Calderara. Multi-views embedding for cattle re-identification. In 14th International Conference on Signal-Image Technology Internet-Based Systems (SITIS), pages 184–191, 2018.

[9] Clemens Brust, Tilo Burghardt, Milou Groenenberg, Christoph Kading, Hjalmar S. Kühl, Marie L. Manguette, and Joachim Denzler. Towards automated visual monitoring of individual gorillas in the wild. InIEEE International Conference on Computer Vision Workshops (ICCVW), pages 2820–2830, 2017.

[10] Santosh Kumar, Sanjay Singh, Ravi Singh, Amit Singh, and Dr. Shrikant Tiwari.

Real-time recognition of cattle using animal biometrics.Journal of Real-Time Image Processing, 13:1–22, 2016.

[11] Artem Zhelezniakov, Tuomas Eerola, Meeri Koivuniemi, Miina Auttila, Riikka Lev- änen, Marja Niemi, Mervi Kunnasranta, and Heikki Kälviäinen. Segmentation of

(44)

saimaa ringed seals for identification purposes. InAdvances in Visual Computing, pages 227–236, Cham, 2015. Springer International Publishing.

[12] Tina Chehrsimin, Tuomas Eerola, Meeri Koivuniemi, Miina Auttila, Riikka Levä- nen, Marja Niemi, Mervi Kunnasranta, and Heikki Kälviäinen. Automatic individual identification of saimaa ringed seals. IET Computer Vision, 12(2):146–152, 2018.

[13] Ekaterina Nepovinnykh, Tuomas Eerola, Heikki Kälviäinen, and Gleb Radchenko.

Identification of saimaa ringed seal individuals using transfer learning. InAdvanced Concepts for Intelligent Vision Systems, pages 211–222. Springer International Pub- lishing, 2018.

[14] Ekaterina Nepovinnykh, Tuomas Eerola, and Heikki Kälviäinen. Siamese network based pelage pattern matching for ringed seal re-identification. InWinter Conference on Applications of Computer Vision Workshop, 2020.

[15] Er. Anjna Anjna and Er.Rajandeep Kaur. Review of image segmentation technique.

In International Journal of Advanced Research in Computer Science, volume 8, pages 36–39, 2017.

[16] Xiaolong Liu, Zhidong Deng, and Yuhan Yang. Recent progress in semantic image segmentation. Artificial Intelligence Review, 52(2):1089–1106, 2019.

[17] Vishwanath A. Sindagi and Vishal M. Patel. A survey of recent advances in cnn- based single image crowd counting and density estimation. Pattern Recognition Letters, 107:3 – 16, 2018.

[18] V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pat- tern Analysis and Machine Intelligence, 39(12):2481–2495, 2017.

[19] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Se- mantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelli- gence, 40(4):834–848, 2018.

[20] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), pages 234–241, Cham, 2015. Springer International Publishing.

[21] Data augmentation for deep learning. https://towardsdatascience.

com/data-augmentation-for-deep-learning-4fe21d1a4eb9. Ac- cessed: 2020-05-10.

(45)

[22] Simon Jegou, Michal Drozdzal, David Vazquez, Adriana Romero, and Yoshua Ben- gio. The one hundred layers tiramisu: Fully convolutional densenets for semantic segmentation. InConference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2017.

[23] K. Hu, Z. Zhang, X. Niu, Y. Zhang, C. Cao, F. Xiao, and X. Gao. Retinal vessel segmentation of color fundus images using multiscale convolutional neural network with an improved cross-entropy loss function. Neurocomputing, 309(2):179–191, 2018.

[24] Rafal Mantiuk, Karol Myszkowski, and Hans-Peter Seidel. A perceptual framework for contrast processing of high dynamic range images. ACM Transactions on Applied Perception, 3(3):286–308, 2006.

[25] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. InEuropean Conference on Computer Vision (ECCV), pages 833–851, Cham, 2018. Springer International Publishing.

CNN-based ringed seal pelage pattern extraction