
LARGE SCALE IMAGE RETRIEVAL OF SMALL OBJECTS

Faculty of Information Technology and Communication Sciences
Master of Science Thesis
April 2020


ABSTRACT

Jesper Granat: Large Scale Image Retrieval of Small Objects
Master of Science Thesis
Tampere University
Information Technology
April 2020

Image retrieval is a classic problem in computer vision. The goal is to find relevant images from a database given an example of a relevant image. This thesis focuses on a specific subset of image retrieval where the objects in the image can be small, which makes retrieval difficult due to background clutter. The most widely used methods in the literature are developed for images of buildings that cover most of the image, making the retrieval task easier than retrieving small objects against varying backgrounds. Due to the small size of the object, its retrieval with existing methods can be difficult. Additionally, most image retrieval benchmarks use manually annotated bounding boxes for the query object.

This problem is approached with object detection. More specifically, since no additional training data is desired, weakly supervised object detection is used to determine the regions of the image in which the retrieval object lies. This means that the same training data can be used for training both the object detection and the retrieval phases. Additionally, no extra annotation is needed, as everything can be done with image-level labels.

The empirical results of the method are studied from two points of view: first, to what extent small object retrieval can be improved by locating the object in images before retrieval, and second, to what extent the manual annotation used in the most common benchmarks can be automated with weakly supervised object detection. The results show that weakly supervised object detection slightly improves the image retrieval results for small objects seen from similar viewpoints, but does not significantly improve results for large objects. On the other hand, using an object detector is beneficial compared with using the whole image whenever cropping is beneficial at all.

Keywords: Image retrieval, weakly supervised object detection, feature extraction, deep neural networks

The originality of this thesis has been checked using the Turnitin OriginalityCheck service.


TIIVISTELMÄ

Jesper Granat: Suuren mittakaavan kuvahaku pienille kappaleille (Large Scale Image Retrieval of Small Objects)
Master of Science Thesis
Tampere University
Information Technology
April 2020

Image retrieval is a classic problem in computer vision. The goal of image retrieval is to find relevant images from a database based on an example image of the desired object. This thesis focuses on a subset of image retrieval in which the objects in the image can be small, making retrieval difficult due to background clutter. The most widely used methods in the literature use images of buildings, which make retrieval easier compared with retrieving small objects against cluttered backgrounds. Due to the small size of the object, retrieving it with current methods can be difficult.

In addition, most image retrieval evaluation protocols use manually annotated regions in the example image to delimit the object and to ease the retrieval.

This problem is approached with object detection. Weakly supervised object detection is used to determine the regions of the image in which the sought object lies, since no additional training material is desired. This means that the same training material can be used to train both the object detection and the image retrieval. In addition, no other annotations are needed, since all training can be done with image-level annotations.

The empirical results of the method are studied from two points of view. First, it is studied whether the retrieval of small objects can be improved by locating the sought object in the example image or in the database images before attempting retrieval. The second point of view is whether the commonly used evaluation protocol can be automated by locating the sought object in the image with object detection. The results show that weakly supervised object detection slightly improves the results for small objects photographed from the same viewpoint, but does not significantly improve the retrieval results for large objects. On the other hand, using object detection is beneficial compared with using the whole image in retrieval whenever cropping the image is beneficial at all.

Keywords: image retrieval, weakly supervised object detection, feature extraction, deep neural networks

The originality of this publication has been checked using the Turnitin OriginalityCheck service.


PREFACE

The topic of this thesis was decided together with my employer, namely Wapice Ltd. The goal was to investigate methods and literature on image retrieval that could be utilized in a wide array of computer vision applications. An especially interesting case was retrieving small objects on cluttered backgrounds from large databases, which ended up being the topic of the thesis almost word for word.

I would like to thank Ilari Kampman from Wapice for helping me refine the topic and for supporting me in the research and implementation processes. I am grateful to my colleagues at Wapice for giving me feedback and pointers that immensely helped me with this thesis. I am equally thankful to my supervisor Esa Rahtu for providing me with generous guidance throughout the thesis and sharing his passion for computer vision.

Tampere, 7th April 2020 Jesper Granat


CONTENTS

1 Introduction . . . 1

2 Background . . . 3

2.1 Supervised Machine Learning . . . 3

2.2 Convolutional Neural Networks . . . 4

2.3 Feature Extraction . . . 11

2.4 Feature Aggregation . . . 14

2.5 Image Retrieval . . . 14

2.6 Localization . . . 16

3 Related Work . . . 18

3.1 Features for Image Retrieval . . . 18

3.2 Regional Search in Image Retrieval . . . 22

3.3 Post-processing in Image Retrieval . . . 25

4 Proposed Method . . . 28

4.1 Weakly Supervised Object Detection . . . 30

4.2 Feature Extraction and Image Retrieval . . . 32

4.3 Post-processing . . . 34

4.4 Implementation Details . . . 35

5 Experiments . . . 37

5.1 Datasets . . . 37

5.2 Model training . . . 39

5.3 Metrics . . . 40

5.4 Quantitative Experiments . . . 42

5.5 Qualitative Experiments . . . 46

6 Conclusion . . . 50

References . . . 52


LIST OF FIGURES

1.1 Potential retrieval results for different queries. The top row shows how the current retrieval methods perform. The second row shows a potential retrieval result with a current method. The last row shows the correct retrieval results and the goal of this thesis. . . 1

2.1 Fully connected and convolutional neural networks side by side [25]. In a fully connected network, every neuron is connected to every neuron in the next layer. Conversely, in a CNN, there is a fixed number of connections and that number depends on the kernel size. . . 4

2.2 Convolutional block [25]. Each convolutional kernel slides over the input layer and outputs the result of the convolution to the corresponding place on the output convolutional block. The process is repeated for every input layer. . . 6

2.3 ResNet-18 architecture [63, Ch. 7]. The ResNet-50 architecture follows this same idea but is deeper. . . 7

2.4 Residual block for ResNet architectures [63, Ch. 7]. The left block is a regular block and the right is a residual block. The residual block shows the residuality of the network, where the input to the block is again connected to the output. . . 10

2.5 Overview of a generic image retrieval pipeline. The service provider initially has a group of unorganized images, from which features are extracted. The user can then search from this database with an example image. Features are extracted from the example image similarly, and the features in the database that are closest to the example image correspond to the retrieved images. . . 15

2.6 Example of CAM-based weakly supervised object detection. Detection is based on convolutional activations, which are represented by colors. The bounding box is also visualized. . . 17

3.1 Example of a pipeline where regional search is used in image retrieval [53]. Yellow rectangles denote the detected area and blue stars denote local feature locations. . . 23

4.1 The main innovation behind the proposed method. The goal is to limit the extent of the image from which features are extracted. This is done using weakly supervised object detection to localize the object in the image first, then extract features. . . 28

4.2 The overview of the proposed method pipeline. The descriptor for each image is constructed by first detecting a bounding box for the object and cropping the image. Then, a recent feature extraction pipeline of convolutional neural networks, pooling, normalization, and whitening is used. The bottom part of the image shows the two training methods for this method. The weakly supervised object detection uses a classification loss on a regular classifier network and the feature extraction uses Siamese learning and ranking loss. . . 29

4.3 ADL module [6]. The dropout is controlled by the attention to spread the heatmap over the whole object. This layer is put in multiple positions in the CNN. . . 30

4.4 Bounding box generation. The heatmap is thresholded, contours found, and a bounding box is drawn around the largest contour. . . 32

4.5 Alternative activation maps for other classes. As can be seen, these can be used to find different bounding boxes. . . 32

4.6 Radenovic feature extraction and image retrieval pipeline [42]. The upper part shows the generic pipeline of the method and the lower part shows the method during training. . . 33

5.1 All datasets used in this thesis. The column on the left presents the dataset name and all successive columns present information about the retrieval dataset. The example images are from the respective retrieval datasets. . . 38

5.2 Feature extractor learning curves for Instre and Retrieval-SfM datasets. The loss is shown on the vertical axis and the step on the horizontal axis. Instre converges faster than Retrieval-SfM. . . 41

5.3 The effect of the bounding box size on retrieval performance. If the bounding box is too small, the retrieval performance starts to decline. These curves can be used to estimate the minimum percentage of the image that needs to be covered in order for retrieval to be successful. . . 43

5.4 The effect of the CAM threshold on coverage and background. These measure how much of the ground truth bounding box is covered and how much background has been removed, respectively. The numbers along the curves denote the CAM thresholds. . . 44

5.5 A t-SNE visualization of the feature space constructed from the ROxford dataset. The goal is to have similar images close together and different images far away. . . 47

5.6 Hard negative distances before and after training. The middle figure shows the distribution after ImageNet classification training and the last figure shows the distribution after retrieval fine-tuning. . . 47

5.7 Qualitative examples of image retrieval when the detector has no significant impact on the retrieval results. The white rectangle shows the predicted detection. . . 48

5.8 Qualitative examples of image retrieval when the detector has a significant impact on the retrieval results. The white rectangle shows the predicted detection. . . 48


LIST OF TABLES

5.1 Retrieval results with the mAP measure. The baseline is compared against methods that construct both the database descriptor and the query descriptor from detected regions. . . 45

5.2 The effect of the query bounding box on image retrieval in terms of mAP. For smaller objects, the retrieval improves significantly with a bounding box, but for large objects the results might worsen. . . 45

5.3 Retrieval results with mAP at ranks 1, 5 and 10. The results improve significantly compared with mAP. . . 46


LIST OF SYMBOLS AND ABBREVIATIONS

ADL Attention-based Dropout Layer

Batch Normalization A method of normalizing every batch between convolutional blocks

BoW Bag of Words

CAM Class Activation Map

CNN Convolutional Neural Network

Dropout A mechanism for avoiding overfitting in neural networks

Feature Extraction A process of finding a unique representation for the input, e.g. an image

GeM Generalized Mean

MAC Maximum Activation of Convolutions

PCA Principal Component Analysis

ReLU Rectified Linear Unit

ResNet A residual deep CNN model

RPN Region Proposal Network

SIFT Scale-invariant Feature Transform

SMK Selective Match Kernels

SPoC Sum Pool of Convolutions

SURF Speeded-Up Robust Features

VLAD Vector of Locally Aggregated Descriptors

WSOD Weakly Supervised Object Detection


1 INTRODUCTION

Image retrieval is a fundamental and well-known problem in computer vision. The goal is to find relevant images based on an example of a relevant image. The applications of such a system include, among others, reverse image search and finding a particular object from a web shop based on an image of that object. This concept is shown in Figure 1.1. There are several applications for image retrieval both in industry and in consumer software. One modern application is Google reverse image search, where a user can search the internet for images similar to the query image. Another recent application is the Merlin Bird ID mobile application, where a user can upload an image of a bird and the application will attempt to predict its taxonomy and show similar birds¹. In industry, image retrieval can be applied to, for example, detecting copyright infringements².


Figure 1.1. Potential retrieval results for different queries. The top row shows how the current retrieval methods perform. The second row shows a potential retrieval result with a current method. The last row shows the correct retrieval results and the goal of this thesis.


¹ https://merlin.allaboutbirds.org/

² https://www.pixsy.com/


The goal of this thesis is to design and implement a robust method for large scale particular image retrieval, especially for small objects in cluttered scenes. Hence, recent advances in feature extraction, transfer learning, and weakly supervised object localization are used in the proposed method. The aim of the thesis is presented in Figure 1.1. In the second row of the figure, the method is distracted by the hand and the background, which can be prevalent if the retrieval object is small; in the last row, this has been corrected. In this study, convolutional neural networks are used to find bounding boxes for retrieval objects and to extract feature vectors from the detected bounding boxes, and by adding post-processing, relevant images can be located. Recent work has focused on making the feature vector as powerful as possible, and advances have been made in transfer learning of neural networks. Additionally, considerable improvements have been seen in the area of weakly supervised object detection, which utilizes the same level of annotations as the training for feature extraction.

This thesis studies a method that can be used to retrieve a particular object from a large database. The studied method is an application of a state-of-the-art feature extraction pipeline for image retrieval, with weakly supervised object localization used as pre-processing. First, a weakly supervised object detector based on convolutional neural networks is trained. This detector is then used as an additional pre-processing step for the feature extraction pipeline. The feature extraction works by using transfer learning on a convolutional neural network trained for classification, which is then fine-tuned for retrieval. A descriptor is pooled from the final convolutional layer with generalized mean pooling, after which the descriptor is normalized and whitened.

This thesis is structured as follows. Chapters 2 and 3 discuss the preliminary knowledge and the work of others in the fields of modern machine learning, convolutional neural networks, feature extraction and object localization. The proposed pipeline is introduced in Chapter 4, which describes how the pipeline is constructed and how it could be implemented. Chapter 5 discusses the experimental methods and the obtained results. Finally, Chapter 6 summarizes the study and presents the conclusions.


2 BACKGROUND

This chapter describes the background knowledge needed to understand the following sections. First, the concept of supervised machine learning is introduced. Second, the models of neural networks and convolutional neural networks are presented. Third, feature extraction is introduced in the context of image retrieval. Fourth, feature aggregation, which means combining individual features, is introduced. Fifth, image retrieval methods are presented. Finally, object localization is covered.

2.1 Supervised Machine Learning

Supervised machine learning is a method of machine learning where both the input and its desired output are known during training. For example, for a classification task, the correct class of a sample is known during training. The error between the desired output and the predicted output is calculated, and the model is adjusted towards the correct output. Common supervised machine learning tasks include classification, where the task is to predict a class from a predefined set of classes for a given input sample. Another common task in the field of computer vision is object detection. Here, the task is to find both the class and the location of an object in the input sample. This type of task requires more extensive target knowledge: both the class and the location of an object need to be known in the training images in order to utilize supervised machine learning. [12]

The target known during training is referred to as the ground truth. Usually the available data, referred to as a dataset, is split into two or three sections: the training data, the test data, and sometimes also the validation data. Training data is given to the model during training and the model is supposed to adjust to this data; the model therefore explicitly sees this data. The validation data can then be utilized to adjust the learning procedure. Even though the model does not explicitly see this data, because the training procedure is optimized based on it, the model usually performs better on this data than on previously unseen data in use. Finally, the test data is used to report the performance of the model in the general case; it is supposed to represent the accuracy of the model in use. However, any time certain test data or a benchmark is used in a publication, the test data affects the model indirectly. This is because of survivorship bias, as only the models that perform the best on the test data are published. Therefore, it is common to use several datasets and report the accuracy on all of them. If a given model performs the best on all datasets, it can be assumed to be the best model in general. Furthermore, understanding the differences between datasets can be used to analyze the strengths and weaknesses of the model. [12]

Weakly supervised machine learning is closely related to supervised machine learning. In this type of task, some knowledge of the target is required, but many of the requirements can be relaxed. For example, weakly supervised object detection requires only the class of the target during training, but not its location. These types of tasks have fundamental advantages and disadvantages. The obvious advantage is that less human labor is required, as the ground truth is less extensive. The disadvantage that follows is that the models tend to be less accurate, as less information was available during training. [6]

2.2 Convolutional Neural Networks

Convolutional neural networks are an adapted form of fully connected neural networks. To understand them, fully connected neural networks are presented briefly. The two models are presented in Figure 2.1. The discussion in this section is based on the works by Goodfellow et al. [12], Karpathy [25], and Zhang et al. [63]. Fully connected neural networks, often referred to as just neural networks, are a type of machine learning model inspired by the neurological structure of the human brain, where neurons activate or deactivate depending on the input. They consist of several layers of neurons. The neural network is a non-linear function that attempts to approximate a complex function, usually called the learning objective. This learning objective could be, for example, recognizing cats in an image or predicting housing prices given information about the houses.

Figure 2.1. Fully connected and convolutional neural networks side by side [25]. In a fully connected network, every neuron is connected to every neuron in the next layer. Conversely, in a CNN, there is a fixed number of connections and that number depends on the kernel size.

A neuron is the smallest unit of computation in a neural network and it consists of three parts. First, the neuron has several weights that determine how strongly the neuron is connected to its predecessors. Second, the neuron contains an adder that combines the inputs and weights into a single output. Third, the neuron has an activation function that determines how strongly the neuron activates. Mathematically, the neuron is represented as follows. Each weight is denoted by w_i, where i is the index of the weight, and together the weights form the weight vector w. Usually, each neuron is also assigned a bias, which is a constant that shifts the output up or down; it is sometimes denoted by b or w_0. The input to the neuron is denoted by x = (x_0, x_1, \ldots, x_n). Then, the adder can be expressed as

u = \langle \mathbf{w}, \mathbf{x} \rangle = \mathbf{w}^\top \mathbf{x}

where \langle \cdot, \cdot \rangle represents the inner product. The output of the neuron then becomes

y = \phi(u) = \phi(\mathbf{w}^\top \mathbf{x})    (2.1)

where the activation function is denoted by \phi. [25, 63]
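As an illustration of Equation 2.1, the following minimal NumPy sketch computes a single neuron's output; the sigmoid used as the activation function and the toy numbers are example choices, not prescribed by the text.

```python
import numpy as np

def sigmoid(u):
    # An example activation function phi; any non-linearity could be used.
    return 1.0 / (1.0 + np.exp(-u))

def neuron_forward(w, x, b=0.0, phi=sigmoid):
    """Single neuron: u = <w, x> + b, y = phi(u) (cf. Equation 2.1)."""
    u = np.dot(w, x) + b   # adder: weighted sum of inputs plus bias
    return phi(u)          # activation determines how strongly the neuron fires

# Toy usage with made-up numbers.
w = np.array([0.5, -1.2, 0.3])
x = np.array([1.0, 2.0, -0.5])
print(neuron_forward(w, x, b=0.1))
```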

Neurons are then combined into sequential layers. In a fully connected network, each neuron of a layer takes its input from every neuron in the previous layer, and its output is fed to every neuron in the next layer. There are three types of layers in a neural network: the input layer, the hidden layers, and the output layer. When there are several hidden layers, the network is said to be deep. Fully connected networks differ from convolutional neural networks in that every input of the current layer is connected to every neuron of the next layer. Also, the input size affects the number of connections, unlike in convolutional neural networks. [63]

Convolutional neural networks (CNNs), on the other hand, are neural networks where each layer is replaced by a convolutional kernel. A convolution is an operation where each of the kernel weights is multiplied by the input sequence, after which the kernel is slid to the next position and the multiplication is repeated. A convolution is denoted by * and is defined as

(f * g)[i] = \sum_{l=-M}^{M} f[l]\, g[i - l]    (2.2)

where i is the current index and M is the number of elements in the domain. For computer vision tasks, CNNs usually contain 2D convolutions defined as

(f * g)[i, j] = \sum_{l=-M}^{M} \sum_{m=-N}^{N} f[l, m]\, g[i - l, j - m]    (2.3)

where i, j are the current indices and M, N is the extent of the domain. For images, M and N are the image height and width. [25]
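A direct, unoptimized NumPy implementation of the 2D convolution in Equation 2.3 (written here for a single-channel image with zero padding and "same" output size) might look as follows; this is a sketch for illustration, not the implementation used in the thesis.

```python
import numpy as np

def conv2d(image, kernel):
    """Naive single-channel 2D convolution with zero padding (cf. Equation 2.3)."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)), mode="constant")
    flipped = kernel[::-1, ::-1]          # a true convolution flips the kernel
    out = np.zeros_like(image, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            region = padded[i:i + kh, j:j + kw]
            out[i, j] = np.sum(region * flipped)
    return out

# Toy usage: a 3x3 averaging kernel on a random "image".
img = np.random.rand(5, 5)
kernel = np.ones((3, 3)) / 9.0
print(conv2d(img, kernel))
```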

Each convolutional layer consists of several of these convolutional kernels. The kernels can be, for example, 3×3. The number of convolutional kernels determines the depth of the next layer. The input layer takes the input matrix of size M×N as input. The first layer usually has several kernels, outputting a convolutional volume of size M×N×L, where L is the number of kernels. After each layer, there is an activation layer, which determines which parts of the volume are activated. At some points there are down-sampling layers that reduce M and N by a factor, usually 2. At the end, there can be one or a few fully connected layers, which are similar to the ones described earlier. [25, 63]

Convolutional layers usually deal in three dimensions: the width and height of the block and an additional channel dimension. When the raw image is fed to the network, it is in three dimensions: height, width, and channel. The channel dimension is of size three for color images and one for grayscale images. This concept of a convolutional block is illustrated in Figure 2.2. The 2D convolutional kernel slides over each channel. It is up to the architecture designer to decide how many channels there will be in the next layer. In the figure, the input is a 32×32×3 image (width 32, height 32, 3 channels). The next layer has more than three channels, and there is an equal number of convolutional kernels and output channels. Each output channel has an additional bias term. Therefore, the total number of trainable parameters of a convolutional layer depends on the number of filters and the convolutional kernel height and width. For a 3×3 convolution kernel applied to a single input channel with 10 output channels, there are (3×3 + 1)×10 trainable parameters. This is a notable improvement compared with a fully connected network, where the number of parameters depends on the input size. [25]
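The parameter count in the example above can be checked with a short PyTorch sketch; PyTorch is used here only for illustration, and the thesis does not specify a framework in this section.

```python
import torch.nn as nn

def count_params(layer):
    # Sum the number of elements over all trainable tensors (weights and biases).
    return sum(p.numel() for p in layer.parameters())

# 3x3 kernel, 1 input channel, 10 output channels: (3*3 + 1) * 10 = 100 parameters.
conv_single = nn.Conv2d(in_channels=1, out_channels=10, kernel_size=3)
print(count_params(conv_single))   # 100

# With 3 input channels the weights grow to 3*3*3*10, plus 10 biases: 280 parameters.
conv_rgb = nn.Conv2d(in_channels=3, out_channels=10, kernel_size=3)
print(count_params(conv_rgb))      # 280
```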

Like all neural networks, CNNs are trained with gradient-based iterative optimization. This means that there is a loss function that outputs a single number representing how far the current model is from the desired target. Then, the gradient of the loss function is calculated with respect to the network weights, and the weights are updated in the opposite direction of the gradient to reduce the loss function in the next iteration of the optimization. Once the loss function has converged with respect to iterations, the training can be stopped. [25]

Figure 2.2. Convolutional block [25]. Each convolutional kernel slides over the input layer and outputs the result of the convolution to the corresponding place on the output convolutional block. The process is repeated for every input layer.

A particularly interesting CNN architecture is ResNet-50 [18], which will be utilized in this thesis. ResNet-50 is a 50-layer residual network; the closely related ResNet-18 is shown in Figure 2.3. It differs from a regular CNN by having shortcuts that implement the residuality. Additionally, it utilizes Batch Normalization [20]. Batch normalization is a special component in CNN architectures that normalizes each batch after a convolutional layer and before the non-linear layer. This helps the model generalize, because it forces the output of the convolutional layer to have zero mean and unit standard deviation. Batch normalization is a differentiable layer that performs pre-processing at each block of the network and has been crucial in many neural network architectures. [12, 25]


Figure 2.3. ResNet-18 architecture [63, Ch. 7]. The ResNet-50 architecture follows this same idea but is deeper.


Mathematically, batch normalization is described as follows. Let B be a batch of size m from the training dataset X. Then, the mean \mu_B and variance \sigma_B^2 of the batch can be calculated with the following formulas:

\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i    (2.4)

\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2    (2.5)

where x_i are the inputs to the layer. The normalization then happens by subtracting the mean and dividing by the standard deviation, as shown by

x_{BN} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}    (2.6)

where \epsilon is a small constant added for numerical stability. In some cases the variance might vanish, which would lead to division by zero, and hence \epsilon > 0 is added to make \sigma_B^2 + \epsilon > 0 for all B \in X. [63]

Batch normalization is a layer that is placed in the neural network after a convolutional layer and in front of the non-linear activation layer. At most one can be placed for each hidden layer, or for any smaller number of layers. For CNNs, the layer is applied to each output channel separately, with the statistical parameters \mu_B and \sigma_B^2 computed from the batch for each channel. The layer behaves differently in training mode and in evaluation mode. In training mode, the batch statistics, mean and standard deviation, are calculated for each batch individually. In the test phase, the statistics of the whole training data are used, which in essence means that the batch size is set to the training dataset size. It is not possible to use the whole training data for batch statistics during training, because the statistics depend on the current weights; every time the model is updated, the statistics change. Once the model is trained, it is possible and thus desirable to calculate the statistics for each layer from the entire dataset. [20, 63]
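A minimal NumPy sketch of the training-time batch normalization computation in Equations 2.4-2.6 for a batch of CNN feature maps (per-channel statistics, as discussed above); the learnable scale and shift parameters used in practice are omitted for brevity.

```python
import numpy as np

def batch_norm_train(x, eps=1e-5):
    """Normalize a batch of feature maps x with shape (batch, channels, H, W).

    Statistics are computed per channel over the batch and spatial dimensions,
    following Equations 2.4-2.6.
    """
    mu = x.mean(axis=(0, 2, 3), keepdims=True)    # Equation 2.4, per channel
    var = x.var(axis=(0, 2, 3), keepdims=True)    # Equation 2.5, per channel
    return (x - mu) / np.sqrt(var + eps)          # Equation 2.6

# Toy usage: a batch of 8 feature maps with 16 channels of size 32x32.
x = np.random.randn(8, 16, 32, 32)
y = batch_norm_train(x)
print(y.mean(axis=(0, 2, 3))[:3], y.std(axis=(0, 2, 3))[:3])  # approximately 0 and 1
```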

Another key component in the ResNet architecture is the Rectified Linear Unit (ReLU) activation function [33], which acts as the non-linearity in the network. The function passes through the positive elements of the input: inputs below zero are set to zero and other inputs are kept as they are. Mathematically, this is expressed as

y = \max(0, x)    (2.7)

where y and x are the output and input of the activation function respectively, and max is a function that returns the maximum of its arguments. The function is unfortunately non-differentiable at x = 0, in which case the left derivative is taken. The derivative of ReLU therefore equals 0 for x \leq 0 and 1 otherwise; in other words, the derivative is the step function. [18, 33, 63]

There are also two types of pooling in the ResNet architecture: max-pooling and average pooling. The idea of a pooling layer is to select certain values from the input and discard others such that the dimensions of the input are reduced. A typical example for convolutional networks is the max-pooling layer that looks at each 2×2 region in the input and outputs the maximum of the four values. An average pooling layer, on the other hand, takes the average of the four values. In ResNet, the average pooling layer is a so-called global average pooling, which means that the average is taken over the entire spatial extent of each channel. The proposed ResNet architecture [18] uses a max-pooling layer after the first convolution and a global average pooling layer at the end to fix the output vector size. [18, 63]

The idea behind the ResNet architectures is to design a very deep yet expressive neural network architecture. The approach is applicable to both convolutional and fully connected layers. Usually, when adding more layers, the neural network becomes different but might actually perform worse. This is why the ResNet architecture includes shortcuts in its design. These shortcuts attempt to make sure that adding more layers does not make the new, deeper network worse. A shortcut connects the output of a previous layer to a later layer by adding them together. This architecture is visualized in Figure 2.4. A residual block consists of two layers with learnable parameters, for example convolutional or fully connected layers. The input to the block is denoted by x, and the block proceeds sequentially as follows. First, the input goes through the first learnable layer.

Second, a batch normalization layer is added. Third, the output of the previous layer is activated with a ReLU layer. Fourth, the second learnable layer follows. Fifth, another batch normalization layer is added. These five layers form the function F. To this output, the input is added. Mathematically, this is expressed as

y = \sigma(F(x) + x)    (2.8)

where y is the output of the residual block and \sigma is the ReLU activation function. The function F is simply the chained function of the five layers it represents. To express this mathematically, the following notation is used:

f(g(x)) = f \circ g(x)    (2.9)

where f and g are functions. Now F can be expressed as

F = BN \circ W_2 \circ \sigma \circ BN \circ W_1 x    (2.10)

where W_1 and W_2 are the connection weights of the two learnable layers and BN is the batch normalization function. This form assumes fully connected layers. For convolutional layers, the multiplications with W need to be replaced by convolutions between the convolutional kernel and the output of the previous functions. Then, the addition in Equation 2.8 is an element- and channel-wise addition. Mathematically, this is expressed as

F = BN \circ K_2 * \sigma \circ BN \circ K_1 * x    (2.11)

where K_1 and K_2 are the convolutional kernels and * is the convolution operator. It should be noted that these equations assume that the input and output of F have the same dimensions. If that is not the case, the input can be brought to the same dimensions by multiplying it with a linear projection matrix W_s. [18, 63]

Figure 2.4. Residual block for ResNet architectures [63, Ch. 7]. The left block is a regular block and the right is a residual block. The residual block shows the residuality of the network, where the input to the block is again connected to the output.

All neural networks aim to learn some representation, and the ResNet architecture aims to make this representation easier to learn. Learning it is usually difficult for deep neural networks. In theory, it should be possible to approximate the optimal solution better with deeper networks, as the added capacity from additional parameters allows a closer, or at least equal, approximation of the optimal solution. This is because, in the case where a shallower network is able to perfectly approximate a given representation, a deeper one should be able to do the same by setting the additional layers to an identity mapping, which means outputting the input of the layer. [18]

Mathematically, given an input x, a block of layers in a neural network should be able to approximate a mapping H(x). If it is assumed that a block of layers can approximate a given mapping, then it should also be able to approximate the residual mapping H(x) − x. Then, the block of layers can be made to learn this residual representation instead of the original H. This is denoted by F(x) = H(x) − x, which means that when optimizing for H, the network representation becomes F(x) + x, which is similar to Equation 2.10. This formulation can be learned more easily than the original one. In the case where a shallower network is able to best approximate the original mapping, this formulation should be no worse: F(x) should approach zero and the block becomes an identity block, outputting its input. It is easier for the optimizer to learn such a representation, as it should be easier for F(x) to approach zero, and thus for F(x) + x to become the identity mapping x, than for F(x) to approach x directly. [18]
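To make the structure concrete, the following PyTorch sketch implements a basic residual block along the lines of Equations 2.8 and 2.11; it is a simplified illustration assuming equal input and output dimensions, not the exact ResNet-50 block.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Basic residual block: y = ReLU(F(x) + x), cf. Equations 2.8 and 2.11."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = self.bn1(self.conv1(x))      # K1 * x followed by batch normalization
        out = F.relu(out)                  # non-linearity sigma
        out = self.bn2(self.conv2(out))    # K2 * (...) followed by batch normalization
        return F.relu(out + x)             # add the shortcut and apply the final ReLU

# Toy usage: one block applied to a batch of 2 feature maps with 16 channels.
block = ResidualBlock(16)
print(block(torch.randn(2, 16, 8, 8)).shape)   # torch.Size([2, 16, 8, 8])
```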

Dropout [51] is a mechanism by which a neural network can be generalized, hence preventing overfitting. It was originally proposed for fully connected neural networks. The basic idea is to prevent certain neurons from activating by multiplying their output by zero before the next layer. In this way, the neural network cannot rely on a few neurons to make the predictions, but needs to spread the prediction power somewhat evenly. Hence, it is an inexpensive method of regularization. In essence, the full neural network is divided into an ensemble of subnetworks, and each subnetwork occurs when certain neurons are switched off. This works by assigning a probability p for a given neuron to be included in the subnetwork, and conversely a probability 1 − p to be excluded. The value of p depends on the amount of regularization desired and the network width; a common value could be, for example, 0.5. With p, a binary mask is calculated for each batch during training, so that on average p × 100% of the neurons are included. The binary mask is recalculated for each batch. Dropout is implemented as a layer that is injected before each applicable layer. [6, 12]

For CNNs, dropout is not as effective as described above. This is because neighboring pixels in an image are strongly related, and hence the convolutional activations are also autocorrelated. This problem is overcome by Tompson et al. [57], who present spatial dropout. In this method, an entire feature map is dropped out instead of individual positions on the feature map. This preserves the autocorrelation within a feature map while bringing the generalization benefit of dropout to CNNs, making individual feature maps less dependent on each other. Additionally, the strongly activated pixels contain the most information in a feature map and hence are the most important, so dropping them out would be highly beneficial for regularization. To this end, Park & Kwak [36] propose dropping out the maximally activated pixel on a feature map instead of a random pixel. This has the benefit of spreading the important pixels and thus having the same effect as standard dropout. [63]
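The difference between standard and spatial (channel-wise) dropout can be sketched in a few lines of NumPy; this is an illustrative toy (the rescaling by 1/p used in practice is omitted), not the ADL layer used later in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

def standard_dropout(x, p=0.5):
    """Zero out individual activations independently, keeping each with probability p."""
    mask = rng.random(x.shape) < p
    return x * mask

def spatial_dropout(x, p=0.5):
    """Zero out entire feature maps (channels) of a (channels, H, W) tensor."""
    mask = rng.random((x.shape[0], 1, 1)) < p   # one keep/drop decision per channel
    return x * mask

feats = rng.standard_normal((4, 8, 8))           # 4 feature maps of size 8x8
kept = spatial_dropout(feats).reshape(4, -1).any(axis=1)
print(int(kept.sum()), "of 4 channels kept")
```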

2.3 Feature Extraction

Feature extraction is the process of finding unique and descriptive small areas in an image. Such an area might include, for example, the coordinates of its center, as well as its size and orientation. Traditionally, the feature extraction process consists of extracting features and descriptors separately. Features are the locations of the areas and descriptors are unique identifiers of a given feature. In other words, each feature will ideally have its own unique descriptor. There are several classical methods for feature and descriptor extraction, the most common being SIFT [11] and SURF [3]. These work by mathematically determining keypoints in the image based on low-level image processing to find the feature locations, and then extracting the descriptor around each detected feature. Usually the feature is either a point in the image or a small area. The descriptor, on the other hand, is a one-dimensional vector of custom length, for example a power of two. It can be noted that the features do not need to be keypoints, even though they usually are. Sometimes the features can be lines, curves, or patches. [52, Ch. 4]


The traditional methods first detect the features and then extract the descriptors. In addition to SIFT, Hessian affine features [38] are commonly used in the context of image retrieval [41]. These have the advantage of being extremely fast to compute, since they require only matrix multiplications to detect the features. The disadvantage is that they are hand-crafted, which means that they are not based on actual images but rather on mathematical models and assumptions about the image data. Therefore, these models are best suited for general cases when training data is not available, there is very little of it, or the problem domain is large. Additionally, these models do not require any training data, which can be an advantage.

The modern data-driven methods are usually based on CNN models. These models can be roughly divided into two categories: local feature and descriptor extractors and global feature extractors. The local feature extractors are similar to the traditional methods in that they extract features that can be pinpointed to a certain region in the image, such as a key point at location (x, y). The global feature extractors, on the other hand, consider the image as a whole. Here, the distinction between features and descriptors is not obvious, as usually only a single vector is returned for a single image. Therefore it can be said that global feature extractors transform the image into the feature space directly.

The CNNs used for these in the context of image retrieval are usually common classification backbones fine-tuned for feature extraction, but for generic feature extraction custom architectures can be used. [41] Research shows [41] that data-based feature extractors outperform the traditional models. This might be because the models are able to fit to the training data. The data-based models can also consider a larger area around the feature as well as use the whole image when detecting features to account for repeatability and illumination changes. These are difficult to model analytically for the hand-crafted feature detectors.

Feature extraction is the most fundamental step in creating a representation for an image and one of the most important parts of image retrieval. For this task, the features of an image should be as unique as possible, so that features from similar images are close together in the feature space and features from dissimilar images are far away. Traditional feature extraction algorithms rely on localizing hand-crafted features and then extracting hand-crafted descriptors around the features [11, 48]; the features are first detected and then the descriptors extracted. In the traditional sense, features mean the locations of the descriptors, but since deep global features consist only of descriptors, these descriptors are also known as features. In deep learned feature extraction, the general pipeline is to first train the network to discriminate dissimilar images. The feature is then extracted by pooling a convolutional layer. [41]

Classical feature extraction algorithms work by calculating gradients in the image to recognize unique points of interest that could potentially be found in other images of the same scene. Then, an area around each point of interest is examined and encoded into a fixed-length descriptor. To match two images, one needs to find descriptors from both images; if enough similar descriptors are found, the two images are a match. This can additionally be geometrically verified by ensuring that the features are located within a linear transformation from one image to another. This means, for example, that descriptors from two images of the same statue from the same position, but with one image rotated, need to be able to be aligned by rotating the points of interest from the other image. [15]

Since the introduction of CNNs for classification in 2012 [29], CNN-based descriptors have been used successfully. In this approach, the descriptors are extracted as they are detected. Usually, the final convolutional layer is pooled for the descriptor using various schemes. Babenko et al. [2] propose using the activations of CNNs for image retrieval. They use a CNN trained for classification with the ImageNet [49] dataset. They experiment with taking the descriptor from different parts of the network and notice that the layers in the middle yield better results than the final fine-tuned fully connected layers. A similar conclusion is reached by Ng et al. [34]. Babenko et al. also note that training with a classification loss on images that are similar to the retrieval images improves retrieval performance, as the network learns a representation for that type of image as well.

Gordo et al. propose fine-tuning a CNN to produce these descriptors directly, rather than using a so-called off-the-shelf network like Babenko et al. They use a three-stream Siamese network to learn this representation with a triplet loss [4] that pushes representations of similar images together and of different images away from each other.

The three streams of the network share their weights and each is shown a single image at a time: a query reference image, a positive image, and a negative image. The loss attempts to push the positive image representation close to the query representation and the negative representation away from the query representation. Mathematically, this is expressed as

L(I_q, I_p, I_n) = \frac{1}{2} \max\left(0,\; m + \|d_q - d_p\|^2 - \|d_q - d_n\|^2\right)    (2.12)

where I_q, I_p, I_n are the query image, positive image, and negative image respectively, m is the margin that controls the separation between positive and negative pairs, and d_q, d_p and d_n are the descriptors of the query, positive and negative image respectively. [13]

Another loss function for retrieval is the contrastive loss [7], used for example by Radenovic et al. [42]. It has the similar goal of placing similar images close together and dissimilar images far apart. However, with the contrastive loss the network is a two-stream Siamese network. The idea is to use pairs of images for training, with the label y indicating whether they are a positive pair (similar images, y = 1) or a negative pair (different images, y = 0). Mathematically, the contrastive loss is defined as

L(I_i, I_j) =
\begin{cases}
\frac{1}{2}\|d_i - d_j\|^2, & \text{if } y = 1 \\
\frac{1}{2}\max\left(0,\; \tau - \|d_i - d_j\|\right)^2, & \text{if } y = 0
\end{cases}    (2.13)

where I_i and I_j are two images, d_i and d_j their respective descriptors, and \tau is a scalar controlling the margin of the loss.
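A small NumPy sketch of the two losses in Equations 2.12 and 2.13, written for single descriptor vectors; it is meant only to illustrate the formulas, and batching and gradient computation are left out.

```python
import numpy as np

def triplet_loss(d_q, d_p, d_n, margin=0.1):
    """Equation 2.12: pull the positive towards the query, push the negative away."""
    pos = np.sum((d_q - d_p) ** 2)
    neg = np.sum((d_q - d_n) ** 2)
    return 0.5 * max(0.0, margin + pos - neg)

def contrastive_loss(d_i, d_j, y, tau=0.7):
    """Equation 2.13: y = 1 for a matching pair, y = 0 for a non-matching pair."""
    dist = np.linalg.norm(d_i - d_j)
    if y == 1:
        return 0.5 * dist ** 2
    return 0.5 * max(0.0, tau - dist) ** 2

# Toy usage with random unit-normalized descriptors.
rng = np.random.default_rng(1)
d = [v / np.linalg.norm(v) for v in rng.standard_normal((3, 128))]
print(triplet_loss(d[0], d[1], d[2]), contrastive_loss(d[0], d[1], y=0))
```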

2.4 Feature Aggregation

After the features have been extracted, they can be aggregated. This means that from multiple features and descriptors per image, a single feature representation is constructed. For example, one could extract a thousand features and descriptors from an image. Then after aggregation, there would be only a single combined feature and descriptor left. [41, 54]

The feature aggregation begins with a clustering algorithm. Clustering is an unsupervised machine learning method, where the data is divided into a certain number of groups based on the data itself, with no labels needed. However, to use clustering for feature aggregation, the clusters need to be learned from some training data. This can be either database data from a benchmark dataset or the training data used to learn the feature extractor. This leads only to a quantization, where each descriptor is simplified to a cluster center, but does not yet amount to aggregation. Aggregation then means combining the assigned cluster centers into a single vector. This makes the feature search faster, but it is not strictly necessary. [41, 54]

The aggregation can be represented with a mathematical function that maps the individual features into a single embedding. This usually requires matching each descriptor to a visual word, in other words, a cluster center obtained from the clustering. Usually, these kernels involve some sort of pooling to construct each element of the vector. A simple aggregation method could be, for example, taking, for each position i ∈ {0, 1, . . . , N − 1}, the maximum element at that position over the descriptor vectors x, where N is the number of elements in a descriptor vector. [54]

For the global CNN-based feature extractors, the feature aggregation usually happens within the network architecture. The features are represented by the last convolutional layers of the CNN, and the single feature vector is constructed by globally pooling the last layer of the block, for example by taking the average over the spatial extent of each channel. [41]
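As an example of such global pooling, the following NumPy sketch computes average, maximum, and generalized mean (GeM, listed in the abbreviations) descriptors from a convolutional feature map; the exponent p = 3 is only a commonly used illustrative value, not taken from the thesis.

```python
import numpy as np

def global_avg_pool(fmap):
    """SPoC-style descriptor: average over the spatial dimensions of a (C, H, W) block."""
    return fmap.mean(axis=(1, 2))

def global_max_pool(fmap):
    """MAC-style descriptor: maximum activation per channel."""
    return fmap.max(axis=(1, 2))

def gem_pool(fmap, p=3.0, eps=1e-6):
    """Generalized mean pooling; p = 1 gives average pooling, large p approaches max."""
    clipped = np.clip(fmap, eps, None)           # GeM assumes non-negative activations
    return (clipped ** p).mean(axis=(1, 2)) ** (1.0 / p)

fmap = np.abs(np.random.randn(512, 7, 7))        # a toy 512-channel activation block
desc = gem_pool(fmap)
desc = desc / np.linalg.norm(desc)               # L2 normalization, as in retrieval pipelines
print(desc.shape)                                 # (512,)
```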

2.5 Image Retrieval

Image retrieval is a task that can be divided into two separate sub-categories: Content-Based Image Retrieval and Text-Based Image Retrieval. This thesis focuses on the former: given an example of the desired results, how is it possible to find the other relevant images in a database? This retrieval is done completely using image processing techniques. The latter is a retrieval task performed in the text domain: all images are annotated with keywords and retrieval is done by matching text describing the images. A major difference is also in the query. In content-based image retrieval, the query is an image, a sketch or some other visual media that is an example of the desired results. In text-based image retrieval, on the other hand, the query is a textual description given by the user. [39]


Figure 2.5. Overview of a generic image retrieval pipeline. The service provider initially has a group of unorganized images, from which features are extracted. The user can then search from this database with an example image. Features are extracted from the example image similarly, and the features in the database that are closest to the example image correspond to the retrieved images.

The goal in image retrieval is to find all relevant images from a database given an example image. This poses several challenges. First, determining the particular object in the image that the user desires to be retrieved can be challenging. Second, the viewpoint between relevant images can vary widely. For example, one could have two images of the Tower Bridge in London but from different viewpoints and thus the two images will be quite different. Alternatively, the images might be from different distances. Third, there might be occlusions, truncations and differences in lighting in the images which will make the retrieval hard. The particular problem studied in this thesis is the problem of small objects in the image. There might be cases where the user is not able to go close to the object, or the user might not have image processing software in hand to indicate the desired region in the image. Additionally, using automation to detect regions in the image makes the image retrieval process faster and thus could make it more accessible for the user. Furthermore, the scene might be very cluttered with a lot of irrelevant objects and thus using automation could make image retrieval a viable solution to this problem. Such a case might be a bird on a safari where there might be a lot of trees, plants, and other animals besides birds in the image but the user is only interested in finding the bird.

Image retrieval therefore involves two stakeholders, the user and the service provider. The user provides the example image and expects relevant images in return. The service provider has the database from which to search and conducts the search, given the query example. An overview of image retrieval is presented in Figure 2.5.

Once there is a single vector representing each image in the database, the search happens by comparing the query image with each database image. The search can be made faster by using a tree representation, but some accuracy is lost. For the most accurate image retrieval results, the search is conducted as a full nearest neighbor search. In the nearest neighbor search, the query vector is compared with each database vector using a distance metric. Usually, this metric is L2, but others can be used. The k closest matches are then the k nearest neighbors.
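A brute-force nearest neighbor search over L2-normalized descriptors can be written in a few lines of NumPy, as sketched below; the database here is a random placeholder matrix.

```python
import numpy as np

def retrieve(query, database, k=5):
    """Return the indices of the k database descriptors closest to the query (L2 distance).

    query: (D,) descriptor, database: (N, D) matrix of descriptors.
    """
    dists = np.linalg.norm(database - query, axis=1)   # L2 distance to every database vector
    ranking = np.argsort(dists)                        # ascending order: closest first
    return ranking[:k], dists[ranking[:k]]

# Toy usage with a random database of 1000 descriptors of dimension 128.
rng = np.random.default_rng(42)
db = rng.standard_normal((1000, 128))
db /= np.linalg.norm(db, axis=1, keepdims=True)
q = db[17] + 0.01 * rng.standard_normal(128)           # a noisy copy of database entry 17
print(retrieve(q, db, k=3)[0])                          # entry 17 should rank first
```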

For cases where there are multiple vectors per image, the search gets somewhat more complicated. This case might arise if there are multiple regions in the image, from each of which a single vector is constructed, or if the descriptors are not aggregated. In this case, the matching becomes many-to-many matching. The simplest solution would be to aggregate the multiple vectors and conduct the nearest neighbor search as described earlier. Otherwise, the results are retrieved for each vector in the query and then combined in some way, for example based on the distance. This case is, however, out of the scope of this thesis.

2.6 Localization

Localization is the process of locating a given object in an image. Challenges in object localization include distractions, object truncations and occlusions, and intra-class variation. A distraction refers to object-like visuals in the image that do not represent the object but might mislead the object detector. Truncation refers to the object not being completely visible in the image, as it is cut off by the frame. Occlusions occur when another object is in front of the desired object. Intra-class variation refers to a class containing multiple different instances and appearances of the same class. All of these except intra-class variation also affect image instance retrieval, and hence localization can help in image retrieval by solving part of these problems, especially by removing distractions.

Traditionally, a sliding window [8, 16] has been used for object detection. The algorithm is outlined as follows. A detection window is set at a corner of the input image. The contents of the window are sent to an object/non-object classifier. If the classifier predicts an object, the current window is a detection result. Then the window is slid to the next position. After the whole image has been exhaustively searched, the window size is changed and the search is conducted again. To prevent multiple overlapping detection results, non-maximum suppression is run at the end. The sliding window method was improved by Uijlings et al. [58] in a method called selective search, where the image is over-segmented into regions using a graph-based method [9], after which the regions are aggregated and region proposals are generated. The core benefit of selective search is that the image is not exhaustively searched; instead, the possible locations are found first and then evaluated.
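The sliding window procedure described above can be sketched as follows; classify_window is a hypothetical stand-in for any object/non-object classifier, and the window sizes and stride are arbitrary example values (non-maximum suppression is omitted).

```python
import numpy as np

def sliding_window_detect(image, classify_window,
                          window_sizes=((64, 64), (128, 128)), stride=16):
    """Exhaustive sliding window detection.

    classify_window(patch) is assumed to return True if the patch contains the object.
    """
    detections = []
    h, w = image.shape[:2]
    for win_h, win_w in window_sizes:                   # repeat the scan for each window size
        for y in range(0, h - win_h + 1, stride):
            for x in range(0, w - win_w + 1, stride):
                patch = image[y:y + win_h, x:x + win_w]
                if classify_window(patch):              # object/non-object decision
                    detections.append((x, y, win_w, win_h))
    return detections

# Toy usage: a dummy classifier that fires on bright patches of a random image.
img = np.random.rand(256, 256)
boxes = sliding_window_detect(img, lambda p: p.mean() > 0.55)
print(len(boxes))
```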

Currently, supervised object detectors [31, 44, 45] dominate the field of object recognition in terms of accuracy. However, these methods need a large dataset of images with instance-level annotations, which means that all object regions are annotated. This is a slow, tedious and costly process. These detectors are CNNs that can be classified into two categories: single-stage detectors and multi-stage detectors. In single-stage detectors, the input is passed through a single unified network only once, while multi-stage detectors have multiple streams, namely a network predicting class-agnostic object locations and a network classifying the detections. Multi-stage networks are therefore inherently slower than single-stage networks, but they can be more accurate. Single-stage networks include SSD [31] and YOLO [44], while multi-stage networks include Faster R-CNN [45] and Mask R-CNN [17].

Figure 2.6. Example of CAM-based weakly supervised object detection. Detection is based on convolutional activations, which are represented by colors. The bounding box is also visualized.

Recently, several weakly supervised methods have started to show promising results. These output either bounding boxes [6, 65] or segmentations [60, 61]. These methods need only image-level annotations, which means that only the class of each object in the image is needed for training the detector. They rely on Class Activation Maps (CAMs) for localization. CAMs are the convolutional outputs before the fully connected layers in classification CNNs. The parts of the image that contain discriminative features likely also contain the desired object. Using a vanilla CNN classifier for localization leads to detection results that are too small, and hence a dropout layer can be added, as proposed by Choe & Shim [6] in the method called the Attention-based Dropout Layer (ADL). ADL is utilized in this thesis for pre-processing. An example of object detection with this method is shown in Figure 2.6.
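The conversion from a class activation map to a bounding box (the procedure later illustrated in Figure 4.4: threshold the heatmap, find contours, and take the largest one) can be sketched with OpenCV as follows; the threshold value is an arbitrary example.

```python
import cv2
import numpy as np

def cam_to_bbox(cam, threshold=0.4):
    """Turn a class activation map (H, W, values in [0, 1]) into one bounding box.

    The map is thresholded, contours are extracted, and the bounding box of the
    largest contour is returned as (x, y, width, height).
    """
    binary = (cam >= threshold).astype(np.uint8)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    largest = max(contours, key=cv2.contourArea)
    return cv2.boundingRect(largest)

# Toy usage: a synthetic heatmap with one bright blob.
cam = np.zeros((64, 64), dtype=np.float32)
cam[20:40, 10:30] = 1.0
print(cam_to_bbox(cam))   # approximately (10, 20, 20, 20)
```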


3 RELATED WORK

This chapter introduces work by others that is relevant to image retrieval. First, feature extraction and image retrieval methods are covered. These describe how related images can be found from a pool of features extracted from the database images, including how to construct the feature vector, which is also known as the descriptor. Second, methods that use localization are introduced. These methods use a localization method to narrow down or draw attention to specific areas of the image when constructing the descriptor. Finally, methods that use post-processing are introduced.

3.1 Features for Image Retrieval

In the image retrieval pipeline, features are extracted from all images. This raises two important questions, namely, how to extract features from images and how to search for related images using only the extracted features. Two types of approaches have been proposed in the literature for the feature space search: an inverted file structure and a single embedding. Before the rise in popularity of global CNN features for image retrieval, local features and the Bag-of-Words (BoW) model [39, 50] dominated the literature. In these methods, the feature space consists of local invariant features that are extracted from an image and then quantized with a codebook generated by clustering. These methods can apply either the inverted file structure or a single embedding. Similarly, CNN-based methods have applied both approaches. The difference is that only local features can use the inverted file structure, whereas local features can also be combined into a single global feature (embedding) and thus use the single embedding approach. [41, 64]

The inverted file method for feature space search resembles the idea used in document search. In document search, the document is first separated into words. Second, each word appearing in a document is stripped or quantized into its basic form; this means that "walking" becomes "walk". Third, irrelevant words like "the" are removed. Finally, an index is created. In the index, each word is mapped to the documents in which it appears.

A similar pipeline can be constructed for images, apart from the third step. In the first step, the features are extracted from an image. Then, the extracted features are assigned to visual words. The available visual words come from a vocabulary (also known as a codebook) that is learned by clustering on a dataset similar to the database images. The visual words can be, for example, the cluster centers of a K-Means algorithm. Finally, for each visual word, all images in which that word appears are listed. The images that appear for the most visual words of a given query are then selected as the top retrieval results. [39, 41, 50, 64]
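As an illustration of this pipeline, the sketch below quantizes local descriptors to their nearest codebook center and builds an inverted index from visual words to image identifiers. The toy codebook, descriptors and scoring by shared word counts are simplified assumptions rather than any particular published system.

```python
import numpy as np
from collections import defaultdict

def assign_words(descriptors, codebook):
    """Assign each local descriptor to its nearest codebook center (visual word)."""
    d = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    return d.argmin(axis=1)

def build_inverted_index(database, codebook):
    """Map each visual word to the set of images in which it appears."""
    index = defaultdict(set)
    for image_id, descriptors in database.items():
        for word in assign_words(descriptors, codebook):
            index[word].add(image_id)
    return index

def query(index, query_descriptors, codebook):
    """Rank database images by how many query visual words they share."""
    votes = defaultdict(int)
    for word in set(assign_words(query_descriptors, codebook)):
        for image_id in index[word]:
            votes[image_id] += 1
    return sorted(votes, key=votes.get, reverse=True)

# Toy example: 2-D "descriptors", a 3-word codebook and two database images
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
database = {"img_a": np.array([[0.1, 0.0], [1.1, 0.9]]),
            "img_b": np.array([[2.1, 1.9]])}
index = build_inverted_index(database, codebook)
print(query(index, np.array([[0.9, 1.0]]), codebook))  # -> ['img_a']
```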

The single embedding approach can be applied to local features that are aggregated into a single descriptor per image, or to global features that naturally consist of a single vector per image. For local features, methods such as the Vector of Locally Aggregated Descriptors (VLAD) [22] and Selective Match Kernels (SMK) [54] and its variants [55] can be used. VLAD was initially developed for constructing a single descriptor for an image from local SIFT features. SMK similarly offers a method for constructing a single embedding from multiple feature vectors. The pipeline for constructing a single embedding from several local features consists of two steps: embedding and aggregation. In the embedding step, each local feature is transformed into a higher-dimensional space. Then, in the aggregation phase, all embedded features are combined into a single representation.

The following notation and mathematics obey the framework set by Tolias et al. [55]. In all of these methods, the image X is represented by a set of M vectors, the local descriptors X = {x_1, x_2, . . . , x_M}, each of dimension D. The codebook C represents the learned vocabulary of visual words and is usually learned with K-Means. In this pipeline, each local descriptor is assigned to a visual word c ∈ C through a nearest neighbor quantizer q(x), and the set of descriptors quantized to c is denoted X_c = {x ∈ X | q(x) = c}. Using this notation, the similarity K between two images X and Y, with their respective sets of local descriptors X and Y, can be calculated in general as

$$ K(X, Y) = \gamma(X)\,\gamma(Y) \sum_{c \in \mathcal{C}} \sigma\big(\phi(X_c)^\top \phi(Y_c)\big) \tag{3.1} $$

where

$$ \gamma(X) = \frac{1}{\sqrt{\sum_{c \in \mathcal{C}} \sigma\big(\phi(X_c)^\top \phi(X_c)\big)}} $$

is a normalization factor, σ(·) is the selectivity function, and ϕ(X_c) is the aggregated representation of the descriptors assigned to c.

For VLAD, with this embedding notation, ϕ(X_c) is represented by

$$ V(X_c) = \sum_{x \in X_c} \big(x - q(x)\big) \tag{3.2} $$

and the selectivity function σ(·) is

$$ \sigma(u) = u \tag{3.3} $$

where u is the input to the function. This means that the residuals between all the local descriptors and their codebook counterparts are summed together and the selectivity function is the identity function. Substituting these into Equation 3.1, it becomes

$$ K_{VLAD}(X, Y) = \gamma(X)\,\gamma(Y) \sum_{c \in \mathcal{C}} V(X_c)^\top V(Y_c) \tag{3.4} $$

which is the similarity according to the VLAD method of generating a descriptor. This method of aggregation beats the BoW approach in retrieval performance, memory consumption, and search speed.
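A minimal numpy sketch of VLAD-style aggregation and the resulting similarity is given below. The toy codebook and descriptors, and the L2 normalization of the aggregated vector used in place of the γ factors, are simplifying assumptions for illustration.

```python
import numpy as np

def vlad(descriptors, codebook):
    """Aggregate local descriptors into a VLAD vector (Eq. 3.2), L2-normalized.

    descriptors: (N, D) local descriptors of one image.
    codebook:    (K, D) visual-word centers.
    Returns a flattened (K*D,) vector; the normalization plays the role of
    the gamma factors in Eq. 3.4 when the identity selectivity is used.
    """
    K, D = codebook.shape
    words = np.linalg.norm(
        descriptors[:, None, :] - codebook[None, :, :], axis=2).argmin(axis=1)
    v = np.zeros((K, D))
    for x, c in zip(descriptors, words):
        v[c] += x - codebook[c]          # sum of residuals per visual word
    v = v.ravel()
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

# Similarity between two toy images (Eq. 3.4 reduces to a dot product here)
codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
x_img = np.array([[0.2, 0.1], [1.2, 0.9]])
y_img = np.array([[0.1, 0.2], [0.9, 1.1]])
print(vlad(x_img, codebook) @ vlad(y_img, codebook))
```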

SMK differs from VLAD by using a different selectivity function and by using normalized residuals. Mathematically, the SMK kernel is described as

$$ K_{SMK}(X, Y) = \gamma(X)\,\gamma(Y) \sum_{c \in \mathcal{C}} \sigma_\alpha\big(\hat{r}(X_c)^\top \hat{r}(Y_c)\big) \tag{3.5} $$

where σ_α is the selectivity function, described mathematically as

$$ \sigma_\alpha(u) = \begin{cases} \operatorname{sign}(u)\,|u|^\alpha, & \text{if } u > \tau \\ 0, & \text{otherwise} \end{cases} $$

and r̂ is the normalized residual, calculated as

$$ \hat{r}(x) = \frac{x - q(x)}{\lVert x - q(x) \rVert} $$

for every descriptor x. The selectivity function intuitively selects the relevant descriptor pairs by setting all non-relevant pairs, for which u ≤ τ, to zero. The relevant pairs are then emphasized by raising them to the power of α. Tolias et al. [55] describe using α = 3 and τ ≥ 0.
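The following sketch, under the same toy assumptions as above, implements the normalized residual and the thresholded power selectivity function of Eq. 3.5; α = 3 follows the value reported by Tolias et al. [55], while τ = 0 is one illustrative choice within the reported range.

```python
import numpy as np

def normalized_residual(x, center):
    """r_hat(x) = (x - q(x)) / ||x - q(x)|| for a descriptor and its visual word."""
    r = x - center
    return r / np.linalg.norm(r)

def selectivity(u, alpha=3.0, tau=0.0):
    """sigma_alpha(u): zero out weak matches, emphasize strong ones."""
    return np.sign(u) * np.abs(u) ** alpha if u > tau else 0.0

# Two descriptors assigned to the same visual word
center = np.array([1.0, 1.0])
u = normalized_residual(np.array([1.3, 1.1]), center) @ \
    normalized_residual(np.array([1.2, 1.2]), center)
print(selectivity(u))  # strong positive match is kept and emphasized
```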

For global descriptors, a single descriptor is generated using a neural network. Mathematically, the similarity between two images X and Y, each with a single global descriptor x and y, using the previous notation is

$$ K_{global}(X, Y) = \begin{cases} \sqrt{\sum_{i=1}^{M} (x_i - y_i)^2} & \text{for Euclidean space} \\ x^\top y & \text{for cosine space} \end{cases} \tag{3.6} $$

which measures how close the two descriptors are in the two different spaces.

Several methods have been proposed to generate deep global descriptors for retrieval. Babenko et al. [2] used the output of the fully connected layers of a CNN as the descriptor. Spatial max-pooling of each convolutional activation layer was proposed for visual recognition by Azizpour et al. [1], and sum-pooling was similarly proposed by Babenko & Lempitsky [62]. These methods resemble the feature embedding methods described earlier.

Sum-pooling, which is in practice the same as average pooling since the descriptor is normalized, is a simple method of pooling: the individual embeddings are added together elementwise. For deep features from convolutional activations this is slightly more involved. The method was presented by Babenko & Lempitsky [62] under the name Sum-Pooled Convolutional (SPoC) features. Each deep convolutional feature f_{x,y} is constructed from the convolutional activations in the depth direction of the convolutional block at coordinates (x, y). The set of deep features for image X is denoted by X_f. The embedding ϕ for this method is

$$ \phi(X_f) = \sum_{y=0}^{H-1} \sum_{x=0}^{W-1} f_{x,y} \tag{3.7} $$

where H and W are the height and width of the output channel, respectively. This can be extended with a center prior, which adds a scalar multiplier a_{x,y} in front of f_{x,y} in Equation 3.7. Additionally, the descriptor can be normalized, whitened and normalized again, similarly to the method of this thesis in Chapter 4. [62]
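As a sketch of Eq. 3.7, the snippet below sum-pools a convolutional feature map of shape H × W × K into a single K-dimensional descriptor and L2-normalizes it; the random feature map is only a stand-in for real CNN activations.

```python
import numpy as np

def spoc(feature_map):
    """Sum-pool a (H, W, K) activation block into a K-dim descriptor (Eq. 3.7)."""
    v = feature_map.sum(axis=(0, 1))      # sum over all spatial locations
    return v / np.linalg.norm(v)          # L2 normalization

feature_map = np.random.rand(7, 7, 512)   # stand-in for CNN activations
descriptor = spoc(feature_map)
print(descriptor.shape)                   # (512,)
```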

Another common pooling method is called Maximum Activation of Convolutions (MAC), with its regional variant R-MAC [56]. In MAC pooling, the feature vector is constructed by taking the maximum of each convolutional output channel. Given an input image X, the output of the convolutional network is a 3D block of size H × W × K, where H and W are the height and width of the output channel and K is the number of channels. The MAC descriptor is then constructed with the equation

$$ \phi(X_f) = \max_{\substack{x \in \{0, \dots, W-1\} \\ y \in \{0, \dots, H-1\}}} f_{x,y} \tag{3.8} $$

which is the same as SPoC except that, instead of summing, the maximum over the spatial locations of each channel is selected. The regional variant of this descriptor is similar, but instead of taking the maximum over the whole output channel, the maximum is taken over a given rectangular region. Mathematically, this is defined as

$$ \phi(X_f) = \max_{(x, y) \in R} f_{x,y} \tag{3.9} $$

where R represents all the points (x, y) in a given region. The regions R are pre-defined in a regional grid by placing pre-determined rectangles on the output channel at different scales. The R-MAC descriptor is then the sum of all of these regional MAC descriptors.

Mathematically, this is given by the equation

$$ \phi(X_f) = \sum_{i=0}^{M} \max_{(x, y) \in R_i} f_{x,y} \tag{3.10} $$

where M is the number of regions. [56]
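A compact sketch of MAC and R-MAC pooling is shown below; the three-scale, non-overlapping square region grid is a simplified stand-in for the rigid multi-scale grid of [56].

```python
import numpy as np

def mac(feature_map):
    """MAC (Eq. 3.8): channel-wise maximum over all spatial locations."""
    return feature_map.max(axis=(0, 1))

def rmac(feature_map, region_sizes=(7, 5, 3)):
    """R-MAC (Eq. 3.10): sum of per-region MAC descriptors over a simple grid."""
    H, W, K = feature_map.shape
    descriptor = np.zeros(K)
    for s in region_sizes:                       # square regions of side s
        for y in range(0, H - s + 1, s):
            for x in range(0, W - s + 1, s):
                descriptor += feature_map[y:y + s, x:x + s].max(axis=(0, 1))
    return descriptor / np.linalg.norm(descriptor)

feature_map = np.random.rand(7, 7, 512)
print(mac(feature_map).shape, rmac(feature_map).shape)  # (512,) (512,)
```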

The most recent pooling method, building upon the SPoC and MAC descriptors, is the Generalized Mean (GeM) descriptor. Both average pooling and max-pooling are special cases of GeM pooling. The benefits of this method are that it can be learned and that it is more general than the other two. Radenovic et al. [42] showed that this pooling method significantly improves image retrieval results over the two previous methods. Let F be the set of all output feature vectors f_{x,y}. Otherwise using the same notation, GeM is mathematically expressed as

$$ \phi(X_f) = \left( \frac{1}{|\mathcal{F}|} \sum_{f_{x,y} \in \mathcal{F}} f_{x,y}^{\,p_k} \right)^{\!\frac{1}{p_k}} \tag{3.11} $$

where p_k is a pooling parameter that can be set manually or learned.
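The sketch below implements Eq. 3.11 per channel and shows that p_k = 1 recovers average pooling while a large p_k approaches max-pooling; the single shared p value and the clipping constant are illustrative simplifications.

```python
import numpy as np

def gem(feature_map, p=3.0, eps=1e-6):
    """GeM pooling (Eq. 3.11) of a (H, W, K) activation block, per channel."""
    clipped = np.clip(feature_map, eps, None)      # activations assumed non-negative
    return (clipped ** p).mean(axis=(0, 1)) ** (1.0 / p)

feature_map = np.random.rand(7, 7, 4)
print(np.allclose(gem(feature_map, p=1.0), feature_map.mean(axis=(0, 1))))  # True: average pooling
print(gem(feature_map, p=1000.0) - feature_map.max(axis=(0, 1)))            # near zero: approaches max pooling
```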

For global features, obtained either directly from a CNN or through local feature aggregation, the feature space search is a simple nearest neighbor search. The distance metric is the same as the metric used for training. However, when only the order of the retrieved images matters, a different metric can be used. For instance, even though Euclidean distance is used for training by Radenovic et al. [42], they use cosine distance on the L2-normalized vectors. This does not change the order of the distances even though the relative scores do change, because L2 distances between normalized N-dimensional vectors are monotonic with respect to cosine distances between the same vectors [40]. For all of the methods, the retrieval results are the k closest matches, which with the established notation means the k lowest scores of the function K. The nearest neighbor search is described mathematically as

$$ X = \underset{\forall X \in \mathcal{X}_N}{\arg\min}\; K(X, Y) \tag{3.12} $$

where the image X is the closest match to the query image Y and X_N is the database of images.
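A brief sketch of this nearest neighbor search over L2-normalized global descriptors is given below; it also illustrates why ranking by Euclidean distance and by cosine similarity coincides for normalized vectors. The random database is purely illustrative.

```python
import numpy as np

def l2_normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def retrieve(database, query, k=5):
    """Return indices of the k nearest database descriptors to the query (Eq. 3.12)."""
    db = l2_normalize(database)
    q = l2_normalize(query)
    distances = np.linalg.norm(db - q, axis=1)   # Euclidean distance, lower is better
    return np.argsort(distances)[:k]

database = np.random.randn(1000, 512)            # illustrative global descriptors
query = np.random.randn(512)
top_k = retrieve(database, query)

# Ranking by cosine similarity gives the same order for L2-normalized vectors.
cosine = l2_normalize(database) @ l2_normalize(query)
print(np.array_equal(top_k, np.argsort(-cosine)[:5]))  # True (barring exact ties)
```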

3.2 Regional Search in Image Retrieval

Localization is used to determine regions in an image where an object is located. This differs from local feature extraction in that it defines a larger area of the image from which the features are extracted. The region where the object is located is delimited by a bounding box. Such regions can be utilized in image retrieval in several ways: as a pre-processing step [53], in constructing the feature vector [13], and in re-ranking the retrieval results [56]. Localization is not a necessary step in retrieval, but it can be useful either by improving retrieval results directly or by automating the bounding box generation that improved results rely on. An example of how regional search can aid image retrieval is presented in Figure 3.1.

Briefly, the idea is to first detect regions in the image and then extract features from them.

In early works on regional search for image retrieval, a rigid grid [56] or a random grid [43] was used to determine the regions from which the descriptors are extracted. Two distinct approaches have been proposed for utilizing object detection for regional search in the context of image retrieval. The earlier one, proposed by Gordo et al. [13], uses a Region Proposal Network (RPN) similar to [45] concurrently with the feature extractor to detect the regions from which to extract features. In 2019, Teichmann et al. [53] proposed learning the object detector as a pre-processing step for the feature extractor.
