
Custom Object Detection with Deep Learning and Synthetic Datasets

Faculty of Information Technology and Communication Sciences (ITC)
Supervisor: Pasi Pertilä
Examiner: Olli Suominen
Master of Science Thesis
September 2021


Khoa Nguyen: Custom Object Detection with Deep Learning and Synthetic Datasets
Master of Science Thesis
Master’s Degree Programme in Computing Sciences - Tampere University
Research assistant - 3D Media research group

May 2021

Detecting and localizing tree trunks is a necessary component for automatic harvesting machines. This task can be split into two sub-tasks: object detection and depth estimation. Object detection is a crucial problem that appears often in the fields of computer vision and robotics. The advancements of deep neural networks have brought remarkable improvements to object detection systems. However, deep learning methods require a large amount of data, making custom objects that are not annotated in massive public datasets difficult to detect. This work presents an attempt at addressing the problem of data shortage by proposing a method to construct synthetic datasets automatically for custom objects, and then quantitatively examines the performance of state-of-the-art deep learning models on these synthetic datasets. Furthermore, the depth information of the detected objects is also estimated from stereo images and mapped onto the detected objects. The results show that the synthetic datasets can be used to train the neural networks to detect visually correct tree trunks on unseen new images. The resulting software can be used for any given custom object with a process similar to that of tree trunks.

Keywords: Object detection, synthetic datasets, object localization


Contents

1 Introduction
2 Object Detection and Depth Reconstruction
  2.1 Neural Networks
    2.1.1 Feedforward Neural Networks
    2.1.2 Convolutional Neural Networks
    2.1.3 Backpropagation Algorithm
    2.1.4 Convolutional Neural Networks as Backbones for Feature Extraction
  2.2 Object Detection Using Deep Neural Networks
    2.2.1 COCO Dataset
    2.2.2 Mask R-CNN
    2.2.3 RetinaNet
  2.3 Depth Reconstruction Using Stereo Imaging
    2.3.1 Triangulation
    2.3.2 Stereo Rectification
    2.3.3 Depth Maps from 3D Reprojection
3 Detecting Tree Trunks and Finding Their 3D Coordinates
  3.1 Synthetic Dataset Preparation
    3.1.1 Hard Negative Image Preparation
    3.1.2 Training Image Preparation
    3.1.3 Test Image Preparation
  3.2 Neural Network Hyperparameters and Training Procedure
  3.3 3D Reconstruction
  3.4 Methods Integration
  3.5 Object Detection Evaluation Metrics
4 Results and Discussions
  4.1 Discussions
  4.2 Explorations and Future Works
5 Conclusions
References


1 Introduction

Figure 1.1 Example images of the forest tree dataset. Top-left: the first scene captured by the left camera. Top-right: the first scene captured by the right camera. Bottom-left and -right: another scene taken by the left and right camera, respectively.

Given a dataset of forest images taken by a stereo camera attached to a tree-harvesting vehicle, illustrated in Figure 1.1, the goal of this project is to build a system that can detect the tree trunks and find the relative distance from each detected tree to the camera, or the “3D location” of the tree. Such a system could be used to aid the process of automatic and remote agricultural harvesting.

With this task in mind, one would imagine a system that consists of two frameworks: an object detection framework and a depth estimation framework. Object detection is a key component for many other computer vision tasks and systems, such as image captioning, video surveillance, image information retrieval, robotics, etc.

Figure 1.2 illustrates the differences among problems related to object detection. As opposed to object classification tasks where only the categories of recognized objects in an image are produced, object detection goes one step further to find the spatial location for each instance. Semantic segmentation and object instance segmentation are tasks that aim to assign a label for every pixel in an image. The main difference between these pixel-level labeling tasks is that while semantic segmentation treats multiple objects in a class as a single entity (Figure 1.2(c)), object instance segmentation treats multiple objects of the same class as distinct and separated entities


Figure 1.2 Different recognition problems related to object detection. Figure from (L. Liu et al. 2019).

(Figure 1.2(d)). In this work, the problems of tree trunk detection and instance segmentation are studied. Another important task in helping machines understand their surroundings is 3D reconstruction, or depth estimation, which is the process of recovering depth information from 2D images (Moons, Van Gool, and Vergauwen 2009). It is a core problem in a wide variety of fields, such as computer graphics (R. Jiang et al. 2013), medical image computing (Speidel et al. 2020), or virtual reality (Bruno et al. 2010).

Early computer vision and object detection systems relied on high-quality extracted features, such as gradient-based features (Dalal and Triggs 2005a), the scale-invariant feature transform (SIFT) (D. G. Lowe 1999), or spatial pyramid matching (Lazebnik, Schmid, and Ponce 2006). Furthermore, the classifiers used in these systems also need to be robust enough to achieve a good result (X. Jiang et al. 2019a). In 2012, with the use of deep neural networks (DNNs), Krizhevsky, Sutskever, and Hinton achieved considerably better results than previous state-of-the-art methods on the ImageNet classification task (Krizhevsky, Sutskever, and Hinton 2012). Since then, DNNs have gained more and more popularity for many reasons: DNNs can learn to extract high-order features from data, and these features get better when more training data are provided; DNNs can be constructed as end-to-end systems that combine detection and recognition into one framework, so they can take an image as input and then output the recognized and detected objects (X. Jiang et al. 2019a; Goodfellow, Yoshua Bengio, and Courville 2016). Convolutional Neural Networks (CNNs), a specific type of DNN algorithm that utilizes the convolution operation for processing grid-like topology data (Goodfellow, Yoshua Bengio, and Courville 2016), have greatly improved the performance of image processing systems, such as in (Lin, Goyal, et al. 2018; He, Gkioxari, et al. 2018).

Detecting objects and localizing them in 3D coordinates have been widely applied in robotics systems (Ge et al. 2019). In this project, the task is to build a system that can detect and localize tree trunks in a forest setting from stereo images, which could be seen as an element for building remote-controlled or even fully automated harvesting robots. More specifically, a deep learning approach is chosen for the task of object detection and segmentation, and a classical method is employed for depth estimation. The reason is that deep learning techniques are quite mature for object detection, while they are still at an earlier stage for depth estimation (Yuniarti and Suciati 2019).

One of the problems when using DNNs is that they require a large amount of data, which can take a very long time and a lot of resources to gather (Lin, Maire, et al. 2015). There are many publicly available datasets with millions of images and rich annotations for a wide range of object classes that leverage the effectiveness of DNNs on computer vision tasks. For object detection, there are four famous datasets: PASCAL VOC (Everingham et al. 2014), ImageNet (Deng et al. 2009), MS COCO (Lin, Maire, et al. 2015), and Open Images (Kuznetsova et al. 2020). More detailed information on these datasets is shown in Table 1.1. Up until now, Open Images from Google is the largest existing dataset with object location annotations (Kuznetsova et al. 2020). It contains around 16M bounding boxes for 600 object classes and 2.8M segmentation masks for 350 object classes1.

Dataset | Number of images | Categories | Started year | Highlights
PASCAL VOC (Everingham et al. 2014) | 11 540 | 20 | 2005 | 20 categories of objects that are common in real life; objects are in scene context
ImageNet (Deng et al. 2009) | 14 million + | 21 841 | 2009 | Contains a large number of images and object categories; ILSVRC challenge’s dataset; images are object-centric
MS COCO (Lin, Maire, et al. 2015) | 328 000 + | 91 | 2014 | Richer object annotations; contains object segmentation data (not available in the ImageNet dataset)
Open Images (Kuznetsova et al. 2020) | 9 million + | 600 + | 2017 | 15 million + boxes on 600 categories; 2.5 million + instance segmentations on 350 categories; also contains visual relationships and localized narratives data

Table 1.1 Popular datasets for object recognition


However, given their considerable magnitudes, these datasets still lack annotations for many object categories in real life, including the tree trunks needed in this project. MS COCO does not have a tree trunk category. Open Images has bounding boxes for tree trunks, yet they are not of high quality, and there are no tree instance segmentation annotations. Figure 1.3 shows some low-quality examples of tree annotations from the Open Images dataset. It is clear that these annotations are not sufficient for the task of tree trunk detection.

As we do not have a pre-labeled dataset for tree trunks, constructing a synthetic dataset is a sound option. In this work, we propose a method to construct the tree trunk dataset automatically. The process of constructing the synthetic dataset involves putting foreground tree trunk objects onto different background images, then using contours to find the instance segmentation and box annotations for

1https://storage.googleapis.com/openimages/web/factsfigures.html

Figure 1.3 Low-quality tree annotations from the Open Images dataset.


Figure 1.4 Overview of the software pipeline.

each tree. This process is described in detail in chapter 3. We then use the synthetic dataset to train two deep neural networks, namely Mask R-CNN (He, Gkioxari, et al. 2018) and RetinaNet (Lin, Goyal, et al. 2018), which have achieved state-of-the-art results on object detection and instance segmentation tasks on open datasets. With this process in mind, the thesis tries to answer two research questions:

• Research question 1: can we automatically construct a visually unnatural-looking synthetic dataset that can be used to train a neural network which works on images taken from real-life settings?

• Research question 2: given a test dataset of tree trunks taken from a forest, how do the state-of-the-art deep learning methods trained on the synthetic tree trunk dataset perform on the test dataset quantitatively?

After detecting the 2D locations of the trees in the form of bounding boxes, we compute their 3D locations using stereo images. The 3D location of a tree trunk is mapped onto its corresponding bounding box. To achieve this, we built software to construct a synthetic dataset for an object, then apply the state-of-the-art DNN models for the object detection task and stereo matching for the reconstruction task. The overview of the pipeline is shown in Figure 1.4. This software can be used to create a synthetic dataset for any arbitrary object that does not have publicly available high-quality annotations, with the tree trunk as an example object class. The code for the software is freely available on Github2.

The structure of the thesis is as follows:

Chapter 2 provides the background knowledge needed for the thesis. It starts with an overview of neural networks, then goes deeper into convolutional neural networks (CNNs), the backpropagation algorithm, and common backbone CNN architectures for feature extraction. We also describe the COCO dataset format, which is used to construct the synthetic dataset. After that, we explain the architectures of the employed deep learning models, i.e. Mask R-CNN and RetinaNet.

2https://github.com/DK-Nguyen/custom-object-detection-and-3d-localization


Finally, we present the background for 3D reconstruction using stereo images and triangulation.

Chapter 3 first presents the process used to construct the synthetic dataset. Then, it reports the choice of hyperparameters and the training procedure for the deep neural networks. After training the neural networks and obtaining the detections, we describe the process of finding and mapping depth information to the detected tree trunks. At the end of this chapter, we discuss the object detection metrics used to assess the detection accuracy of the employed neural networks.

In chapter 4, we report the performance of the neural networks with regard to accuracy and time complexity, and we also discuss the limitations as well as future directions for our work.

Finally, chapter 5 gives a brief overview of the whole thesis work, brings up some discussion based on the advantages and disadvantages of the proposed method, and offers some directions for further research and development.


2 Object Detection and Depth Reconstruction

In this chapter, we present some background knowledge on neural networks for the object detection task, as well as the stereo image triangulation method for the depth reconstruction task.

2.1 Neural Networks

Artificial Neural Networks (ANNs) are a set of algorithms inspired by the biological neural networks that constitute animal brains (Goodfellow, Yoshua Bengio, and Courville 2016). An ANN is comprised of a collection of smaller units called “neurons” stacked on top of each other. The neurons in an ANN are heavily interconnected.

ANN algorithms aim to approximate some function f(x) = y, where x is an input vector and y is the ground truth output of the function. For example, x can be a vector containing all the pixels of an image, and y the class that the image belongs to, for example “cat” if the image contains a cat. An ANN can be parameterized as a function of the input x and a set of adaptive parameters θ:

$f(x, \theta) = \hat{y}$,  (2.1)

where θ is modified by using an optimizer (Kingma and J. Ba 2017; Sutskever et al. 2013), and ŷ is an approximation of the ground truth y. An ANN tries to find the best set of parameters θ that results in values of ŷ closest to y.

2.1.1 Feedforward Neural Networks

Feedforward Neural Networks (FNNs) are ANNs in which each neuron in one layer is connected to all neurons in the neighbouring layers. Figure 2.1 (left) shows a simple FNN which contains a hidden layer and an output layer.

The hidden layer takes the input vector x and calculates the hidden unit h as

$h = f^{(1)}(x; W_1, c_1)$.  (2.2)

Commonly, $f^{(1)}$ is an affine transformation of x controlled by W1 and c1, followed by a non-linear activation function g1 (Goodfellow, Yoshua Bengio, and Courville 2016), which could be written as

$h = f^{(1)}(x; W_1, c_1) = g_1(W_1 x + c_1)$,  (2.3)


Figure 2.1 Left: A simple feed forward neural network. Right: the same network but with all neurons in the vectors shown.

where W1 is the weight matrix, and c1 is the bias vector. As can be observed in Figure 2.1 (right), each element hj of h can be calculated as

$h_j = g_1\left(\sum_{i=1}^{n} w_i^{(j)} x_i + c_j\right)$.  (2.4)

There are various options for the activation function g1, such as the Rectified Linear Unit (ReLU), Hyperbolic Tangent (Tanh), Softmax, and Sigmoid (Nwankpa et al. 2018; Agarap 2019). Similarly, the estimated output value ŷ is calculated as

$\hat{y} = f^{(2)}(h; W_2, c_2) = g_2(W_2 h + c_2)$,  (2.5)

where W2 and c2 are the output layer’s learnable parameters, and g2 is another non-linear activation function. Hence, the complete FNN can be expressed as a stack of functions:

$\hat{y} = f(x, \theta) = f(x; W_1, c_1, W_2, c_2) = f^{(2)}(f^{(1)}(x, \theta_1), \theta_2)$,  (2.6)

where $f^{(1)}$ is called the first layer of the FNN with parameters θ1 = {W1, c1}, and $f^{(2)}$ is the second layer, whose parameters are θ2 = {W2, c2} (Goodfellow, Yoshua Bengio, and Courville 2016). The depth of the FNN is determined by the overall length of the function chain. Stacking different layers on top of each other helps FNNs learn to approximate more complicated functions by composing many simpler functions, each of which detects a particular pattern in the complex input signal (Goodfellow, Yoshua Bengio, and Courville 2016). Empirical results also show that neural networks with more layers have better generalization on unseen data across many different tasks (Simonyan and Zisserman 2015; Goodfellow, Yoshua Bengio, and Courville 2016).
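To make the two-layer FNN of Equations 2.2–2.6 concrete, the following is a minimal NumPy sketch of the forward pass; the layer sizes and the choice of ReLU and softmax as the activations g1 and g2 are illustrative assumptions, not values taken from the thesis.

```python
import numpy as np

def relu(z):
    # g1: element-wise non-linear activation
    return np.maximum(0.0, z)

def softmax(z):
    # g2: turns the output layer's scores into class probabilities
    e = np.exp(z - z.max())
    return e / e.sum()

def fnn_forward(x, W1, c1, W2, c2):
    # Equation 2.3: hidden units h = g1(W1 x + c1)
    h = relu(W1 @ x + c1)
    # Equation 2.5: output y_hat = g2(W2 h + c2)
    return softmax(W2 @ h + c2)

# Toy dimensions (assumed): 4-dimensional input, 8 hidden units, 3 classes
rng = np.random.default_rng(0)
W1, c1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, c2 = rng.normal(size=(3, 8)), np.zeros(3)
y_hat = fnn_forward(rng.normal(size=4), W1, c1, W2, c2)
print(y_hat)  # the approximation of the ground truth y
```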

2.1.2 Convolutional Neural Networks

Now, if the input signal to the neural network is two-dimensional, e.g. a grayscale image, the fully connected neural network discussed in the last section can be extended to look like Figure 2.2. As each neuron in an output layer is connected to all neurons from the previous layer, FNNs treat all pixels in the input signal equally, and the number of parameters grows very large as the size of the input signal gets bigger. With this inherent structure, FNNs ignore the signal’s locality property, where local pixels are more similar to each other than far-away pixels.

FNNs also do not take advantage of the input signal’s stationarity property, where certain patterns appear repeatedly throughout the signal (Y. LeCun et al. 1999). Therefore, FNNs are computationally expensive and do not possess built-in invariance to scaling, translations, or geometric distortions of the input (Y. LeCun et al. 1999).

Figure 2.2 Fully connected neural networks on 2D data.


Figure 2.3 Top: the convolution operation with a 2D kernel applied on 2D input data. Bottom: the same kernel moves two units to the right and does the same calculations to produce the next output unit.


A convolutional neural network (CNN) is a special kind of neural network that aims to improve upon the weaknesses of fully connected neural nets. Instead of using general matrix multiplication, CNNs employ the convolution operation between their layers, which is illustrated in Figure 2.3. The input 2D matrix in Figure 2.3 is surrounded by zeros, which is a commonly used trick called “zero padding” to preserve the spatial dimension of the output. The convolution operation between a 2D input I and a 2D kernel K is defined as

$S(i, j) = (K * I)(i, j) = \sum_{m} \sum_{n} I(i+m, j+n)\, K(m, n)$.  (2.7)

The output matrices of a convolution layer are often called “feature maps”, or sometimes “activation maps”, in computer vision (Y. LeCun et al. 1999). With convolution (conv), each neuron in an output layer is only connected to a small set of neighboring neurons in the previous layer. This local neighborhood is often called the “receptive field”. A local receptive field is effective because it exploits the locality property of the input signals (Y. LeCun et al. 1999). Furthermore, the same kernel of weights, e.g. w1, . . . , w9 in Figure 2.3, is swept across the input. This property is often referred to as “weight sharing”, and it effectively reduces the vast number of weights that need to be learned by the network. Weight sharing not only reduces the computational cost of training a neural network, it also helps the network learn better features by discovering repeated patterns across the signal (Goodfellow, Yoshua Bengio, and Courville 2016; Y. LeCun et al. 1999).

Let’s discuss in more detail the dimensions of the input signal, the weight matrix, and the output signal of a convolution layer. If the input to a convolution layer is an RGB image with dimensions [32×32×3], and the receptive field is chosen to be 5×5, then each kernel of the conv layer will have size [5×5×3]. This means that each neuron in the output feature map will be connected to a [5×5×3] volume of the input through a weight kernel of the same size, and this weight kernel is shared across all spatial locations. The size of the output signal also depends on how much zero padding is applied to the input signal and how many units the kernel moves in one step. Generally, if the input to a conv layer has size [W1×H1×D1], the receptive field (or kernel size) is [F×F], there are K kernels used, the spatial step size (also called “stride”) of the kernel is S, and the amount of zero padding is P, then the output of the conv layer will have size [W2×H2×D2], where

$W_2 = \frac{W_1 - F + 2P}{S} + 1, \quad H_2 = \frac{H_1 - F + 2P}{S} + 1, \quad D_2 = K$.  (2.8)

D2 is the depth, or number of channels, of the output activation maps. This concept is nicely explained in a lecture note from a Stanford University course on neural networks and computer vision (cs231n 2021).
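As a quick check of Equation 2.8, the helper below computes the output size of a conv layer; the example mirrors the [32×32×3] input and 5×5 receptive field mentioned above, while the number of kernels, the stride, and the padding are assumed purely for illustration.

```python
def conv_output_size(W1, H1, D1, F, K, S=1, P=0):
    """Output size of a conv layer according to Equation 2.8."""
    W2 = (W1 - F + 2 * P) // S + 1
    H2 = (H1 - F + 2 * P) // S + 1
    D2 = K  # one output channel per kernel
    return W2, H2, D2

# A [32x32x3] RGB input with ten 5x5 kernels, stride 1, no padding (assumed values)
print(conv_output_size(32, 32, 3, F=5, K=10, S=1, P=0))  # -> (28, 28, 10)
```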

1×1 convolution is a trick widely used in modern CNN architectures. It simply means that the kernel is of size 1×1, i.e. a single number, instead of a matrix as in the case of 3×3 convolution. If the stride is chosen to be S = 1 and the padding is P = 0, this 1×1 kernel will convolve over the entire input image pixel by pixel. We can substitute this information into Equation 2.8 and have

$W_2 = \frac{W_1 - 1 + 0}{1} + 1 = W_1, \quad H_2 = \frac{H_1 - 1 + 0}{1} + 1 = H_1, \quad D_2 = K$.  (2.9)

The equations in 2.9 demonstrate the core idea of 1×1 convolution: it is used to keep the spatial dimensions of the input but change the number of output channels by changing the number of kernels K. 1×1 convolution can be used for dimensionality reduction, hence scaling down the computational cost (Szegedy et al. 2014a). It can also be used to manipulate the number of channels of feature maps to build new and deeper network architectures (He, Xiangyu Zhang, et al. 2015; Lin, Dollár, et al. 2017).
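The behaviour described by Equation 2.9 can be verified with a minimal PyTorch sketch; the channel counts and feature map size below are arbitrary assumptions chosen only to show that the spatial dimensions are preserved while the depth changes.

```python
import torch
import torch.nn as nn

# 1x1 convolution: reduce 256 channels to 64 without touching the spatial size
reduce_channels = nn.Conv2d(in_channels=256, out_channels=64,
                            kernel_size=1, stride=1, padding=0)

x = torch.randn(1, 256, 56, 56)   # a feature map of size [B, D1, H1, W1]
y = reduce_channels(x)
print(y.shape)                    # torch.Size([1, 64, 56, 56]) -- D2 = K, W2 = W1, H2 = H1
```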

We have seen how CNNs can significantly reduce the number of parameters that need to be trained compared to simple feedforward nets. There are a few other tricks that further reduce the computational cost while at the same time improving the accuracy of CNNs, such as pooling (Scherer, Müller, and Behnke 2010) and batch normalization (Ioffe and Szegedy 2015), which we are going to discuss next.

Pooling

Convolution layers have been proven to be effective in learning features with different levels of abstraction (Zeiler and Fergus 2013). However, as feature maps record the precise position of features in the input, they are sensitive to small movements in the positions of the features in the input images, for example due to rotation, shifting, etc. Pooling operations were introduced to address this problem, and they are similar to subsampling in signal processing. By creating a downsampled (or pooled) feature map, the pooling operations create a lower-resolution but more compressed version of the input feature map. These compressed feature maps only keep the crucial feature components and add a small amount of invariance to the network (Scherer, Müller, and Behnke 2010). Two examples of the pooling operation are visualized in Figure 2.4.
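A small PyTorch sketch of the max and average pooling illustrated in Figure 2.4; the 2×2 window and stride of 2 are common defaults, assumed here for illustration.

```python
import torch
import torch.nn as nn

x = torch.tensor([[[[ 1.,  2.,  3.,  4.],
                    [ 5.,  6.,  7.,  8.],
                    [ 9., 10., 11., 12.],
                    [13., 14., 15., 16.]]]])  # shape [1, 1, 4, 4]

max_pool = nn.MaxPool2d(kernel_size=2, stride=2)  # keeps the strongest response in each 2x2 window
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)  # averages each 2x2 window

print(max_pool(x))  # [[ 6.,  8.], [14., 16.]]
print(avg_pool(x))  # [[ 3.5,  5.5], [11.5, 13.5]]
```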

The ideas of local receptive fields and weight sharing, together with pooling, form the backbone architectural ideas of CNNs that lead to their effectiveness and prominent presence in a wide range of applications. Nowadays, one can find CNN methods in computer vision (L. Liu et al. 2019), natural language processing (Wang and Gang 2018), speech recognition (Abdel-Hamid et al. 2014), bioinformatics (Min, Lee, and Yoon 2016), and medical information processing (Yamashita et al. 2018), among many other tasks (Alzubaidi et al. 2021).

Figure 2.4 2D max and average pooling.



Batch Normalization

The phenomenon where the distributions of output activations in a neural network change across layers during training is referred to as “internal covariate shift” (Ioffe and Szegedy 2015). As the output of one layer is the input to another, this phenomenon slows down the training process and makes it harder. It has been known that fixing these distributions during training leads to faster convergence (Wiesler and Ney 2011). Batch Normalization, also called Batch Norm or BN, aims to reduce the internal covariate shift by fixing the means and variances of the activations along the batch dimension (Ioffe and Szegedy 2015). More specifically, if a particular activation x in a mini-batch has m values, denoted as B = {x1...m}, then the batch normalization transform can be written as

$BN_{\gamma,\beta} : x_{1...m} \rightarrow y_{1...m}$,  (2.10)

where γ and β are learnable parameters. The process of batch normalization is as follows:

• First, the mean and variance of the mini-batch are calculated as

$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i$,  (2.11)

$\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2$.  (2.12)

• Then, the normalized value $\hat{x}_i$ is calculated as

$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$,  (2.13)

where ϵ is a small number to prevent dividing by 0.

• Finally, the output yi of the batch normalization is scaled and shifted by γ and β:

$y_i = \gamma \hat{x}_i + \beta$.  (2.14)

Batch Norm has proved to be effective in reducing the training time and number of iterations. It also enables training with higher learning rates and less careful initialization, and it can replace Dropout (Ioffe and Szegedy 2015). Recently, new methods have been proposed to improve upon Batch Normalization, for example Group Normalization (Wu and He 2018), Instance Normalization (Ulyanov, Vedaldi, and Lempitsky 2017), and Layer Normalization (J. L. Ba, Kiros, and Hinton 2016).
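A minimal NumPy sketch of the batch normalization transform in Equations 2.11–2.14 (training-time statistics only; the values of γ, β, and ϵ are assumed, and the running averages used at inference time are omitted).

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch-normalize one activation over a mini-batch (Equations 2.11-2.14)."""
    mu = x.mean()                          # Equation 2.11: mini-batch mean
    var = x.var()                          # Equation 2.12: mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # Equation 2.13: normalize
    return gamma * x_hat + beta            # Equation 2.14: scale and shift

batch = np.array([0.5, 2.0, -1.0, 3.5])    # m = 4 values of one activation (toy data)
print(batch_norm(batch))                    # zero mean, unit variance, then scaled by gamma/beta
```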

2.1.3 Backpropagation Algorithm

In section 2.1.1, from the input vector x, the neural network computes the approximated output vector ŷ. This process is called “forward propagation” (Goodfellow, Yoshua Bengio, and Courville 2016). Now, if we have the ground truth output vector y, we can calculate the difference between ŷ and y by using a function J(ŷ, y), as illustrated in Figure 2.5. J is often called the “loss function”, or “cost function”.

The objective of a neural network is to minimize the loss function to produce the best approximation ŷ compared to y. As ŷ is calculated from the fixed input vector x and a set of learnable parameters θ = {W1, c1, W2, c2}, the loss function can be parameterized as J(θ).

In neural networks, the minimal value of the loss function is found through an iterative process called “gradient descent” (Goodfellow, Yoshua Bengio, and Courville 2016). At a point θi, J(θ) increases most quickly along the direction of the gradient ∂J(θ)/∂θi, provided that the gradient exists and is non-zero. The main idea of gradient descent is to minimize J(θ) by repeatedly moving θ in the direction opposite to ∂J(θ)/∂θ (Rumelhart, Hinton, and Williams 1986). The process of calculating the gradients of the loss function with respect to the weights in each layer of a neural network is called “backpropagation” (Rumelhart, Hinton, and Williams 1986).

More specifically, as in Figure 2.5, the gradients of J(θ) with respect to the weight

Figure 2.5 The backpropagation process for the FNNs shown in Figure 2.1.


matrices W2 and W1 are computed using the chain rule as

$\frac{\partial J(\theta)}{\partial W_2} = \frac{\partial J(\theta)}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial W_2}$,  (2.15)

$\frac{\partial J(\theta)}{\partial W_1} = \frac{\partial J(\theta)}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial h} \frac{\partial h}{\partial W_1}$.  (2.16)

Based on the calculated gradients, the weights in the next iteration are updated as

$W_2^{k+1} = W_2^{k} - \gamma \frac{\partial J(\theta)}{\partial W_2^{k}}$,  (2.17)

$W_1^{k+1} = W_1^{k} - \gamma \frac{\partial J(\theta)}{\partial W_1^{k}}$,  (2.18)

where k denotes the iteration number, and γ is a hyperparameter called the “learning rate” that defines how much the weights should be updated in each iteration.

Stochastic Gradient Descent (SGD) refers to the process where we calculate the gradients and update the model’s weights for each training example in the training dataset. However, updating the model so frequently can lead to a very long training process and a noisy gradient signal (Goodfellow, Yoshua Bengio, and Courville 2016). To solve this problem, we can use mini-batch gradient descent, which splits the training dataset into small batches. Each batch of data contains multiple training examples. We only calculate the model error and update its coefficients after each batch. The size of a batch is defined by a hyperparameter called the “batch size”.
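The weight update of Equations 2.17–2.18 combined with mini-batch gradient descent can be sketched in PyTorch as below; the toy data, model sizes, batch size, and learning rate are illustrative assumptions, not the training settings used in this thesis.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy data and a small FNN (assumed sizes)
X, y = torch.randn(256, 4), torch.randint(0, 3, (256,))
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)  # "batch size" hyperparameter

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))
loss_fn = nn.CrossEntropyLoss()                           # the loss function J(theta)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # gamma: the learning rate

for epoch in range(5):
    for xb, yb in loader:                  # one mini-batch per update
        loss = loss_fn(model(xb), yb)      # forward propagation
        optimizer.zero_grad()
        loss.backward()                    # backpropagation: gradients dJ/dW
        optimizer.step()                   # W^{k+1} = W^k - gamma * dJ/dW (Eq. 2.17-2.18)
```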

2.1.4 Convolutional Neural Networks as Backbones for Feature Extraction

In computer vision, a feature is a piece of information that reveals the content of an image, such as edges, points, objects, etc. Before the rise of deep learning methods for feature engineering, handcrafted features such as the scale-invariant feature transform (SIFT) (D. Lowe 2004), the histogram of oriented gradients (HOG) (Dalal and Triggs 2005b), or Haar-like features (Lienhart and Maydt 2002) were mostly used in computer vision systems. However, after the success of convolutional neural networks in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) (Krizhevsky, Sutskever, and Hinton 2012) in 2012, CNNs have become the main backbone method for extracting features for computer vision tasks (Benali Amjoud and Amrouch 2020). Some popular CNN backbone architectures are:

• AlexNet (Krizhevsky, Sutskever, and Hinton 2012): AlexNet consists of 5 convolutional layers and 3 fully connected layers. The Rectified Linear Unit (ReLU) was used for the first time as the activation function instead of the sigmoid and tanh functions to add non-linearity. By attaining first place in the ILSVRC-2012 competition and surpassing the second-best method by a big margin, AlexNet paved the way for the wide adoption of CNNs in computer vision tasks, and it is used to extract features in models such as R-CNN (Girshick et al. 2014) and HyperNet (Kong et al. 2016).

• VGG-16 (Simonyan and Zisserman 2015): inspired by AlexNet, VGG-16 is a deeper network and consists of 16 layers in total, out of which 13 are convolutional layers with ReLU activation, followed by 3 fully connected layers. Instead of using large receptive fields in the first convolutional layers, e.g. 11×11 with stride 4 in AlexNet (Krizhevsky, Sutskever, and Hinton 2012), the authors of VGG-16 use smaller receptive fields of 3×3 with stride 1 for the whole network. VGG-16 also incorporates the use of 1×1 convolutional layers to add more non-linearity without affecting the receptive fields of the convolutional layers. A main contribution of VGG-16 is showing the effectiveness of deeper neural networks with smaller convolution filters. VGG-16 is one of the most popular backbone architectures and is used in models such as Fast R-CNN (Girshick 2015), Faster R-CNN (Ren et al. 2016), and SSD (W. Liu et al. 2016).

• ResNet (He, Xiangyu Zhang, et al. 2015): deeper neural networks are susceptible to the degradation problem, where accuracy becomes saturated and then falls off quickly when the depth of the network increases. ResNet was proposed to solve this problem. The main idea of ResNet is to use skip connections to force the convolutional layers to learn a residual mapping. There are multiple variants of ResNet with different depths. For example, ResNet50 has 50 layers and takes 3.8×10⁹ floating-point operations (FLOPs), ResNet101 consists of 101 layers and 7.6×10⁹ FLOPs, and ResNet152 contains 152 layers and takes 11.3×10⁹ FLOPs. ResNet is the workhorse of many modern object detection frameworks, such as Mask R-CNN (He, Gkioxari, et al. 2018), RetinaNet (Lin, Goyal, et al. 2018), Faster R-CNN (Ren et al. 2016), R-FCN (Dai et al. 2016), etc.

• Feature Pyramid Network (FPN) (Lin, Dollár, et al. 2017): detecting objects at different scales has long been a challenge in computer vision. The pyramid method is one of the standard solutions to this problem (Adelson et al. 1984). However, using pyramid representations in DNNs is time-consuming and computationally expensive. FPN is a fast, computationally efficient feature extractor designed to compute the pyramid feature maps in a fully convolutional fashion. FPN is used in models such as Mask R-CNN (He, Gkioxari, et al. 2018) and RetinaNet (Lin, Goyal, et al. 2018).


Since the DNN methods for object detection used in this work (Mask R-CNN and RetinaNet) have ResNet and FPN as their backbone feature extractors, these architectures will be studied in more detail in the following sections.

Deep Residual Neural Network (ResNet)

Deeper neural networks are shown to be able to enrich the levels of feature representations that can be learned from input images (Zeiler and Fergus 2013). Moreover, (Simonyan and Zisserman 2015) and (Szegedy et al. 2014b) show that depth is a crucial element in the performance of neural networks. However, when the networks get deeper, two main challenges arise, which are the vanishing/exploding gradients (Y. Bengio, Simard, and Frasconi 1994) and the degradation problem (He and Sun 2014).

The vanishing/exploding gradients problem refers to the situation where the gradients of the loss function with respect to the network’s weights become too small or too large, preventing the weights from changing their values. In very deep neural networks, the vanishing/exploding gradients problem hinders convergence from the beginning, and it has been tackled by normalized initialization (Y. A. LeCun et al. 2012; Glorot and Yoshua Bengio 2010) as well as intermediate normalization layers (Ioffe and Szegedy 2015).

After the vanishing/exploding gradients problem was solved, the degradation problem emerged. It leads to higher training error when more layers are added, due to difficulties in propagating information from shallower layers to deeper layers (He and Sun 2014; He, Xiangyu Zhang, et al. 2015). The degradation problem can be explained better using an example: consider two neural networks, one with n layers, and another deeper one with m layers, where m > n. If the deeper network can learn to make its first n layers produce the same representations as the n layers of the shallower network, and each of the remaining m − n layers simply outputs whatever it takes as input without changing anything, which is called “identity mapping”, then the deeper network is expected to perform at least as well as the shallower network. However, experimental results have shown that it is hard for deeper neural networks to learn these identity mappings (He, Xiangyu Zhang, et al. 2015), leading to the degradation problem. In order to address this, ResNet uses shortcut connections as its building block, as illustrated in Figure 2.6. The main idea of shortcut connections is to help information flow unimpeded through the entire network. With this modification, the authors demonstrated that training much deeper networks results in a further increase in accuracy.

If the underlying function that needs to be learned is denoted as H(x), then ResNet lets the non-linear layers learn the residual function F(x), and after that performs the shortcut connection to make H(x) = F(x) + shortcut(x), as shown in


Figure 2.6 A building block of ResNet.

Figure 2.6. In the context of ResNet, x and F(x) are matrices; if their dimensions are the same, the shortcut connection simply copies x, and H(x) can be defined as

$H(x) = F(x; \{W_i\}) + x$,  (2.19)

where F(x; {Wi}) is the residual function to be learned, with its parameters {Wi}. If x and F(x) have different dimensions, then the shortcut function can be implemented as a learnable layer, or

$H(x) = F(x; \{W_i\}) + W_s x$.  (2.20)

For ResNet with 50 and 101 layers, the residual function F is implemented as a stack of three convolution layers, each with batch normalization and ReLU activation. This building block is called the “bottleneck block” as in (He, Xiangyu Zhang, et al. 2015) and is illustrated in Figure 2.7.
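A PyTorch sketch of the bottleneck block of Figure 2.7: a 1×1, 3×3, 1×1 convolution stack with batch normalization, plus the shortcut of Equations 2.19–2.20. The channel counts follow the usual ResNet-50 convention and, like the placement of the final ReLU after the addition, are assumptions for illustration rather than values stated in the thesis.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Residual bottleneck block: H(x) = F(x) + shortcut(x)."""
    def __init__(self, in_ch, mid_ch, out_ch, stride=1):
        super().__init__()
        self.residual = nn.Sequential(                      # F(x): 1x1 -> 3x3 -> 1x1
            nn.Conv2d(in_ch, mid_ch, 1, bias=False), nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        )
        if in_ch != out_ch or stride != 1:
            # Equation 2.20: dimensions differ, so the shortcut is a learnable 1x1 conv (W_s x)
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False), nn.BatchNorm2d(out_ch))
        else:
            # Equation 2.19: the identity shortcut simply copies x
            self.shortcut = nn.Identity()

    def forward(self, x):
        return torch.relu(self.residual(x) + self.shortcut(x))

block = Bottleneck(in_ch=256, mid_ch=64, out_ch=256)
print(block(torch.randn(1, 256, 56, 56)).shape)  # torch.Size([1, 256, 56, 56])
```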

ResNet 50 and 101 are built from these bottleneck blocks, as shown in Figure 2.8. They consist of five convolution stages (stem, res2, res3, res4, res5), followed by a fully connected stage. The first convolution stage consists of a 7×7 convolution layer with batch normalization, ReLU activation and max pooling. The last four convolution stages are implemented by stacking the bottleneck blocks on top of each other. The outputs of the convolution stages then go through an average pooling to produce the downsampled feature maps, which are then flattened into feature vectors. Finally, the fully connected (FC) layer takes these feature vectors to produce the final outputs. The implementations of 50-layer and 101-layer ResNet

Figure 2.7 The bottleneck block of 50-layer and 101-layer ResNet. Conv i×i means a convolution layer with kernel size of i×i.


Figure 2.8 ResNet50 and ResNet101.

from the Detectron2 API can be found on Github1.

Feature Pyramid Network (FPN)

Feature pyramids are important for detecting objects at different scales (Lin, Dollár, et al. 2017). Although, with convolutional layers and pooling, deep CNNs can compute feature maps in a hierarchical and pyramidal structure, these feature maps contain large semantic gaps caused by the different depths. Furthermore, using pyramid representations in DNNs is computationally and memory intensive. To improve upon this matter, FPN (Lin, Dollár, et al. 2017) builds a feature pyramid by exploiting the inherent multi-scale, pyramidal hierarchy of DNNs. The construction of FPN involves a bottom-up pathway and a top-down pathway, as illustrated in Figure 2.9. With this architecture, FPN efficiently constructs feature pyramids from a single-scale image with marginal extra cost.

1https://git.io/JWkqJ

Figure 2.9 FPN overall architecture.


The bottom-up pathway is a backbone CNN for feature extraction. In this thesis and in (Lin, Dollár, et al. 2017), ResNet is used. The way that FPN utilizes ResNet to extract the feature pyramid is demonstrated in Figure 2.10. As explained in 2.1.4, there are 5 convolution stages in ResNet. The output feature maps of the four residual stages (res2, res3, res4, res5) are chosen to be the set of feature maps in the bottom-up pathway for FPN. In Figure 2.10, they are denoted as {C2; C3; C4; C5}. Going deeper into the ResNet, i.e. from C2 to C5, the spatial resolutions of the feature maps decrease by half after each stage. However, lower-resolution feature maps contain better semantic values, since after each convolution stage, more high-level structures are detected by the network (Lin, Dollár, et al. 2017). Using these outputs from the backbone ResNet, FPN adds a top-down pathway. First, M5 is created by putting C5 through a 1×1 convolution. The other feature maps in the top-down pathway, namely {M2; M3; M4}, are created by

$M_i = \mathrm{Up}(M_{i+1}) + \mathrm{Conv}_{1\times1}(C_i)$,  (2.21)

where i ∈ {4, 3, 2} and Up is the nearest-neighbor interpolation upsampling function with a scale factor of 2. Finally, 3×3 convolution layers are applied to all the top-down feature maps to reduce the aliasing effect of upsampling (Lin, Dollár, et al. 2017) and create the final feature maps {P2; P3; P4; P5}. FPN’s implementation can be found at2.

2https://git.io/JW3la
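A minimal sketch of the top-down pathway of Equation 2.21: each lateral 1×1 convolution brings a backbone feature map Ci to a common channel count, the coarser map is upsampled by a factor of 2 with nearest-neighbor interpolation and added, and a 3×3 convolution produces Pi. The backbone channel numbers below are the usual ResNet values and are assumed here for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFPN(nn.Module):
    """Top-down pathway of FPN (Equation 2.21), assuming ResNet feature maps C2..C5."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, c_maps):                       # c_maps = [C2, C3, C4, C5]
        m = self.lateral[-1](c_maps[-1])             # M5 = Conv1x1(C5)
        p_maps = [self.smooth[-1](m)]
        for i in range(len(c_maps) - 2, -1, -1):     # i = 2, 1, 0  (C4, C3, C2)
            # M_i = Up(M_{i+1}) + Conv1x1(C_i)
            m = F.interpolate(m, scale_factor=2, mode="nearest") + self.lateral[i](c_maps[i])
            p_maps.insert(0, self.smooth[i](m))      # P_i = Conv3x3(M_i)
        return p_maps                                # [P2, P3, P4, P5]

fpn = TopDownFPN()
c2, c3, c4, c5 = (torch.randn(1, c, s, s) for c, s in [(256, 64), (512, 32), (1024, 16), (2048, 8)])
for p in fpn([c2, c3, c4, c5]):
    print(p.shape)   # all have 256 channels, with spatial sizes 64, 32, 16, 8
```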

2.2 Object Detection Using Deep Neural Networks

Object detection is one of the most fundamental tasks in computer vision and has a long history of active research, dating back several decades (Fischler and Elschlager 1973). Object detection aims to determine the spatial location and the category of an object, such as a human, car, or dog, if it is present in a digital image or video (Dasiopoulou et al. 2005; Xin Zhang et al. 2013). Object detection supports a wide range of applications, including intelligent video surveillance, security, robot vision, autonomous driving, consumer electronics, content-based image retrieval, augmented reality, and human-computer or computer-computer interactions, just to name a few (L. Liu et al. 2019).

In the past decade, deep learning (DL) techniques have risen to be a powerful method for learning feature representations from data, including 2D data from images (Hinton and Salakhutdinov 2006; Goodfellow, Yoshua Bengio, and Courville 2016). Object detection using deep learning can be organized into two main categories:

• Two-stage detection: DNN architectures that solve the object detection task in two stages, each employing a different neural network. The first stage involves extracting proposals for possible objects. In the second stage, these proposals are classified into specific object categories. Popular two-stage detection methods are SPP-net (He, Xiangyu Zhang, et al. 2014), R-CNN (Girshick et al. 2014), Fast R-CNN (Girshick 2015), Faster R-CNN (Ren et al. 2016), and Mask R-CNN (He, Gkioxari, et al. 2018).

• One-stage detection: refers to DNN architectures that simultaneously predict object location and object category. Since one-stage methods only utilize a single neural network, they offer faster detection speed compared to two-stage methods; however, most one-stage methods perform worse in terms of prediction accuracy (L. Liu et al. 2019; X. Jiang et al. 2019b). OverFeat (Sermanet et al. 2014), YOLO (Redmon et al. 2016), SSD (W. Liu et al. 2016), and RetinaNet (Lin, Goyal, et al. 2018) are some representative methods in this category. RetinaNet, introduced in 2018, was able to match the speed of one-stage methods while surpassing the accuracy of the state-of-the-art two-stage methods at that time (Lin, Goyal, et al. 2018).

Figure 2.10 FPN with ResNet backbone.

This thesis employs two DNN architectures: Mask R-CNN (He, Gkioxari, et al. 2018) that represents the two-stage method, and RetinaNet (Lin, Goyal, et al. 2018) that represents the one-stage method.

2.2.1 COCO Dataset

In this thesis work, the data used to train the object detection deep learning methods follow the Microsoft COCO format. The COCO dataset is a large-scale dataset that was introduced in 2015, focusing mainly on three research problems in scene understanding: object detection in non-iconic views, object localization in 2D space, and multi-object contextual reasoning (Lin, Maire, et al. 2015). The COCO dataset contains about 330k images, with more than 2.5 million object instances annotated. The annotation pipeline is split into three tasks: category labeling, instance spotting, and instance segmentation, and it was conducted using workers on Amazon’s Mechanical Turk (Lin, Maire, et al. 2015). The annotations of the COCO dataset are formatted in JSON3 and consist of several annotation types for different applications: object detection, keypoint detection, panoptic segmentation, etc. For object detection, the annotation is a collection of 5 fields: “info”, “licenses”, “images”, “annotations”, “categories”, as shown in Figure 2.11. More specifically, we have:

• The “info” section contains high level information about the dataset, such as the description, date created, version.

• The “licenses” section contains a list of image licenses that apply to the images in the dataset.

• The “images” section contains the list of images, each one has a unique id, together with its width, height, name, date captured, etc.

• The “annotations” section contains important information for training the DNNs, where the “image_id” field shows the id of the corresponding image in the dataset. The “category_id” corresponds to a single category specified in the “categories” section. The “segmentation” field is a list of polygon vertices around the object that indicate the regions of interest, used when the field “iscrowd” is 0. The “segmentation” field can also be a run-length-encoded (RLE) bit mask, if “iscrowd” is 1, which is not used in this project. “area” shows the annotation area measured in pixels. The bounding boxes for ground-truth objects specified in “bbox” have the format [top left x position, top left y position, width, height].

3https://cocodataset.org/#format-data


Figure 2.11 The COCO dataset annotation format

Figure 2.12 Some examples of images annotated with the COCO dataset format.

• The “categories” section contains the list of categories for annotated objects, each one with an id and name, such as “2” and “bicycle”, as well as its “supercategory”, such as “vehicle”.
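To make the fields above concrete, here is a minimal, hand-written example of a COCO-style object detection annotation expressed as a Python dictionary; the ids, file name, polygon vertices, and category values are invented purely for illustration and are not taken from the actual synthetic dataset.

```python
import json

coco_annotations = {
    "info": {"description": "Synthetic tree trunk dataset (toy example)", "version": "1.0"},
    "licenses": [{"id": 1, "name": "example license"}],
    "images": [
        {"id": 1, "file_name": "forest_000001.png", "width": 1024, "height": 768}
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,            # points to the image above
            "category_id": 1,         # points to "tree_trunk" below
            "iscrowd": 0,             # 0 -> "segmentation" is a list of polygons
            "segmentation": [[400.0, 100.0, 460.0, 100.0, 460.0, 700.0, 400.0, 700.0]],
            "area": 36000.0,          # annotation area in pixels
            "bbox": [400.0, 100.0, 60.0, 600.0]  # [top-left x, top-left y, width, height]
        }
    ],
    "categories": [{"id": 1, "name": "tree_trunk", "supercategory": "plant"}],
}

with open("annotations.json", "w") as f:
    json.dump(coco_annotations, f, indent=2)
```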

Figure 2.12 shows some examples of annotated images in the COCO dataset format. We can see that the bike and each elephant has its own bounding box and segmentation annotations.


2.2.2 Mask R-CNN

Mask R-CNN (He, Gkioxari, et al. 2018) is a two-stage DNN method for instance segmentation and object detection. Introduced in 2017, Mask R-CNN was the winner of the COCO challenge on the object detection and instance segmentation tasks. Since then, Mask R-CNN has been used widely in many applications, for example in landslide detection (Ullo et al. 2020), object detection using optical remote sensing images (Mahmoud et al. 2020), or 3D pulmonary diagnosis (Cai et al. 2020).

The major components of Mask R-CNN are:

• Backbone network: a base CNN model for feature extraction. In this thesis, the backbone network is ResNet-FPN.

• Region Proposal Network (RPN): a network responsible for identifying regions in images that potentially contain objects that need to be detected, called Regions of Interest (RoI).

• Detection networks: contain RoI Pooling and neural network heads that create the final predictions. RoI Pooling is a module responsible for extracting features from the backbone network based on the RoI proposals output by the RPN. From there, the box head generates predicted bounding boxes and classes, and the mask head produces predicted masks for the detections.

The overall architecture of Mask R-CNN with ResNet-FPN backbone is illustrated in Figure 2.13. The implementation of Mask R-CNN can be found on Github4.

Figure 2.13 Mask R-CNN model architecture.

4https://git.io/JW3G0


Backbone Network (using Feature Pyramid Network)

In Mask R-CNN, FPN with a ResNet backbone is used to extract the feature hierarchy from an input image. These feature maps are then used in the later stages, i.e. the RPN and the detection networks. First, FPN takes an image of size [B, Cin, H, W], where B is the number of images per batch (batch size), Cin is the number of channels of the input images, e.g. Cin = 1 for grayscale images and Cin = 3 for RGB images, and H and W are the height and width of the images. The output of FPN is the set of feature maps {P2; P3; P4; P5; P6} with the same number of channels; the spatial resolutions decrease by half going up the levels of the pyramid. More specifically, the dimensions of the feature maps are: P2: [B, Cout, H/4, W/4], P3: [B, Cout, H/8, W/8], …, P6: [B, Cout, H/64, W/64]. By default, Cout = 256. The internal mechanism of how FPN works is explained in section 2.1.4.

Region Proposal Network

Region Proposal Network (RPN) was first introduced in Faster R-CNN (Ren et al. 2016) and allows the proposal of regions of interest to be a fully trainable process. In Mask R-CNN, RPN takes the feature maps {P2; P3; P4; P5; P6} produced by FPN as input and outputs a set of proposed bounding boxes, each with an objectness score that measures the existence of an object versus background. Furthermore, RPN also uses the ground truth bounding boxes from the training dataset to calculate the losses.

Anchors are bounding boxes of various sizes from which the proposal RoIs will be chosen. First, for each feature map in {P2; P3; P4; P5; P6}, a set of anchors is created. It is designed so that multiple anchors share the same center. As in (Lin, Dollár, et al. 2017), the three anchor ratios are {1 : 2, 1 : 1, 2 : 1}. Figure 2.14 (left) shows some example anchors for P2 and P3. The square anchors (ratio 1 : 1) for P2 have a size of 32×32. The anchor sizes double after each layer from P2 to P6, that is, 64×64 for P3, 128×128 for P4, and so on. Then, the anchors are placed on the corresponding grid cells, whose sizes are the same as the feature maps, as illustrated in Figure 2.14 (right).

To determine the similarities of the anchors to ground truth boxes, a set of pairwise Intersection over Union (IoU) (Jaccard 1901) scores is calculated. In the context of object detection, IoU is defined as the fraction whose numerator is the intersection area of the predicted bounding box Bp and the ground truth bounding box Bgt, and whose denominator is the area of their union (Padilla, Netto, and Silva 2020), or

$J(B_p, B_{gt}) = \mathrm{IoU} = \frac{\mathrm{area}(B_p \cap B_{gt})}{\mathrm{area}(B_p \cup B_{gt})}$.  (2.22)


Figure 2.14 Left: example anchors for P2 and P3 feature maps. Right: placing the anchors for P5 on the grid with the same size of the P5 feature map.

IoU can be visualized as in Figure 2.15, and its value lies between 0 and 1. If the IoU is bigger than 0.7, the anchor box is assigned to be ‘foreground’ (class ‘1’, denoted p = 1). If the IoU is smaller than 0.3, the anchor is labeled as ‘background’ (class ‘0’, denoted p = 0). Otherwise, the anchor is ignored (class ‘-1’).
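A small sketch of Equation 2.22 together with the foreground/background thresholds mentioned above; the boxes are assumed to be given in the COCO-style [x, y, width, height] format, and the example values are arbitrary.

```python
def iou(box_p, box_gt):
    """Intersection over Union of two boxes in [x, y, width, height] format (Equation 2.22)."""
    xp1, yp1, xp2, yp2 = box_p[0], box_p[1], box_p[0] + box_p[2], box_p[1] + box_p[3]
    xg1, yg1, xg2, yg2 = box_gt[0], box_gt[1], box_gt[0] + box_gt[2], box_gt[1] + box_gt[3]
    # Intersection rectangle (zero if the boxes do not overlap)
    iw = max(0.0, min(xp2, xg2) - max(xp1, xg1))
    ih = max(0.0, min(yp2, yg2) - max(yp1, yg1))
    inter = iw * ih
    union = box_p[2] * box_p[3] + box_gt[2] * box_gt[3] - inter
    return inter / union if union > 0 else 0.0

def anchor_label(anchor, gt_box):
    """RPN labelling rule: 1 = foreground, 0 = background, -1 = ignored."""
    score = iou(anchor, gt_box)
    if score > 0.7:
        return 1
    if score < 0.3:
        return 0
    return -1

print(iou([0, 0, 10, 10], [5, 0, 10, 10]))  # 0.333... (intersection 50, union 150)
```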

The parameters that represent the relation of an anchor to the corresponding ground truth bounding box, called ‘anchor deltas’, are then calculated as

$t^*_x = (x - x_a)/w_a, \quad t^*_y = (y - y_a)/h_a, \quad t^*_w = \log(w/w_a), \quad t^*_h = \log(h/h_a)$,  (2.23)

where x, y are the ground truth box’s center coordinates, w, h are its width and height, and xa, ya, wa, ha are the anchor’s corresponding measurements. This is demonstrated in Figure 2.16.
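A sketch of the anchor-delta encoding of Equation 2.23, with boxes given as (center x, center y, width, height); the numeric values are arbitrary toy numbers.

```python
import math

def encode_deltas(gt_box, anchor):
    """Anchor deltas of Equation 2.23; boxes are (center x, center y, width, height)."""
    x, y, w, h = gt_box
    xa, ya, wa, ha = anchor
    tx = (x - xa) / wa
    ty = (y - ya) / ha
    tw = math.log(w / wa)
    th = math.log(h / ha)
    return tx, ty, tw, th

# A 64x64 anchor and a ground truth trunk box that is slightly shifted and taller (toy values)
print(encode_deltas(gt_box=(110.0, 130.0, 60.0, 128.0), anchor=(100.0, 120.0, 64.0, 64.0)))
```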

The feature maps {P2; P3; P4; P5; P6} are then fed to the network one by one, as illustrated in Figure 2.17. First, each feature map goes through a 3×3 convolution layer which has 256 filters by default. Then, from the bounding box classification branch, which is a 1×1 conv layer, RPN outputs the predicted objectness scores. These output scores have dimension [B, 3, Hi, Wi], where B is the batch size and Hi, Wi

Figure 2.15 Visualization of IoU.


Figure 2.16 Illustration for anchor delta calculations.

are the height and width of the corresponding input feature map, and 3 indicates that at each pixel there are 3 anchors. The bounding box regression branch outputs the predicted anchor deltas:

$t_x = (x - x_a)/w_a, \quad t_y = (y - y_a)/h_a, \quad t_w = \log(w/w_a), \quad t_h = \log(h/h_a)$,  (2.24)

where x, y, w, h are the predicted box’s center coordinates, its width, and its height, respectively. The predicted anchor deltas have dimension [B, 3·4, Hi, Wi], where 3·4 indicates that at each pixel there are 3 predicted vectors, each of length 4.

The bounding box regression loss between a predicted bounding box and a ground truth bounding box is a smooth L1 loss, defined in (Girshick 2015) as

$L^{rpn}_{box}(t, t^*) = \sum_{i \in \{x,y,w,h\}} \mathrm{smooth}_{L_1}(t_i - t^*_i)$,  (2.25)

Figure 2.17 Region Proposal Network.


where

$\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$

The proposal boxes are generated by combining the predicted anchor deltas with the anchors. A post-processing step is then applied to filter the proposal boxes using non-maximum suppression. Finally, during training, the 2000 highest-scoring boxes are chosen to be the final outputs of RPN, and during testing, 1000 boxes are chosen.
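A direct transcription of the regression loss of Equation 2.25 and its smooth L1 term; the delta values in the example are invented toy numbers.

```python
def smooth_l1(x):
    # Piecewise definition used in Equation 2.25
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def rpn_box_loss(t_pred, t_gt):
    """L_box^rpn: sum of smooth L1 terms over (tx, ty, tw, th)."""
    return sum(smooth_l1(p - g) for p, g in zip(t_pred, t_gt))

print(rpn_box_loss((0.2, -0.1, 0.05, 1.5), (0.0, 0.0, 0.0, 0.0)))  # 0.02 + 0.005 + 0.00125 + 1.0
```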

The implementation of RPN can be found at5.

Region of Interest (RoI) Heads

RoI Heads is the final stage of Mask R-CNN; it aims to produce the predicted bounding boxes with assigned class as well as the predicted masks for the detected objects. The inputs of RoI Heads are the set of feature maps {P2;P3;P4;P5} and the proposal boxes from RPN.

First, the proposal boxes are used to crop features from the feature maps. As there are multiple feature maps with different sizes, the boxes are assigned to the appropriate feature map Pk according to the rule from the FPN paper (Lin, Dollár, et al. 2017)

$k = \left\lfloor k_0 + \log_2\!\left(\frac{\sqrt{\text{proposal box area}}}{224}\right) \right\rfloor$,  (2.26)

where ⌊x⌋ is the floor function, 224 is the canonical pre-training size of ImageNet, and k0 is the target level onto which an RoI with area = 224² should be mapped. According to (Lin, Dollár, et al. 2017), k0 is set to 4 as they use C4 as the single-scale feature map. Intuitively, if the box area is smaller than 224², for example a proposal box area of 112², then the box should be mapped to level k = 3, which has finer resolution. This process is demonstrated in the first step of Figure 2.18.
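Equation 2.26 as a small function, followed by the 112² example from the text; k0 = 4 follows the text above, while clamping k to the available levels P2–P5 is an assumption taken from the usual FPN convention.

```python
import math

def assign_fpn_level(box_area, k0=4, canonical_size=224, k_min=2, k_max=5):
    """FPN level assignment of Equation 2.26, clamped to the available levels P2..P5."""
    k = math.floor(k0 + math.log2(math.sqrt(box_area) / canonical_size))
    return max(k_min, min(k_max, k))

print(assign_fpn_level(224 ** 2))  # 4: an RoI of the canonical size maps to P4
print(assign_fpn_level(112 ** 2))  # 3: a smaller box maps to the finer level P3
```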

The implementation of the assigning function can be found via this link6. Since the proposal boxes are given at the scale of the input image, they need to be scaled down to the size of the corresponding feature map. Additionally, proposal boxes can have different shapes, sizes and ratios. RoI Pool (Girshick 2015) aims to extract fixed-size feature matrices from the feature maps. RoIAlign was proposed in Mask R-CNN as an improved version of RoI Pool that can accurately crop the features by proposal boxes with floating-point coordinates. The results of RoI Align are small feature maps of the same size, as illustrated in Figure 2.18.

After RoI Align, the cropped features are fed to the head networks. For Mask R-CNN, there are two head branches: the box head that does bounding box regression and classification, and the mask head that predicts the segmentation masks. The

5https://git.io/JW3O2

6https://git.io/JWaSU


Figure 2.18 RoI Align.

cropped feature maps (output of RoI Align) for the box head by default have a spatial resolution of 7×7, and for the mask head they are 14×14.

The box head consists of four fully-connected layers: two for extracting features, one for classification and one for bounding box regression, as illustrated in Figure 2.19. It takes the cropped feature maps from RoI Align and feeds them to the FC layers. Before being fed to the FC layers, the cropped feature maps are flattened using PyTorch’s flatten function7. Then, the first two FC layers map the number of channels from 256 to 1024. The final output matrix that contains the predicted classification scores has size [Ndet, Nc + 1], where Ndet is the number of bounding box detections and Nc is the number of classes in the training dataset. We have Nc + 1 final classes because there is a default background class. The output matrix for the predicted boxes has size [Ndet, Nc · 4]. The boxes then go through a post-processing step, where during inference, only the top 100 boxes are kept as valid

7https://pytorch.org/docs/stable/generated/torch.nn.Flatten.html

Figure 2.19 Mask R-CNN’s box head.


Figure 2.20 Mask R-CNN’s mask head.

detections. Similar to RPN, the bounding box regression loss is the smooth L1 loss as in (Ren et al. 2016), and the classification loss is the softmax cross entropy loss. These losses are then added to the losses of RPN to produce Mask R-CNN’s final classification and regression losses, i.e. $L_{box} = L^{rpn}_{box} + L^{head}_{box}$ and $L_{cls} = L^{rpn}_{cls} + L^{head}_{cls}$. The implementation of the box head can be found at8.

Figure 2.20 shows the architecture of Mask R-CNN’s mask head. First, the output of RoI Align goes through a stack of four 3×3 convolution layers with ReLU activations, each with 256 filters by default. The output feature maps are then upsampled by a deconvolution layer, before finally going through a 1×1 convolution layer which has Nc filters. The final output of the mask head is Nc predicted masks for each RoI, each with a spatial dimension of 28×28. The loss Lmask is the average binary cross-entropy loss. The total multi-task loss of Mask R-CNN for each RoI is L = Lbox + Lcls + Lmask (He, Gkioxari, et al. 2018). During the inference phase, for each RoI, only the mask corresponding to the class predicted by the box head is used. The implementation of the mask head can be found at9.

2.2.3 RetinaNet

As mentioned in 2.2, RetinaNet (Lin, Goyal, et al. 2018) belongs to the one-stage DL object detection methods. The architecture of RetinaNet comprises a backbone network, an anchor generator, and subnetworks for classification and regression, as illustrated in Figure 2.21. Similar to Mask R-CNN, RetinaNet also employs the FPN-ResNet backbone for extracting the feature pyramid. Unlike Mask R-CNN, RetinaNet does not have a region proposal network to filter out good regions of interest. Therefore, RetinaNet’s detectors have to evaluate thousands of candidate locations per image directly, but only a few of them contain the target objects. This causes a problem of large class imbalance during training.

RetinaNet tries to address the class imbalance problem using a custom loss function called Focal Loss (FL) (Lin, Goyal, et al. 2018), which is an extension of the

8https://git.io/JWp0y

9https://git.io/J4rDy


Figure 2.21 RetinaNet’s architecture.

cross entropy (CE) loss. For binary classification, CE is defined as

$CE(p, y) = \begin{cases} -\log(p) & \text{if } y = 1 \\ -\log(1-p) & \text{otherwise} \end{cases}$,  (2.27)

where y ∈ {±1} is the ground truth class, and p ∈ [0, 1] is the model’s output probability for the class y = 1. For notational convenience, pt is defined as

$p_t = \begin{cases} p & \text{if } y = 1 \\ 1-p & \text{otherwise} \end{cases}$.  (2.28)

Therefore, CE becomes

$CE(p, y) = CE(p_t) = -\log(p_t)$.  (2.29)

To balance between the positive and negative classes, a balancing term α ∈ [0, 1] was introduced for class 1 and (1−α) for class −1. αt is defined similarly to pt. The α-balanced CE is written as

$CE(p_t) = -\alpha_t \log(p_t)$,  (2.30)

where in practice α can be set to the inverse class frequency (Lin, Goyal, et al. 2018).

While α can balance the importance of the positive/negative classes, it still fails to differentiate between easily and hardly classified examples. Easily classified examples could be the anchor boxes that contain background, and as there are many more of them than boxes that contain the objects, they can dominate the loss and overwhelm the gradients, leading to divergent training. To tackle this problem, a modulating factor $(1-p_t)^{\gamma}$ is added to the cross entropy loss, forming the FL loss:

$FL(p_t) = -\alpha_t (1-p_t)^{\gamma} \log(p_t)$,  (2.31)
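A minimal sketch of the focal loss of Equations 2.28–2.31 for a single binary prediction; α = 0.25 and γ = 2 are the default values reported in (Lin, Goyal, et al. 2018), and the probabilities in the example are toy numbers.

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Focal loss FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t) (Equation 2.31)."""
    p_t = p if y == 1 else 1.0 - p             # Equation 2.28
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# An easily classified background example contributes almost nothing, a hard one dominates
print(focal_loss(p=0.05, y=-1))  # easy negative: tiny loss
print(focal_loss(p=0.95, y=-1))  # hard negative: much larger loss
```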
