

2.1.4 Convolutional Neural Networks as Backbones for Feature Extraction

In computer vision, a feature is a piece of information that reveals the content of an image, such as edges, points, or objects. Before the rise of deep learning, handcrafted features such as the scale-invariant feature transform (SIFT) (D. Lowe 2004), the histogram of oriented gradients (HOG) (Dalal and Triggs 2005b) or Haar-like features (Lienhart and Maydt 2002) were mostly used in computer vision systems. However, after the success of convolutional neural networks in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) (Krizhevsky, Sutskever, and Hinton 2012) in 2012, CNNs have become the main backbone method for extracting features in computer vision tasks (Benali Amjoud and Amrouch 2020). Some popular CNN backbone architectures are:

• AlexNet (Krizhevsky, Sutskever, and Hinton 2012): AlexNet consists of 5 convolutional layers and 3 fully connected layers. The Rectified Linear Unit (ReLU) was used for the first time as the activation function, instead of the sigmoid and tanh functions, to add non-linearity. By attaining first place in the ILSVRC-2012 competition and surpassing the second-best method by a large margin, AlexNet paved the way for the wide adoption of CNNs in computer vision tasks, and it is used to extract features in models such as R-CNN (Girshick et al. 2014) and HyperNet (Kong et al. 2016).

• VGG-16 (Simonyan and Zisserman 2015): inspired by AlexNet, VGG-16 is a deeper network consisting of 16 layers in total, of which 13 are convolutional layers with ReLU activation, followed by 3 fully connected layers. Instead of using large receptive fields in the first convolutional layers, e.g. 11×11 with stride 4 in AlexNet (Krizhevsky, Sutskever, and Hinton 2012), the authors of VGG-16 use smaller receptive fields of 3×3 with stride 1 throughout the whole network (a short sketch of this idea follows after this list). VGG-16 also incorporates 1×1 convolutional layers to add more non-linearity without affecting the receptive fields of the convolutional layers. A main contribution of VGG-16 is showing the effectiveness of deeper neural networks with smaller convolution filters. VGG-16 is one of the most popular backbone architectures and is used in models such as Fast R-CNN (Girshick 2015), Faster R-CNN (Ren et al. 2016) and SSD (W. Liu et al. 2016).

• ResNet (He, Xiangyu Zhang, et al. 2015): deeper neural networks are susceptible to the degradation problem, where accuracy becomes saturated and then falls off quickly as the depth of the network increases. ResNet was proposed to solve this problem. The main idea of ResNet is to use skip connections to force the convolutional layers to learn a residual mapping. There are multiple variants of ResNet with different depths. For example, ResNet50 has 50 layers and takes 3.8×10⁹ floating point operations (FLOPs), ResNet101 consists of 101 layers and 7.6×10⁹ FLOPs, and ResNet152 contains 152 layers and takes 11.3×10⁹ FLOPs. ResNet is the workhorse of many modern object detection frameworks, such as Mask R-CNN (He, Gkioxari, et al. 2018), RetinaNet (Lin, Goyal, et al. 2018), Faster R-CNN (Ren et al. 2016) and R-FCN (Dai et al. 2016).

• Feature Pyramid Network (FPN) (Lin, Dollár, et al. 2017): detecting objects at different scales has long been a challenge in computer vision. Pyramid methods are one of the standard solutions to this problem (Adelson et al. 1984). However, using pyramid representations in DNNs is time-consuming and computationally expensive. FPN is a fast, computationally efficient feature extractor designed to compute the pyramid feature maps in a fully convolutional fashion. FPN is used in models such as Mask R-CNN (He, Gkioxari, et al. 2018) and RetinaNet (Lin, Goyal, et al. 2018).

Since the DNN methods for object detection used in this work (Mask R-CNN and RetinaNet) have ResNet and FPN as their backbone feature extractors, these architectures will be studied in more detail in the following sections.
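As an illustration of what using a CNN as a feature-extraction backbone looks like in code, the following is a minimal sketch. It assumes a PyTorch/torchvision environment (not prescribed by this section), and the exact weights argument may differ between torchvision versions; the pretrained ResNet-50 classification head is simply dropped so that the convolutional stages output feature maps instead of class scores.

```python
import torch
import torchvision

# Load an ImageNet-pretrained ResNet-50 and keep only its convolutional
# stages, discarding the average pooling and fully connected classifier.
resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])
backbone.eval()

with torch.no_grad():
    image = torch.randn(1, 3, 224, 224)   # dummy RGB image
    features = backbone(image)            # feature map of shape (1, 2048, 7, 7)
print(features.shape)
```

Detection frameworks such as Mask R-CNN and RetinaNet consume feature maps of this kind, typically taken from several stages at once through FPN, rather than the final classification scores.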

Deep Residual Neural Network (ResNet)

Deeper neural networks are shown to be able to enrich the levels of feature representations that can be learned from input images (Zeiler and Fergus 2013). Moreover, (Simonyan and Zisserman 2015) and (Szegedy et al. 2014b) show that depth is a crucial element in the performance of neural networks. However, when the networks get deeper, two main challenges arise: the vanishing/exploding gradients problem (Y. Bengio, Simard, and Frasconi 1994) and the degradation problem (He and Sun 2014).

The vanishing/exploding gradients problem refers to the situation where the gradients of the loss function with respect to the network's weights become too small or too large, preventing the weights from being updated effectively. In very deep neural networks, the vanishing/exploding gradients problem hinders convergence from the beginning, and it has been tackled by normalized initialization (Y. A. LeCun et al. 2012; Glorot and Yoshua Bengio 2010) as well as intermediate normalization layers (Ioffe and Szegedy 2015).
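As a small illustrative sketch of these two remedies (PyTorch is assumed; the specific layer sizes are arbitrary), normalized initialization and an intermediate normalization layer can be combined as follows:

```python
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch):
    """3x3 convolution with normalized (Glorot/Xavier) initialization,
    followed by an intermediate batch normalization layer and a ReLU."""
    conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)
    nn.init.xavier_uniform_(conv.weight)   # normalized initialization
    return nn.Sequential(conv, nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

block = conv_bn_relu(64, 128)   # arbitrary channel widths, for illustration only
```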

After the vanishing/exploding gradients problem was addressed, the degradation problem emerged. It leads to higher training error when more layers are added, due to difficulties in propagating information from shallower layers to deeper layers (He and Sun 2014; He, Xiangyu Zhang, et al. 2015). The degradation problem can be explained with an example: consider two neural networks, one with n layers and a deeper one with m layers, where m > n. If the deeper network can learn to make its first n layers produce the same representations as the n layers of the shallower network, and each of the remaining m − n layers simply outputs whatever it takes as input without changing anything (a so-called "identity mapping"), then the deeper network is expected to perform at least as well as the shallower network. However, experimental results have shown that it is hard for deeper neural networks to learn these identity mappings (He, Xiangyu Zhang, et al. 2015), leading to the degradation problem. In order to address this, ResNet uses shortcut connections as its building block, as illustrated in Figure 2.6. The main idea of shortcut connections is to help information flow unimpeded through the entire network. With this modification, the authors demonstrated that training much deeper networks results in a further increase in accuracy.
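To make the shortcut-connection idea concrete, below is a minimal sketch (in PyTorch, for illustration only and not tied to any particular ResNet variant) of a two-layer residual building block in the spirit of Figure 2.6: the stacked layers learn a residual, which is added back to the block input before the final activation.

```python
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """F(x): two 3x3 convolutions with batch norm; the block output is F(x) + x."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.relu(self.bn1(self.conv1(x)))
        residual = self.bn2(self.conv2(residual))
        return self.relu(residual + x)   # identity shortcut: H(x) = F(x) + x
```

If the identity mapping is already close to optimal, the stacked layers only need to drive the residual toward zero, which is easier than learning an identity mapping from scratch.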

If the underlying function that needs to be learned is denoted as H(x), then ResNet lets the non-linear layers learn the residual function F(x) and then applies the shortcut connection to form H(x) = F(x) + shortcut(x), as shown in Figure 2.6.

Figure 2.6 A building block of ResNet.

In the context of ResNet, x and F(x) are matrices; if their dimensions are the same, the shortcut connection simply copies x, and H(x) can be defined as

H(x) = F(x; {W_i}) + x, (2.19)

where F(x; {W_i}) is the residual function to be learned, with its parameters {W_i}. If x and F(x) have different dimensions, then the shortcut function can be implemented as a learnable layer, giving

H(x) = F(x; {W_i}) + W_s x. (2.20)

For ResNet with 50 and 101 layers, the residual function F is implemented as a stack of three convolution layers, each with batch normalization and a ReLU activation. This building block is called the "bottleneck block" in (He, Xiangyu Zhang, et al. 2015) and is illustrated in Figure 2.7.
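A sketch of this bottleneck block (again in PyTorch, for illustration; the Detectron2 implementation referenced below differs in details such as where the stride is placed) is shown here. F(x; {W_i}) is the 1×1 → 3×3 → 1×1 convolution stack, and the shortcut is either the identity of equation (2.19) or a learnable 1×1 projection W_s as in equation (2.20) when the input and output dimensions differ.

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    """Bottleneck residual block: H(x) = F(x; {W_i}) + shortcut(x)."""

    def __init__(self, in_ch, mid_ch, out_ch, stride=1):
        super().__init__()
        # F(x; {W_i}): 1x1 -> 3x3 -> 1x1 convolutions, each with batch norm;
        # the final ReLU is applied after the shortcut addition.
        self.f = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        if stride != 1 or in_ch != out_ch:
            # Equation (2.20): learnable projection W_s as a 1x1 convolution.
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
        else:
            # Equation (2.19): identity shortcut.
            self.shortcut = nn.Identity()
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.f(x) + self.shortcut(x))
```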

ResNet50 and ResNet101 are built from these bottleneck blocks, as shown in Figure 2.8. They consist of five convolution stages (stem, res2, res3, res4, res5), followed by a fully connected stage. The first convolution stage consists of a 7×7 convolution layer with batch normalization, ReLU activation and max pooling. The last four convolution stages are implemented by stacking bottleneck blocks on top of each other. The outputs of the convolution stages then go through an average pooling layer to produce the downsampled feature maps, which are then flattened into feature vectors. Finally, the fully connected (FC) layer takes these feature vectors to produce the final outputs.

Figure 2.7 The bottleneck block of 50-layer and 101-layer ResNet. Conv i×i means a convolution layer with kernel size of i×i.

Figure 2.8 ResNet50 and ResNet101.
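The stage layout of Figure 2.8 can be outlined roughly as follows, reusing the hypothetical Bottleneck class from the previous sketch; the per-stage block counts are (3, 4, 6, 3) for ResNet50 and (3, 4, 23, 3) for ResNet101. This is only an outline; the implementation actually used in this work is the Detectron2 one referenced below.

```python
import torch.nn as nn

def make_stage(in_ch, mid_ch, out_ch, num_blocks, stride):
    """Stack bottleneck blocks; only the first block changes resolution/width."""
    blocks = [Bottleneck(in_ch, mid_ch, out_ch, stride=stride)]
    blocks += [Bottleneck(out_ch, mid_ch, out_ch) for _ in range(num_blocks - 1)]
    return nn.Sequential(*blocks)

def resnet50_outline(num_classes=1000):
    stem = nn.Sequential(                      # 7x7 conv, batch norm, ReLU, max pooling
        nn.Conv2d(3, 64, 7, stride=2, padding=3, bias=False),
        nn.BatchNorm2d(64), nn.ReLU(inplace=True),
        nn.MaxPool2d(3, stride=2, padding=1),
    )
    res2 = make_stage(64, 64, 256, num_blocks=3, stride=1)
    res3 = make_stage(256, 128, 512, num_blocks=4, stride=2)
    res4 = make_stage(512, 256, 1024, num_blocks=6, stride=2)   # 23 blocks for ResNet101
    res5 = make_stage(1024, 512, 2048, num_blocks=3, stride=2)
    head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                         nn.Linear(2048, num_classes))          # average pooling + FC
    return nn.Sequential(stem, res2, res3, res4, res5, head)
```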

The implementations of 50-layer and 101-layer ResNet from the Detectron2 API can be found on GitHub¹.

Feature Pyramid Network (FPN)

Feature pyramids are important for detecting objects at different scales (Lin, Dollár, et al. 2017). Although, with convolutional layers and pooling, deep CNNs compute feature maps in a hierarchical and pyramidal structure, these feature maps have large semantic gaps caused by their different depths. Furthermore, using pyramid representations in DNNs is computationally and memory intensive. To improve upon this, FPN (Lin, Dollár, et al. 2017) builds a feature pyramid by exploiting the inherent multi-scale, pyramidal hierarchy of DNNs. The construction of FPN involves a bottom-up pathway and a top-down pathway, as illustrated in Figure 2.9. With this architecture, FPN efficiently constructs feature pyramids from a single-scale image at a marginal extra cost.

¹ https://git.io/JWkqJ

Figure 2.9 FPN overall architecture.

The bottom-up pathway is a backbone CNN for feature extraction. In this thesis, as in (Lin, Dollár, et al. 2017), ResNet is used. The way that FPN utilizes ResNet to extract the feature pyramid is demonstrated in Figure 2.10. As explained in Section 2.1.4, there are five convolution stages in ResNet. The output feature maps of the four residual stages (res2, res3, res4, res5) are chosen as the set of feature maps in the bottom-up pathway for FPN. In Figure 2.10, they are denoted as {C2, C3, C4, C5}. Going deeper into the ResNet, i.e. from C2 to C5, the spatial resolution of the feature maps decreases by half after each stage. However, the lower-resolution feature maps carry stronger semantics, since after each convolution stage more high-level structures are detected by the network (Lin, Dollár, et al. 2017). Using these outputs from the backbone ResNet, FPN adds a top-down pathway. First, M5 is created by putting C5 through a 1×1 convolution. The other feature maps in the top-down pathway, namely {M2, M3, M4}, are created by

M_i = Up(M_{i+1}) + Conv_{1×1}(C_i), (2.21)

where i ∈ {4, 3, 2} and Up is the nearest-neighbor interpolation upsampling function with a scale factor of 2. Finally, 3×3 convolution layers are applied to all the top-down feature maps to reduce the aliasing effect of upsampling (Lin, Dollár, et al. 2017) and create the final feature maps {P2, P3, P4, P5}. FPN's implementation can be found at².
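To close this section, below is a minimal sketch of the top-down pathway of equation (2.21) (PyTorch again, for illustration only; the input channel widths assume the ResNet bottom-up pathway described above, and the 256-channel pyramid width follows Lin, Dollár, et al. 2017): 1×1 lateral convolutions on C2-C5, nearest-neighbor upsampling by a factor of 2, element-wise addition, and final 3×3 convolutions producing P2-P5.

```python
import torch.nn as nn
import torch.nn.functional as F

class TopDownFPN(nn.Module):
    """Top-down pathway of FPN on bottom-up maps C2..C5 (equation 2.21)."""

    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 lateral convolutions applied to C2..C5.
        self.lateral = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels])
        # 3x3 convolutions that reduce the aliasing effect of upsampling.
        self.output = nn.ModuleList(
            [nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
             for _ in in_channels])

    def forward(self, c2, c3, c4, c5):
        laterals = [lat(c) for lat, c in zip(self.lateral, (c2, c3, c4, c5))]
        m5 = laterals[3]                                   # M5 = Conv1x1(C5)
        m4 = F.interpolate(m5, scale_factor=2, mode="nearest") + laterals[2]
        m3 = F.interpolate(m4, scale_factor=2, mode="nearest") + laterals[1]
        m2 = F.interpolate(m3, scale_factor=2, mode="nearest") + laterals[0]
        # P_i = Conv3x3(M_i) for i in {2, 3, 4, 5}.
        return [out(m) for out, m in zip(self.output, (m2, m3, m4, m5))]
```

For a 224×224 input, C2-C5 have spatial sizes of 56, 28, 14 and 7, and the returned P2-P5 keep those sizes with 256 channels each.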