One-stage object detection

In Faster-RCNN and R-FCN, the networks need to run classification on the proposed bounding boxes multiple times on different areas of the input image [22]. One-stage object detectors do not need separate networks for bounding box proposals and classifications. They calculate the bounding box locations and categories simultaneously, making their inference much faster than with two-stage detectors.

Figure 5.5. The SSD architecture: the input image is processed by a backbone network and additional SSD convolutional layers to produce the detections. Adapted from [27].

SSD

The SSD (Single Shot Detector) architecture is multiple times faster and nearly as accurate as the previously mentioned two-stage detectors [27]. It produces a fixed number of bounding boxes and scores in a single pass through the network. SSD predicts the object bounding boxes straight from the extracted features.

The architecture of the SSD network has a feature extractor base network with added feature layers and convolutional layers for predicting the detections. The convolutional layers on top of the base network in SSD allow predictions at multiple scales. The spatial size of the layers decreases from one layer to the next. Each of these so-called feature layers produces a fixed number of prediction outputs. The network architecture is illustrated in Figure 5.5.
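
As an illustration of this structure, the following PyTorch-style sketch builds a few extra feature layers on top of a backbone output; each strided 3x3 convolution halves the spatial size so that detections can be predicted at several scales. The channel counts and feature map sizes are illustrative assumptions, not the exact SSD configuration.

import torch
import torch.nn as nn

def extra_feature_layer(in_ch, mid_ch, out_ch):
    """One SSD-style extra layer: 1x1 reduction, then a strided 3x3
    convolution that halves the spatial size of the feature map."""
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, 1), nn.ReLU(),
        nn.Conv2d(mid_ch, out_ch, 3, stride=2, padding=1), nn.ReLU(),
    )

extras = nn.ModuleList([
    extra_feature_layer(1024, 256, 512),  # e.g. 19x19 -> 10x10
    extra_feature_layer(512, 128, 256),   # 10x10 -> 5x5
    extra_feature_layer(256, 128, 256),   # 5x5 -> 3x3
])

x = torch.randn(1, 1024, 19, 19)          # backbone output (illustrative)
feature_maps = []
for layer in extras:
    x = layer(x)
    feature_maps.append(x)                # each map feeds its own predictor
print([f.shape[-1] for f in feature_maps])  # [10, 5, 3]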

The number of predictions is related to the default bounding boxes defined for the network, similarly to the anchor boxes in Faster-RCNN. The difference is that feature maps of different resolutions contain default boxes of different sizes. Thus, objects with varying shapes and scales can be detected. The default boxes tile the created feature maps with a grid of cells.

The default box aspect ratios and scales are manually configured for training. They can be designed to fit an application and dataset. For example, when detecting traffic signs such as speed limits, only square aspect ratios are needed, and the others can be removed for faster computation. The default scales lie between 0.2 and 0.9, where the largest scale is used in the highest layer and the scale decreases in the following layers. The default aspect ratios are 1/3, 1/2, 1, 2 and 3, so the aspect ratios range from 1:3 to 3:1.
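
As an illustration, the following sketch generates centre-form default boxes for one feature map using the aspect ratios above; the feature map size and the single scale value are illustrative assumptions.

import math

def default_boxes(fmap_size, scale, aspect_ratios=(1/3, 1/2, 1, 2, 3)):
    """Generate centre-form default boxes (cx, cy, w, h) for one square
    feature map, with coordinates relative to the image (0..1)."""
    boxes = []
    for i in range(fmap_size):
        for j in range(fmap_size):
            cx = (j + 0.5) / fmap_size   # cell centre, normalised
            cy = (i + 0.5) / fmap_size
            for ar in aspect_ratios:
                w = scale * math.sqrt(ar)
                h = scale / math.sqrt(ar)
                boxes.append((cx, cy, w, h))
    return boxes

# Example: a 5x5 feature map with scale 0.55 (between 0.2 and 0.9)
boxes = default_boxes(5, 0.55)
print(len(boxes))  # 5 * 5 * 5 = 125 default boxes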

At the location of each anchor, the object's location offset is predicted with 4 offset values along with the class category scores. The number of outputs of a feature map of size m×n is (c + 4)kmn, where c is the number of classes and k the number of default boxes.
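
For example, with c = 21 classes, k = 5 default boxes and a 10×10 feature map, one layer produces (21 + 4) · 5 · 10 · 10 = 12 500 output values, as the small calculation below illustrates (the numbers are illustrative only).

def ssd_layer_outputs(c, k, m, n):
    """Number of prediction values produced by one m x n feature layer:
    k default boxes per cell, each with c class scores and 4 offsets."""
    return (c + 4) * k * m * n

print(ssd_layer_outputs(c=21, k=5, m=10, n=10))  # 12500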

The object bounding box is predicted as the height, width and x- and y-coordinates of the object based on the offsets. A visualization of the anchor box detections is presented in Figure 5.6.
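
The following sketch decodes one predicted offset into a box. It assumes the common SSD-style parameterisation, in which the centre offsets are scaled by the default box size and the width and height offsets are in log space; the exact scaling terms of a given implementation may differ.

import math

def decode_box(default, offsets):
    """Turn predicted offsets into a box, both in centre form (cx, cy, w, h).

    Assumed parameterisation:
      cx = d_cx + t_cx * d_w,  cy = d_cy + t_cy * d_h
      w  = d_w * exp(t_w),     h  = d_h * exp(t_h)
    """
    d_cx, d_cy, d_w, d_h = default
    t_cx, t_cy, t_w, t_h = offsets
    cx = d_cx + t_cx * d_w
    cy = d_cy + t_cy * d_h
    w = d_w * math.exp(t_w)
    h = d_h * math.exp(t_h)
    return cx, cy, w, h

print(decode_box((0.5, 0.5, 0.2, 0.2), (0.1, -0.05, 0.2, 0.0)))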

Figure 5.6. An example set of SSD detections and the anchor boxes in two different feature maps that are responsible for the detections. Each bounding box contains its location (cx, cy, w, h) and class confidences (c1, c2, ..., cn). Adapted from [27].

The loss function used by SSD during training is a weighted sum of the localization loss and confidence loss.

L = \frac{1}{N}\left(\lambda L_{loc} + L_{conf}\right), \qquad (5.2)

where N is the number of matched default boxes and λ the weight term, set to 1 by cross-validation. The localization loss minimizes the error between the predicted bounding box and the ground truth with a smooth L1 loss. The confidence loss is calculated as the softmax loss over the predicted class confidences.
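
A minimal NumPy sketch of equation 5.2 is given below. The input arrays are illustrative, and details of the actual SSD training procedure, such as hard negative mining of the confidence loss, are left out.

import numpy as np

def smooth_l1(x):
    """Smooth L1: 0.5*x^2 when |x| < 1, |x| - 0.5 otherwise."""
    return np.where(np.abs(x) < 1, 0.5 * x ** 2, np.abs(x) - 0.5)

def ssd_loss(loc_pred, loc_gt, conf_logits, labels, lam=1.0):
    """Weighted sum of localization and confidence loss over N matched boxes."""
    n = loc_pred.shape[0]                       # N matched default boxes
    l_loc = smooth_l1(loc_pred - loc_gt).sum()  # localization loss (smooth L1)
    # softmax cross-entropy over the predicted class confidences
    e = np.exp(conf_logits - conf_logits.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)
    l_conf = -np.log(probs[np.arange(n), labels]).sum()
    return (lam * l_loc + l_conf) / n

loc_pred = np.array([[0.1, 0.0, 0.2, -0.1], [0.0, 0.1, 0.0, 0.0]])
loc_gt   = np.array([[0.0, 0.0, 0.1,  0.0], [0.0, 0.0, 0.0, 0.0]])
conf     = np.array([[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]])  # 3 classes
print(ssd_loss(loc_pred, loc_gt, conf, labels=np.array([0, 1])))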

SSD is best used in situations that require fast, real-time detection of objects [22]. Its accuracy is comparable to slower, more complicated models such as Faster-RCNN when detecting large objects. However, it has problems detecting small objects due to the fixed anchors it uses. Compared to other object detectors, SSD's performance degrades less when a simpler feature extractor is used, which further increases the possibilities for making a model faster and smaller.

YOLO

The YOLO (You Only Look Once) object detector can also be used for fast and efficient detection of objects [28]. Similarly to SSD, YOLO produces the detections in a single pass through the network. The strengths of YOLO are its ability to learn contextual information from the whole image and its ability to generalize well. However, it shares SSD's difficulty in detecting small objects in images.

The network architecture of YOLO is a single neural network. The network first divides the input image into a grid. Each object in the image is detected by the grid cell in which the object's centre point lies. Each grid cell predicts the bounding box, the confidence of the detection and the predicted class for all the objects it is responsible for detecting.

The same object can be found by neighbouring cells, in which case the detection with the highest confidence can be chosen as the correct one. The YOLO detection pipeline is presented in Figure 5.7.
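
The following sketch shows how the responsible grid cell can be found for an object, assuming normalised image coordinates and an S×S grid (the base YOLO network uses S = 7).

def responsible_cell(box_cx, box_cy, grid_size=7):
    """Return the (row, col) of the grid cell that contains the
    object's centre point; that cell predicts the object."""
    col = min(int(box_cx * grid_size), grid_size - 1)
    row = min(int(box_cy * grid_size), grid_size - 1)
    return row, col

# An object centred at (0.62, 0.30) in normalised image coordinates
print(responsible_cell(0.62, 0.30))  # (2, 4)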

Figure 5.7. The YOLO detection pipeline: the gridded input produces bounding boxes with confidences and class probabilities, which are combined into the output detections. Adapted from [28].

The base YOLO network consists of 24 convolutional layers and 2 fully connected layers. Fast YOLO, a smaller and faster version, reduces the number of convolutional layers to 9. The YOLO network has been further improved with YOLOv2 [29] and YOLOv3 [30]. These networks can be used to detect a very large number of classes in real time.

The YOLOv2 network is more accurate and faster than the YOLO network [29]. Instead of making the network larger to improve accuracy, YOLOv2 is a simplified version of YOLO. Better training practices are used so the network learns more accurately: batch normalisation is used on all convolutional layers, and the input images are larger than in YOLO, with the input size also changed periodically during training to make the network robust to different resolutions.
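
The multi-scale training idea can be sketched as follows: a new square input resolution is drawn periodically and the training images are resized to it. The range of multiples of 32 between 320 and 608 and the resampling interval used here follow the commonly reported YOLOv2 setting and are assumptions of this sketch.

import random

def multi_scale_sizes(num_iterations, step=10, low=320, high=608):
    """Yield an input resolution for each training iteration; a new size,
    a multiple of 32 between low and high, is drawn every `step` iterations."""
    choices = list(range(low, high + 1, 32))
    size = random.choice(choices)
    for it in range(num_iterations):
        if it % step == 0:
            size = random.choice(choices)
        yield size

print(set(multi_scale_sizes(100)))  # the resolutions used during 100 iterations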

Instead of predicting bounding boxes directly from the top layers, YOLOv2 uses anchor boxes, similarly to the SSD network. The anchor boxes are assigned to cells in the image's feature maps, so they are not focused on only one area of the image. K-means clustering over the training set bounding boxes is used to determine the dimensions of the anchor boxes.
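
A sketch of such dimension clustering is given below, using 1 − IoU between box shapes as the k-means distance in the spirit of YOLOv2; the toy box dimensions are hypothetical.

import numpy as np

def iou_wh(wh, centroids):
    """IoU between a box shape (w, h) and each centroid shape,
    assuming all boxes share the same centre."""
    inter = np.minimum(wh[0], centroids[:, 0]) * np.minimum(wh[1], centroids[:, 1])
    union = wh[0] * wh[1] + centroids[:, 0] * centroids[:, 1] - inter
    return inter / union

def kmeans_anchors(box_whs, k=5, iters=100, seed=0):
    """Cluster training-set box shapes into k anchor shapes with 1 - IoU distance."""
    rng = np.random.default_rng(seed)
    centroids = box_whs[rng.choice(len(box_whs), k, replace=False)]
    for _ in range(iters):
        assign = np.array([np.argmax(iou_wh(wh, centroids)) for wh in box_whs])
        for i in range(k):
            if np.any(assign == i):
                centroids[i] = box_whs[assign == i].mean(axis=0)
    return centroids

# Toy box widths/heights in grid units (hypothetical data)
boxes = np.array([[1.0, 1.2], [0.9, 1.1], [3.0, 3.2],
                  [2.8, 3.0], [5.0, 2.0], [4.8, 2.2]])
print(kmeans_anchors(boxes, k=3))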

A new classification model is defined as a feature extractor for the YOLOv2 network. The Darknet-19 model is based on the GoogLeNet [31] classification model. The architecture mostly uses convolutions with 3x3 filters, with 1x1 filters in between to compress the data, and the number of feature channels is doubled after each max pooling layer.

The resulting model has 19 convolutional layers with batch normalisation and 5 max pooling layers. The Darknet-19 feature extractor is trained simultaneously with the YOLOv2 detector: the images given to it are training data for either classification or detection, and the corresponding part of the combined network is trained based on the input.
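
The repeating pattern of Darknet-19 can be sketched in PyTorch style as follows: 3x3 convolutions with a 1x1 compression in between, batch normalisation on every convolution, and channel doubling after each max pooling. The stage and channel counts below are illustrative, not the full Darknet-19 definition.

import torch.nn as nn

def conv_bn(in_ch, out_ch, k):
    """Convolution + batch norm + leaky ReLU, as used throughout Darknet-19."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1),
    )

def darknet_stage(in_ch, out_ch):
    """One stage: 3x3 -> 1x1 compression -> 3x3, then downsample.
    The number of channels doubles from one stage to the next."""
    return nn.Sequential(
        conv_bn(in_ch, out_ch, 3),
        conv_bn(out_ch, out_ch // 2, 1),   # 1x1 compresses the channels
        conv_bn(out_ch // 2, out_ch, 3),
        nn.MaxPool2d(2, 2),
    )

# Two consecutive stages: 128 -> 256 -> 512 channels (illustrative)
backbone_part = nn.Sequential(darknet_stage(128, 256), darknet_stage(256, 512))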

Figure 5.8. The CornerNet architecture: the input image is processed by a backbone network, and corner heatmaps and embeddings are combined into the detections. The upper path detects the top-left corners and the lower path the bottom-right corners. Adapted from [32].

The YOLOv3 network makes small improvements to the previous network to achieve even better performance [30]. A new feature extractor is defined to improve the accuracy of the detector model. The Darknet-53 model works similarly to the Darknet-19 model, but with 53 convolutional layers. YOLOv3 does not struggle with small objects as much as its previous versions.

Additional convolutional layers are added on top of the backbone network. These layers have feature maps of different scales, which can be combined to predict bounding boxes at 3 different scales. The information is passed on to the last few convolutional layers, which combine it and produce the multi-scale detection outputs.

CornerNet

CornerNet uses a single convolutional network to predict heatmaps of the top-left and bottom-right corners of object categories and an embedding vector to combine the information [32]. Many detectors attempt to detect the centre of objects, but this requires information in all four directions from the centre. Detecting the corners only requires analysing two directions to find the rest of the object.

Unlike SSD and YOLO, CornerNet does not use anchor boxes, since accurate predictions would require a large number of anchor boxes with many different hyperparameters. Instead, the CornerNet network uses its own approach for localizing the objects from the features of the backbone network.

The network combines corner heatmaps, an embedding vector and offset estimation to predict the final bounding boxes of objects. The architecture is presented in Figure 5.8.

During training, the corner heatmap predictions are penalized based on their distance from the ground truth corner locations. The corners are embedded with an associative embedding method, and the distances between the embeddings determine which top-left and bottom-right corners correspond to each other. The embedding is done to separate objects and find their matching corners.
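
The grouping step can be sketched as follows: a top-left corner is paired with the bottom-right corner whose one-dimensional embedding is closest, provided the distance stays under a threshold. The threshold and embedding values below are illustrative assumptions.

import numpy as np

def match_corners(tl_embeddings, br_embeddings, threshold=0.25):
    """Pair each top-left corner with the bottom-right corner whose
    embedding is closest, if the distance stays under the threshold."""
    pairs = []
    for i, e_tl in enumerate(tl_embeddings):
        dists = np.abs(br_embeddings - e_tl)
        j = int(np.argmin(dists))
        if dists[j] < threshold:
            pairs.append((i, j))
    return pairs

tl = np.array([0.10, 0.85, 0.40])   # embeddings of detected top-left corners
br = np.array([0.82, 0.12])         # embeddings of detected bottom-right corners
print(match_corners(tl, br))        # [(0, 1), (1, 0)]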

Because the actual locations of the corners are usually outside of the object boundaries, corner pooling is developed to recognize the corner locations. Corner pooling uses max pooling so that the top-left predictor pools feature vectors below and to the right of a point, while the bottom-right predictor pools feature vectors above and to the left of it. The two pooled directions are then combined.
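
A minimal NumPy sketch of top-left corner pooling on a single feature channel is given below: every location takes the maximum of the features to its right and the maximum of the features below it, and the two pooled values are summed.

import numpy as np

def top_left_corner_pool(fmap):
    """Top-left corner pooling on one 2-D feature map: for every location,
    take the max of the features to its right and the max of the features
    below it, and sum the two pooled values."""
    right_max = np.flip(np.maximum.accumulate(np.flip(fmap, axis=1), axis=1), axis=1)
    below_max = np.flip(np.maximum.accumulate(np.flip(fmap, axis=0), axis=0), axis=0)
    return right_max + below_max

fmap = np.array([[0.1, 0.9, 0.2],
                 [0.4, 0.3, 0.8],
                 [0.0, 0.5, 0.6]])
print(top_left_corner_pool(fmap))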

CornerNet achieves accuracy equal to SSD and YOLO, but it requires more computational power and time [33]. Two variations of CornerNet have been developed to tackle this problem: CornerNet-Saccade, which attempts to reach comparable accuracy with a lower computational cost, and CornerNet-Squeeze, which attempts to maintain accuracy in a lightweight, real-time setting.

Saccades in object detection refer to the saccades of the human visual system, the eye movements that shift focus between different regions of an image. CornerNet-Saccade attempts to detect objects from small regions of the input image that are likely to contain them.

First, the possible object locations in the image are predicted. The coarse locations are found from two down-scaled versions of the image: three attention maps are predicted from the feature maps of the architecture's feature extractor network at different scales for each down-scaled image, and the attention maps are used to predict objects of different sizes.

When the coarse locations of possible objects have been found from the down-scaled images, the regions are examined again at a higher resolution. Depending on the attention map, and thus the object size, the image is zoomed in on the predicted location to find a more accurate detection.

To increase performance, the different zoomed-in regions are processed in batches on the GPU. Additionally, a maximum number of object location predictions is set, and areas with a higher likelihood of containing objects are prioritized. The Soft-NMS postprocessing algorithm is used to restrict the number of predictions based on the prioritized locations.
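
The idea of Soft-NMS can be sketched as follows: instead of removing overlapping detections outright, their scores are decayed according to their overlap with the currently best-scoring box. The Gaussian decay and its sigma value used here are one common variant and an assumption of this sketch.

import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(np.asarray(box)) + area(boxes) - inter)

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS: decay the scores of boxes overlapping the best box."""
    boxes, scores = boxes.copy(), scores.copy()
    keep = []
    while np.any(scores > score_thresh):
        best = int(np.argmax(scores))
        keep.append(best)
        overlaps = iou(boxes[best], boxes)
        scores *= np.exp(-(overlaps ** 2) / sigma)   # decay overlapping scores
        scores[best] = 0.0                           # do not pick the same box again
        scores[scores <= score_thresh] = 0.0
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(soft_nms(boxes, scores))  # [0, 2, 1]; the overlapping second box is kept last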

CornerNet-Squeeze uses approaches similar to those of SqueezeNet and MobileNet, presented in chapter 4, to reduce computations. These are used to design a more efficient backbone network. The main approaches are replacing 3x3 convolutions with 1x1 convolutions, decreasing the number of input channels to the 3x3 convolutions and downsampling the image later in the network.

The residual blocks in the CornerNet-Squeeze network are switched to the fire modules used in SqueezeNet. The 3x3 convolutions are replaced with 3x3 depthwise separable convolutions, as in MobileNet. The feature map resolution is reduced by adding one more downsampling layer. Additionally, the upsampling is changed from nearest neighbour to a transposed convolution with a 4x4 kernel.
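
A PyTorch-style sketch of a fire module in the spirit of CornerNet-Squeeze is given below, with the 3x3 expand branch realised as a depthwise separable convolution; the channel sizes are illustrative.

import torch
import torch.nn as nn

class FireModule(nn.Module):
    """Squeeze with 1x1 convolutions, then expand with a 1x1 branch and a
    depthwise separable 3x3 branch (in the spirit of CornerNet-Squeeze)."""
    def __init__(self, in_ch, squeeze_ch, expand_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, 1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand_ch, 1)
        self.expand3x3 = nn.Sequential(
            nn.Conv2d(squeeze_ch, squeeze_ch, 3, padding=1, groups=squeeze_ch),  # depthwise
            nn.Conv2d(squeeze_ch, expand_ch, 1),                                 # pointwise
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        return self.relu(torch.cat([self.expand1x1(x), self.expand3x3(x)], dim=1))

# Illustrative: 64 input channels squeezed to 16, expanded to 2 * 64 channels
out = FireModule(64, 16, 64)(torch.randn(1, 64, 32, 32))
print(out.shape)  # torch.Size([1, 128, 32, 32])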

Figure 5.9. Finding the most relevant object detector outputs with the NMS algorithm.

The performance of the lightweight versions has been evaluated on an example detection dataset. The CornerNet-Saccade network achieves slightly better accuracy and speed than CornerNet: the accuracy increases from 40.6% to 42.6% while the speed increases by 11%. CornerNet-Squeeze is much faster, with detection time reduced by 85% and an accuracy of 32.4%. This speed and accuracy are comparable to the YOLOv3 network.