
Object Detection Evaluation Metrics

In order to evaluate the detection performance of the network architectures, popular evaluation metrics such as Average Precision (AP), AP50, AP75, APm, and APl are employed. AP is the primary metric, while AP50, AP75, APm, and APl are derived from it (Padilla, Lobato Passos, et al. 2021).

The outputs of a deep learning method for object detection typically consist of predicted bounding boxes with their corresponding classes and confidence scores. For Mask R-CNN, each detected object also has a predicted mask. In this work, only the metrics for bounding boxes are considered. This section gives a brief overview of how these metrics are defined mathematically; the results are reported in section 4.

In essence, measuring the performance of an object detector involves determining whether a prediction from the model is correct or not. The fundamental detection outcomes are (Padilla, Lobato Passos, et al. 2021):

• True Positive (TP) - a correct detection.

• False Positive (FP) - an incorrect detection, for example, assigning a wrong object class to a bounding box.

• False Negative (FN) - ground truth bounding boxes that are not detected.

Note that True Negative (TN) is not used for object detection, as there are infinitely many bounding boxes that should not be detected. But how do we define whether a prediction is "correct"? To determine these outcomes, the Intersection over Union (IoU) of the predicted and ground-truth bounding boxes is compared with a threshold t ∈ [0, 1]. If IoU > t, the detection is considered correct; if IoU ≤ t, the detection is considered incorrect.
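As a concrete illustration, the following Python sketch computes the IoU of two axis-aligned boxes given as (x1, y1, x2, y2) corner coordinates and applies the threshold test described above. The box format, function names, and the default threshold t = 0.5 are illustrative choices and are not tied to any particular detection framework.

def iou(box_a, box_b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    # Corners of the intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # Union = area of A + area of B - intersection
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred_box, gt_box, t=0.5):
    """A detection counts as correct (TP) when its IoU with the matched
    ground-truth box exceeds the threshold t; otherwise it is an FP."""
    return iou(pred_box, gt_box) > t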

Assume that we have a dataset with G ground-truth objects, a model that outputs N detections with confidence higher than a threshold τ, and that S of those N detections are correct; detections with confidence lower than τ are considered negatives and are ignored. The precision (Pr) and recall (Rc) at the confidence threshold τ (Padilla, Lobato Passos, et al. 2021) can then be expressed as

Pr(τ) = S / N = number of correct detections / number of detections, and    (3.1)

Rc(τ) = S / G = number of correct detections / number of ground truths.    (3.2)

An ideal detector should have high recall (the ability to find all ground-truth objects) as well as high precision (the ability to find only relevant objects). The average precision (AP) is based on the area under the Pr×Rc curve (AUC), which measures the trade-off between precision and recall. A high value of AP means a large AUC, indicating high values of both precision and recall.
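As a minimal Python sketch of Equations 3.1 and 3.2 (written for illustration only), assume each detection is available as a (confidence, is_correct) pair that has already been matched against the ground truth with the IoU test above:

def precision_recall(detections, num_ground_truths, tau):
    """Pr(tau) and Rc(tau) for a list of (confidence, is_correct) detections.

    Only detections with confidence higher than tau count as positives.
    """
    positives = [d for d in detections if d[0] > tau]
    n = len(positives)                                   # N: positive detections
    s = sum(1 for _, correct in positives if correct)    # S: correct detections
    g = num_ground_truths                                # G: ground-truth objects
    precision = s / n if n > 0 else 1.0  # convention when no positives remain
    recall = s / g if g > 0 else 0.0
    return precision, recall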

To compute AP, we first define K different confidence values τ(k) in ascending order as

τ(k), k = 1, 2, . . . , K, such that τ(i) > τ(j) for i > j.    (3.3)

Then, a set of decreasing reference recall values is defined as

Rr(k), k = 1, 2, . . . , K, such that Rr(m) < Rr(n) for m > n.    (3.4)

Before calculating AP, the Pr×Rc pairs have to be interpolated so that the resulting Pr×Rc curve is monotonic (Padilla, Lobato Passos, et al. 2021).

Figure 3.12 An example of an interpolated Pr×Rc curve. Figure from (Padilla, Netto, and Silva 2020).

The result is an interpolated curve given by the function Pr_interp(R), defined as

Pr_interp(R) = max{ Pr(τ(k)) : Rc(τ(k)) ≥ R },    (3.5)

where R is a real value in [0, 1], τ(k) is defined in Equation 3.3, and Rc is defined in Equation 3.2. An example of an interpolated Pr×Rc curve can be seen in Figure 3.12. Taking the Riemann integral of Pr_interp(R) with the K recall values from the set Rr(k) in Equation 3.4 as sampling points, the area under the Pr×Rc curve, or AP, can be calculated as

AP = Σ_{k=0}^{K} (Rr(k) − Rr(k+1)) · Pr_interp(Rr(k)),    (3.6)

where

Rr(0) = Rc(τ(0)) = 1,
Rr(k) = Rc(τ(k)), k = 1, 2, . . . , K,
Rr(K+1) = Rc(τ(K+1)) = 0.
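The following Python sketch ties Equations 3.3 to 3.6 together for a single class at a single IoU threshold. It assumes every detection has already been matched against the ground truth and labelled correct or incorrect; it is a simplified illustration of the all-point interpolation scheme, not the exact COCO implementation.

def average_precision(detections, num_ground_truths):
    """AP for one class at one IoU threshold.

    `detections` is a list of (confidence, is_correct) pairs, already matched
    against the ground-truth boxes at the chosen IoU threshold.
    """
    if num_ground_truths == 0:
        return 0.0

    # Sort by decreasing confidence: each prefix of the sorted list corresponds
    # to one confidence threshold tau(k) from Equation 3.3.
    detections = sorted(detections, key=lambda d: d[0], reverse=True)

    precisions, recalls = [], []
    tp = 0
    for k, (_, correct) in enumerate(detections, start=1):
        tp += int(correct)
        precisions.append(tp / k)                   # Pr(tau(k)), Eq. 3.1
        recalls.append(tp / num_ground_truths)      # Rc(tau(k)), Eq. 3.2

    # Interpolation (Eq. 3.5): replace each precision by the maximum precision
    # found at any equal or higher recall, which makes the curve monotonic.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Riemann sum over the recall steps (Eq. 3.6): each rectangle spans two
    # consecutive recall values and has the interpolated precision as height.
    ap, previous_recall = 0.0, 0.0
    for recall, precision in zip(recalls, precisions):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap

Averaging this value over several IoU thresholds and over all classes gives the COCO-style AP described next.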

The evaluation metrics employed in this work follow the metrics used for the COCO dataset9. The primary metric, AP, is the average of the APs computed at 10 different IoU thresholds t = 0.5, 0.55, . . . , 0.95. AP50 is the area under the Pr×Rc curve at t = 0.5, which is widely used in the PASCAL VOC dataset and was adopted by COCO. AP75 is similar to AP50 but uses the IoU threshold t = 0.75. APm and APl also take the area of the ground-truth object into account: APm evaluates only medium-sized ground-truth objects, whose areas lie in the range [32², 96²] pixels, and APl evaluates only large ground-truth objects, with areas larger than 96² pixels.

9https://cocodataset.org/#detection-eval
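In practice, these metrics are rarely computed by hand: the pycocotools package reports all of them through its COCOeval class. A minimal sketch is shown below; the two JSON file names are placeholders for the ground-truth annotations and the model's detections in COCO format.

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder file names: COCO-format ground truths and predicted boxes.
coco_gt = COCO("annotations.json")
coco_dt = coco_gt.loadRes("detections.json")

# "bbox" evaluates the bounding boxes; "segm" would evaluate the predicted masks.
evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()   # prints AP, AP50, AP75, APs, APm, APl and the AR metrics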

4 Results and Discussions

Table 4.1 shows the object detection evaluation results for the three DNNs employed.

Model          AP     AP50   AP75   APm    APl
MaskRCNN50     13.8   31.4   11.8*  19.3   14.9
MaskRCNN101    14.6*  31.7*  10.9   19.9   15.2*
RetinaNet101   11.3   31.3    4.3   21.2*  11.6

Table 4.1 Object detection results; the best score for each metric is marked with *.

The results in Table 4.1 show that Mask R-CNN with a ResNet backbone of depth 101 (MaskRCNN101) produces the highest score for AP, the most important metric, at 14.6. MaskRCNN101 also obtains the highest scores for AP50 and APl, at 31.7 and 15.2 respectively. Mask R-CNN with a ResNet backbone of depth 50 (MaskRCNN50) comes second in terms of AP, at 13.8. MaskRCNN50 also achieves the best score for AP75, which indicates that this model localizes its detections accurately. While RetinaNet gets the lowest AP score of the three models at 11.3, it has the best score for APm, showing that RetinaNet performs well for medium-sized objects. These scores, however, are not high compared to the best result of Mask R-CNN reported by Facebook research, which is 47.2 for Box AP1. Even though the scores are not particularly high, the output pictures show that the detected tree trunks are well segmented, especially for the trees closer to the center of the test images. This may be because the algorithm that randomly places tree trunks on the background images rarely places trees near the edges of the background images.

Another thing to notice is that even though the three models produce approximately equal scores on AP50 (all around 31), RetinaNet has a very low score on AP75, at 4.3, compared to those of MaskRCNN50 and MaskRCNN101, which are 11.8 and 10.9, respectively. This is also visible in the output images in Figure 4.1, where the confidence threshold is set to 0.5. At τ = 0.5, MaskRCNN50 and MaskRCNN101 produce many visually correct predictions, while RetinaNet does not produce any predictions with confidence higher than 0.5.

In order to visualize the predicted bounding boxes for RetinaNet, the confidence threshold needs to be set as low as τ = 0.25, as shown in Figure 4.2. At this confidence level, RetinaNet produces considerably more bounding boxes for the tree trunks than MaskRCNN50 and MaskRCNN101; however, it has quite low confidence for all of them. All the bounding boxes produced by RetinaNet

1https://bit.ly/2Xr1Stw

Figure 4.1 Predicted output images at confidence threshold τ = 0.5.

Figure 4.2 Predicted output images at confidence threshold τ = 0.25.

Figure 4.3 Predicted output images at confidence threshold τ = 0.9.

have confidence between 25% and 40%, while MaskRCNN50 and MaskRCNN101 can output predictions with more than 90% confidence. This is visualized in Figure 4.3, where the confidence threshold is set to τ = 0.9. Here, the Mask R-CNN models can still find the tree trunks with confidence up to 99%, whereas RetinaNet does not make any predictions at all.
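Setting the confidence threshold for visualization amounts to discarding predictions whose confidence falls below τ. A minimal Python sketch of this filtering step, assuming each prediction is stored as a dictionary with a "score" field (the exact structure depends on the detection framework used):

def filter_by_confidence(predictions, tau):
    """Keep only the predictions whose confidence score exceeds tau."""
    return [p for p in predictions if p["score"] > tau]

# tau = 0.5 keeps most Mask R-CNN detections but none from RetinaNet (Figure 4.1),
# while tau = 0.25 is low enough to show the RetinaNet boxes (Figure 4.2).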

Next is the discussion of the times taken to train and test the models. As explained in section 3.2, the three DNN models are trained on 2000 synthetic RGB images of spatial resolution 512×512 pixels. The trained models are then used to run inference on 47 RGB images, each of spatial resolution 1920×1200 pixels. Both training and testing are done on an NVIDIA GeForce GTX 1080 Ti GPU. The times that the three DNN models take to carry out the training and testing processes are shown in Table 4.2.

Model          Training (seconds)   Testing (seconds)
MaskRCNN50            478                   21
MaskRCNN101           622                   18
RetinaNet101          511                   15

Table 4.2 Training and testing times of the models.

The results from Table 4.2 show that MaskRCNN50 is the fastest during training; however, it is the slowest during testing (0.447 seconds per image). MaskRCNN101 takes the most time during training while achieving a moderate result during testing (0.383 seconds per image). RetinaNet101 is the fastest during testing: it takes 0.319 seconds to produce detections for one image.
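The per-image testing times quoted above follow directly from the totals in Table 4.2 and the 47 test images, as the short Python computation below shows:

# Per-image inference time = total testing time / number of test images.
test_images = 47
for model, total_seconds in [("MaskRCNN50", 21), ("MaskRCNN101", 18), ("RetinaNet101", 15)]:
    print(f"{model}: {total_seconds / test_images:.3f} s per image")
# MaskRCNN50: 0.447, MaskRCNN101: 0.383, RetinaNet101: 0.319 seconds per image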

Considering the fact that the Mask R-CNN models generate segmentation masks in addition to the bounding boxes, and that their detections have high confidence and high accuracy, the author concludes that they are more suitable for the application of tree trunk detection. Furthermore, Mask R-CNN with the ResNet 101 backbone performs best, with the highest AP and AP50 scores. Even though it takes longer to train than the other two models, MaskRCNN101 achieves a better testing time than MaskRCNN50, and when the confidence threshold is set to be high, e.g. τ = 0.9, MaskRCNN101's results are visually much better than those of MaskRCNN50 and RetinaNet101 (Figure 4.3). In general, if training time is not a major concern, MaskRCNN101 should be the go-to option. RetinaNet101 could be chosen when we need to detect as many objects as possible, even at low confidence.