4.2.3 System BD-rate and front BD-rates

Next, the proposed method is evaluated against the VVC/H.266 baseline in terms of Bjøntegaard Delta (BD) rate. BD-rates were introduced in subsection 2.4.4. A negative BD-rate value indicates the amount of bitrate saved while achieving the same level of task performance. There are two considerations regarding the bitrate range. Firstly, when the machine branch is turned off, the system works with the human branch only, which is a state-of-the-art traditional video codec. Secondly, if the task performance on object detection falls below a certain level, the results would probably not be useful in any application.

The acceptance threshold was subjectively determined to be 10% mAP by visualizing the detection results. Therefore, only the bitrate range over which the proposed system obtains at least a 10% mAP score is considered. The BD-rate between the VVC/H.266 anchors and the Pareto front of the results was calculated over the range [0.012, 0.072] BPP, where

1. the machine branch is switched on, and
2. the system performance is above 10% mAP.

In this interval, the achieved average improvement in BD-rate is −40.49%, as shown in Table 4.2. Outside of this bitrate range, the system can always fall back to the traditional codec, i.e., switch off the machine branch. Therefore, the proposed system never underperforms the VVC codec.
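For reference, the following is a minimal Python sketch of the standard Bjøntegaard calculation with mAP taking the role of the quality metric; the function and the arrays it expects are illustrative, not necessarily the exact implementation behind Table 4.2.

```python
import numpy as np

def bd_rate(rate_anchor, map_anchor, rate_test, map_test):
    """Bjøntegaard Delta rate between an anchor and a test curve.

    Each curve needs at least four (bitrate, mAP) points. Bitrate is
    fitted in the log domain as a cubic polynomial of mAP, and both
    fits are integrated over the overlapping mAP interval. A negative
    result means the test codec needs less bitrate for the same task
    performance.
    """
    log_rate_anchor = np.log10(rate_anchor)
    log_rate_test = np.log10(rate_test)

    # Cubic fit of log-rate as a function of task performance.
    poly_anchor = np.polyfit(map_anchor, log_rate_anchor, 3)
    poly_test = np.polyfit(map_test, log_rate_test, 3)

    # Overlapping mAP interval of the two curves.
    lo = max(np.min(map_anchor), np.min(map_test))
    hi = min(np.max(map_anchor), np.max(map_test))

    # Integrate both fits over the interval and average the gap.
    int_anchor = np.diff(np.polyval(np.polyint(poly_anchor), [lo, hi]))[0]
    int_test = np.diff(np.polyval(np.polyint(poly_test), [lo, hi]))[0]
    avg_log_diff = (int_test - int_anchor) / (hi - lo)

    return (10 ** avg_log_diff - 1) * 100  # percent bitrate change
```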

Table 4.2. Front range and front BD-rate of models trained with different settings, and the system BD-rate. The front range is given as [lower boundary, upper boundary].

Setting            Front range (BPP)   Front BD-rate (%)
50%, 37, α = 5     [0.035, 0.072]      −21.31
50%, 37, α = 10    [0.034, 0.064]      −24.18
50%, 42, α = 5     [0.022, 0.058]      −23.42
50%, 42, α = 10    [0.022, 0.049]      −23.00
50%, 48, α = 1     [0.020, 0.058]      −21.69
50%, 48, α = 2     [0.017, 0.045]      −32.48
50%, 48, α = 5     [0.013, 0.025]      −37.27
50%, 48, α = 10    [0.013, 0.015]      −43.44
50%, 51, α = 5     [0.013, 0.042]      −36.17
50%, 51, α = 10    [0.012, 0.056]      −34.58
System BD-rate     [0.012, 0.072]      −40.49

As already discussed in subsection 4.2.1, different settings contribute to different ranges on the Pareto front. The front range of a setting is defined as the bitrate range where the models trained with that setting outperform the VVC/H.266 anchor; the front BD-rate is defined as the BD-rate that the setting achieves on its front range with regard to the VVC/H.266 anchor. The models trained with a setting with a lower base BPP value, e.g., setting "50%, 48" in Table 4.1, perform better at a lower bitrate range.
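As an illustration, such a front can be extracted from a pool of (bitrate, mAP) operating points with a sweep like the following sketch; the values in the usage example are illustrative, not results from this work.

```python
def pareto_front(points):
    """Keep the (bpp, mAP) operating points not dominated by any other
    point, i.e., those for which no other point offers a higher mAP at
    an equal or lower bitrate."""
    front = []
    # Ascending bitrate; at equal bitrate, best mAP first.
    for bpp, score in sorted(points, key=lambda p: (p[0], -p[1])):
        if not front or score > front[-1][1]:
            front.append((bpp, score))
    return front

# Illustrative operating points, not values from this work.
points = [(0.013, 0.22), (0.020, 0.25), (0.025, 0.24), (0.042, 0.31)]
print(pareto_front(points))  # [(0.013, 0.22), (0.020, 0.25), (0.042, 0.31)]
```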

Meanwhile, the α value, which determines the weight of the rate loss in Eq. 4.2, also has a significant impact on the front range. The settings with lower α values tend to save less in BD-rate, but are more general, i.e., the trained models cover a larger bitrate range.

4.2.4 Visualizations

An inherent advantage of working in the field of computer vision is that the processed data and results can easily be visualized in a human-understandable fashion, since images are a natural part of these systems. In this section, three figures are provided to illuminate why the proposed system achieves such competitive performance.

The first such visualization is provided in Figure 4.3. It shows the images and the extracted features at different stages of the proposed system. To obtain a 2-dimensional representation, the features were averaged over all channels: the brighter the color, the larger the value. The data were collected from the model working at 0.025 BPP, trained using the setting "50%, 48, α = 5" in Table 4.2.
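A minimal numpy sketch of this channel averaging, assuming the features arrive as a (C, H, W) array; the function name is illustrative.

```python
import numpy as np

def feature_to_image(feat):
    """Collapse a (C, H, W) feature array to a 2-D map by averaging
    over the channel dimension, then rescale to [0, 1] so that
    brighter pixels correspond to larger activations."""
    fmap = feat.mean(axis=0)
    fmap = fmap - fmap.min()
    if fmap.max() > 0:
        fmap = fmap / fmap.max()
    return fmap
```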

As shown in the upper row of columns (a) and (b), the object detection predictions on the reconstructed image x̂ at the human branch are clearly inferior to those on the input image x. This is due to the considerable compression applied by the VVC/H.266 codec at the human branch with the chosen setting. From the upper row of column (d), it is evident that the object detection performance is significantly improved with the help of the enhanced decoded features f̂_enh. As a side note, the detections seem to universally deteriorate towards the edges of the images. One explanation is that when objects are located partially outside the bounds of the image, detecting them becomes more difficult, since there is less information available to make the detection compared to objects that are more centrally located.

From the feature residuals f_r and the decoded feature residuals f̂_r in column (c), it is interesting to note how the feature residual codec has learned to discard unimportant details and retain only the most important parts of the feature residuals. The texture in the background regions, i.e., regions where no objects are present, is remarkably smoothed, while the features in the foreground regions, i.e., regions where objects are present, are preserved after compression. Similar behavior has also been observed in [50], where the compression is performed in the image domain. This behavior is important for the proposed system to achieve a better trade-off between the compression of the residuals and the improvement in task performance, without heavily inflating the total bitrate of the system.
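To relate the symbols in Figure 4.3, the following sketch shows how the quantities f_r, f̂_r and f̂_enh fit together; task_backbone and residual_codec are hypothetical placeholders for the task-network feature extractor and the learned feature residual codec.

```python
def enhanced_features(x, x_hat, task_backbone, residual_codec):
    """Sketch of the feature-residual flow in Figure 4.3.

    `task_backbone` and `residual_codec` are hypothetical callables
    standing in for the task-network feature extractor and the learned
    feature residual codec, respectively.
    """
    f_x = task_backbone(x)          # features of the input image x
    f_xhat = task_backbone(x_hat)   # features of the VVC-decoded image x̂
    f_r = f_x - f_xhat              # feature residual (encoder side)
    f_r_hat = residual_codec(f_r)   # compressed and decoded residual
    return f_r_hat + f_xhat         # enhanced features f̂_enh
```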

Figure 4.3. Images and features derived at different stages. Samples from two input images are selected and arranged in 4 columns. Each column contains two samples for each input image. (a) Up: input image x with detected objects by the task network shown in boxes. Down: features f_x extracted from x. (b) Up: reconstructed image x̂ at the human branch with detected objects by the task network. Down: features f_x̂ extracted from x̂. (c) Up: feature residuals f_r = f_x − f_x̂. Down: decoded feature residuals f̂_r at the machine branch. (d) Up: decoded image x̂ at the human branch with detected objects by the task network using enhanced decoded features f̂_enh. Down: enhanced decoded features f̂_enh = f̂_r + f_x̂. Features are represented by their average along the channel dimension and scaled for better visualization. A threshold value of 0.5 is used to visualize the detected objects. Bounding boxes for different classes are drawn with distinct colors.

The behavior of retaining the features in important regions is further illustrated in Figure 4.4. Three models were selected, working at bitrates of 0.042 BPP, 0.027 BPP and 0.018 BPP, respectively. The models were trained with the same setting as the model in Figure 4.3. As shown in columns (b-d), the textures in the background regions are partially preserved by the model working at the higher bitrate, while heavily smoothed by the model working at the lower bitrate. It is also interesting to see that the low-bitrate model generates decoded feature residuals that contain no meaningful information for the second sample (second row), since it does not contain any objects (the classes of the objects were defined in subsection 4.1.1).

Finally, the last visualization is presented in Figure 4.5. Comparing the predictions in columns (b-d) to the ground truth detections, the best predictions are generated by the model working with uncompressed input images (b), nearly as good predictions are produced by the proposed method in column (d), and the predictions for the compressed images in column (c) are clearly the worst of the three. However, this comparison is not entirely fair, since it does not take into account the differences in total bitrate between the columns. It merely illustrates how strongly the results are affected by introducing the machine stream (the transition from (c) to (d)).

Figure 4.4. Decoded feature residuals by models with different bitrate targets. Samples from two input images are selected and arranged in 4 columns. (a) Input image x. (b-d) Decoded feature residuals f̂_r at the machine branch by the models with bitrates of 0.042, 0.027 and 0.018 BPP, respectively.

Inspecting these images carefully, one can see that the ground truth bounding boxes contain clear mistakes. This is particularly apparent in the last row, which is separately visualized in Figure 4.6. According to the ground truth bounding boxes, there is only one bike stationed next to the wall on the left side of the image, and no person is present. However, there are a total of four bikes, and a person is sitting in a chair near the middle of the image, next to the building that is illuminated by the sunshine.

These discrepancies can be attributed to human mistakes while annotating the dataset.

Nonetheless, the consequence of these mistakes is that even a system that learns to detect objects perfectly is penalized for correct detections, which hurts both its training efficiency and its test-time performance. In fact, in the provided example (last row of Figure 4.5), the system was able to detect some of the objects that were present but not annotated. These detections are then labelled as false positives, lowering the performance score of the system. Such mistakes may partly explain why the typical performance on the object detection task on the Cityscapes dataset is relatively low, reaching less than a 30% mAP score without compression, whereas for other datasets, such as COCO, the state of the art is clearly above 50% with the same criterion (mAP@[0.5:0.05:0.95]) [20].

Figure 4.5. Object detection predictions comparison. Samples from 4 input images are selected and arranged in 4 columns. (a) Input image x with ground truth detections shown in boxes. (b) Input image x with detected objects by the task network using features f_x. (c) Reconstructed image x̂ at the human branch with detected objects by the task network using features f_x̂. (d) Decoded image x̂ at the human branch with detected objects by the task network using enhanced decoded features f̂_enh. A threshold value of 0.5 is used to visualize the detected objects. Bounding boxes for different classes are drawn with distinct colors.

Figure 4.6. Ground truth detections from the last row of Figure 4.5. The visualization is meant to highlight the absence of ground truth boxes. In this example, only one bike was annotated. Moreover, the person sitting in the chair was not annotated either.