
2.4 Coding techniques on the image domain

2.4.4 Machines as the end-users

Cisco Annual Internet Report (2018-2023) [19] states that machine-to-machine connections will rise from 33% of the total devices and connections in 2018 to 50% by 2023.

Nowadays, it is increasingly common that the consumers of image data are machines, not humans, or both machines and humans. The machines are typically trained and used with compressed images. These images are usually compressed with traditional image codecs that are targeted at human consumers, as discussed in the previous subsection.

The human-targeted codecs are optimized for human consumption, which can cause problems when they are used with machines. Most notably, machine vision, although often mimicking human performance on visual tasks, works differently from the HVS.

In particular, perceptual coding might discard information relevant to the machine task performance. Compression artifacts can also ruin the performance of machines that were trained with images of different compression levels [42, 98]. Another concern with human-targeted codecs and machine end-users is that images usually contain details irrelevant to the machine task, bloating the bitrate. For example, a machine targeted at an object detection task does not require the image to contain details in the background areas.

These problems cause sub-optimal compression and performance when machines are the end-users.

Thus, image and video coding for machines is a paradigm that has emerged recently to alleviate these problems; the paradigm considers machines as the primary consumers of the image data. The new coding techniques that stem from this paradigm are expected to increase image compression while simultaneously preserving or even surpassing previous performance on machine tasks. The terms "image coding [or compression] for machines" [15, 16, 50] and "video coding for machines" [25, 27, 100] are used in the recent literature.

One potential approach to enhance image coding for machines is to adapt current traditional codecs. However, jointly training both the task neural network and the codec is challenging, as many subsystems and procedures in traditional image and video codecs are not differentiable. The encoder of a traditional image or video codec optimizes the rate-distortion(-perception) loss with a predefined visual quality metric that is designed for human viewers. The authors of [27] proposed to replace the human visual quality metrics in traditional codecs with metrics that are more suitable for machine tasks. In particular, they proposed to use a distortion metric applied on feature maps extracted from the first layers of a pretrained proxy network. However, the encoding is only partially optimized for machine tasks as many subsystems cannot be optimized, thus compromising the performance of this approach.
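To make the idea of a feature-space distortion metric concrete, the sketch below computes one in pure NumPy. A single fixed convolution followed by a ReLU stands in for the first layers of a pretrained proxy network; the function name, kernel, and images are illustrative and not taken from [27].

```python
import numpy as np

def feature_distortion(img_a, img_b, kernel):
    """MSE between feature maps produced by a stand-in 'first layer':
    one valid 2D convolution followed by a ReLU."""
    def features(img):
        h, w = img.shape
        kh, kw = kernel.shape
        out = np.zeros((h - kh + 1, w - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
        return np.maximum(out, 0.0)  # ReLU nonlinearity

    fa, fb = features(img_a), features(img_b)
    return float(np.mean((fa - fb) ** 2))

# Identical images yield zero feature-space distortion.
kernel = np.array([[1.0, 0.0], [0.0, 1.0]])
original = np.arange(16, dtype=float).reshape(4, 4)
print(feature_distortion(original, original, kernel))  # 0.0
```

Replacing a pixel-domain MSE with such a feature-space MSE steers the encoder's rate-distortion optimization toward preserving information the proxy network responds to, which is the core of the approach in [27].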

Neural network-based image coding techniques have achieved similar or even superior performance to traditional image codecs with respect to PSNR or SSIM metrics [4, 17, 62]. These systems are based on the autoencoder architecture and optimized end-to-end with a rate-distortion loss function. Since the task network is also a neural network, the compression system can be jointly trained with the task network to achieve the best performance. In [50], the authors proposed an end-to-end learned image codec where the distortion is measured by the task loss. The task network is pretrained and frozen during training. The authors of [70] proposed an end-to-end learned system targeting both human and machine consumption. In their system, the task network takes the entropy-decoded bitstream as its input and a neural network-based decoder takes the same input to reconstruct the image. The distortion loss used is the weighted sum of the task loss and the image quality loss. In [15, 16], the authors proposed solutions to optimize an end-to-end learned system with multiple task networks. Neural network-based video coding has also developed rapidly [56, 96, 97] over the last few years. Furthermore, several other techniques for image/video coding for machines have been proposed in the literature [25].
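The weighted training objectives described above can be sketched in a few lines; the function name and default weights below are illustrative, not values from the cited works.

```python
def coding_for_machines_loss(rate, task_loss, image_loss,
                             lam_task=1.0, lam_image=0.1):
    """Rate-distortion objective for joint human/machine coding:
    bitrate estimate plus a weighted sum of the machine-task loss
    and the human-oriented image quality loss."""
    return rate + lam_task * task_loss + lam_image * image_loss

# Setting lam_image = 0 recovers a purely machine-targeted objective,
# in the spirit of [50]; a nonzero lam_image also serves human viewers,
# in the spirit of [70].
print(coding_for_machines_loss(0.5, 0.2, 0.3))
```

The weights trade off bitrate against the two distortion terms; sweeping them produces the models at different operating points that are later compared as BPP-mAP pairs.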

Although end-to-end trained codecs have been shown to achieve better performance for machine tasks than traditional codecs, the output of these systems is not meant to be visualized by people. In many applications, human involvement is required or mandatory even though the task is mainly performed by machines. Traditional video codecs achieve state-of-the-art performance for humans and are broadly supported by industry. However, there has been little research on combining the strengths of traditional video codecs and neural network-based compression techniques.

Bjøntegaard model

To evaluate the performance of different image/video codecs, some metrics have been developed to standardize the comparison. One such evaluation metric is Bjøntegaard Delta Rate (BD-rate).

The proposed system is trained with different settings, and each run with a given setting produces multiple models with different average bitrates (measured in BPP) and average performances (measured in mAP) – these are referred to as BPP-mAP pairs in brief.

Similarly, the baseline system is evaluated with different settings, also leading to multiple BPP-mAP pairs. Some of these pairs are strictly better than others. To compare the best possible models with the best possible baselines, only the BPP-mAP pairs on the Pareto frontier (Pareto front) are compared. The Pareto front is the set of all Pareto-efficient data points in a multi-objective optimization problem [57]; the concept is easiest to grasp from Figure 2.22, which shows a set of possible BPP-mAP pairs and the Pareto front formed from them. In the example, points 1, 2, 4, 7 and 9 are Pareto efficient and thus form the Pareto front; points 3, 5, 6, 8 and 10 would be discarded.
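The Pareto-front selection can be sketched in a few lines of Python; the (BPP, mAP) pairs below are hypothetical, not measurements from this thesis.

```python
def pareto_front(points):
    """Return the Pareto-efficient subset of (bpp, map) pairs:
    a point is kept unless another point has BPP <= and mAP >=,
    with at least one of the two strict (i.e., it is dominated)."""
    front = [
        (bpp, m) for bpp, m in points
        if not any(
            b2 <= bpp and m2 >= m and (b2 < bpp or m2 > m)
            for b2, m2 in points
        )
    ]
    return sorted(front)  # sort by bitrate so the front reads as a curve

pairs = [(0.10, 0.30), (0.20, 0.42), (0.20, 0.35),
         (0.35, 0.50), (0.50, 0.48), (0.60, 0.55)]
print(pareto_front(pairs))  # the dominated (0.20, 0.35) and (0.50, 0.48) drop out
```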

Figure 2.22. Pareto front demonstration for arbitrary data points. The Pareto front (line) is formed from Pareto-efficient data points, i.e., points that minimize BPP and maximize mAP.

The Bjøntegaard model [6, 7] was developed to compare the performance of different video codecs, though it can be used for image codecs as well, and it has gained popularity as a standard tool for this purpose [86]. The terminology used here is as follows: the reference codec yields anchor (data) points and the proposed codec yields proposed (data) points.

Bjøntegaard Delta Rate (BD-rate) metrics can be calculated with respect to BPP-PSNR pairs or BPP-Task pairs, where "Task" refers to the task performance metric used; in this thesis, BPP-Task pairs are compared, and "Task" is mAP (described in subsection 2.3.2). The compared pairs are chosen to be the best possible pairs for each codec, i.e., the Pareto-front data points. The BD-rate represents the average percentage savings in bitrate at the same quality (PSNR) or task performance (mAP). A negative value indicates compression gain, whereas a positive value represents compression loss [86].

The computation is conducted as follows. Each proposed point is matched with the anchor point with the closest BPP. The BD-rate values are then calculated from the differences in task performance (mAP) over the range where the performance values of the anchor points and the proposed points overlap. Formally, the bitrate saving for a given proposed point relative to an anchor point is given as

\[
\Delta R(P) = \frac{R_2(P) - R_1(P)}{R_1(P)}, \tag{2.39}
\]

where $R_1(P)$ and $R_2(P)$ are the bitrates of the anchor point and the proposed point, respectively, at a given performance $P$ in mAP. The Bjøntegaard model works on a logarithmic scale for the bitrate interpolation; denoting $r = \log_{10} R$, the bitrate gain in the logarithmic scale can be formulated as

\[
\Delta R(P) = 10^{\,r_2(P) - r_1(P)} - 1. \tag{2.40}
\]

A curve is fitted to both the anchor points and the proposed points to make them differentiable; the fitted bitrates are denoted $\hat{r}_i$. Over these fitted curves, the BD-rate can be approximated as

\[
\Delta R_{\mathrm{total}} \approx 10^{\left( \frac{1}{P_H - P_L} \int_{P_L}^{P_H} \left[ \hat{r}_2(P) - \hat{r}_1(P) \right] \mathrm{d}P \right)} - 1, \tag{2.41}
\]

where $P_L$ and $P_H$ are the lowest and highest measured performance from either the anchor or the proposed points, respectively, defined as $P_L = \max\{\min P_1, \min P_2\}$ and $P_H = \min\{\max P_1, \max P_2\}$.
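Equations (2.39)-(2.41) can be turned into a short numerical sketch, assuming NumPy, a cubic polynomial fit of the log10-bitrate against mAP, and analytic integration of the fitted polynomials; the data and the simplified overlap handling are illustrative only.

```python
import numpy as np

def bd_rate(anchor_bpp, anchor_map, proposed_bpp, proposed_map, order=3):
    """Approximate BD-rate (Eq. 2.41) in percent: the average bitrate
    difference of the proposed codec over the overlapping mAP range."""
    r1 = np.log10(anchor_bpp)    # log-domain bitrates, r = log10(R)
    r2 = np.log10(proposed_bpp)

    # Overlap: PL = max of the minima, PH = min of the maxima (Eq. 2.41).
    p_low = max(min(anchor_map), min(proposed_map))
    p_high = min(max(anchor_map), max(proposed_map))

    # Fit r(P) for both codecs and integrate the fits analytically.
    int1 = np.polyint(np.polyfit(anchor_map, r1, order))
    int2 = np.polyint(np.polyfit(proposed_map, r2, order))
    avg1 = (np.polyval(int1, p_high) - np.polyval(int1, p_low)) / (p_high - p_low)
    avg2 = (np.polyval(int2, p_high) - np.polyval(int2, p_low)) / (p_high - p_low)

    return (10 ** (avg2 - avg1) - 1) * 100  # negative => bitrate saving

# Sanity check: halving the bitrate at every mAP level gives roughly -50 %.
anchor_map = [0.30, 0.40, 0.50, 0.60]
anchor_bpp = [0.20, 0.40, 0.80, 1.60]
proposed_bpp = [0.10, 0.20, 0.40, 0.80]
print(bd_rate(anchor_bpp, anchor_map, proposed_bpp, anchor_map))
```

Production implementations differ in the interpolation used (e.g., piecewise-cubic fits instead of a single polynomial), but the averaging over the overlapping performance range is the same.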

An illustration of the BD-rate computation is given in Figure 2.23.

Figure 2.23. BD-rate computation illustration. Note that these are not the Pareto curves of the actual reference codec and proposed codec in this thesis.