
3 PROPOSED METHOD

3.4 Training Strategy

The objective of the proposed system is to learn to maximize the compression of the input images (or their feature representations) while minimizing distortion, measured as the decline in downstream computer vision task performance. This is the rate-distortion optimization (RDO) problem discussed in subsection 2.1.2. In practice, the proposed system is trained using a multi-objective loss function; loss functions were discussed in subsection 2.2.5. This loss function comprises multiple loss terms: a downstream computer vision task loss function $L_{task}$, a bitrate loss function $L_{rate}$ and an auxiliary mean squared error (MSE) loss $L_{MSE}$; these are later referred to as the task loss, rate loss and MSE loss for short.

To reach a specific encoding bitrate, the loss terms are weighted using Lagrange multipliers, denoted as $\lambda$, which were also discussed together with the loss functions. Deep learning based systems are usually trained with various $\lambda$ values to find suitable bitrate-distortion pairs after sufficient training. The trained models are then saved for the favourable bitrate-distortion pairs and can later be employed to compress further images for a target bitrate or a target distortion [4, 17].
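As a toy illustration of this practice, the sketch below trains one placeholder codec per $\lambda$ value and stores a checkpoint per operating point; the linear model, random data and rate proxy are hypothetical stand-ins for this example, not components of the proposed system.

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Toy sketch: one model per lambda value, one saved checkpoint per
# rate-distortion operating point. Everything below is an illustrative placeholder.
lambda_values = [0.01, 0.1, 1.0]                   # illustrative lambda values

for lam in lambda_values:
    codec = nn.Linear(64, 64)                      # stand-in for the actual codec network
    optimizer = optim.Adam(codec.parameters(), lr=1e-4)
    for step in range(100):                        # toy training loop
        x = torch.randn(8, 64)                     # placeholder input features
        x_hat = codec(x)
        distortion = torch.mean((x - x_hat) ** 2)  # distortion term
        rate = torch.mean(torch.abs(x_hat))        # placeholder proxy for the bitrate
        loss = distortion + lam * rate             # rate-distortion Lagrangian
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    torch.save(codec.state_dict(), f"codec_lambda_{lam}.pt")
```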

Essentially, the neural networks of the feature residual codec are trained using the multi-objective loss function, which can be formulated as

$$L_{total} = w_{rate} L_{rate} + w_{task} L_{task} + w_{MSE} L_{MSE}, \qquad (3.2)$$

where $L_{rate}$, $L_{task}$ and $L_{MSE}$ are the loss terms, and $w_{rate}$, $w_{task}$ and $w_{MSE}$ are the corresponding Lagrange multipliers, referred to as weights. $L_{rate}$ is obtained from the entropy model as discussed in section 3.2, $L_{task}$ is obtained from the task network as discussed in subsection 2.2.7, and $L_{MSE}$ is the mean squared error between the feature residuals $f_r$ and the decoded feature residuals $\hat{f}_r$. Although the MSE metric is not directly related to the objectives of the proposed system (increasing downstream computer vision task performance and decreasing bitrate), it is introduced to stabilize the training, inspired by the research in [50].
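For illustration, the following is a minimal PyTorch-style sketch of how the total loss in Equation 3.2 could be assembled; the function and tensor names, the bits-per-pixel rate estimate and the cross-entropy task loss are assumptions made for this example, not the exact implementation of the proposed system.

```python
import torch
import torch.nn.functional as F

def total_loss(likelihoods, task_logits, task_targets,
               feat_residuals, decoded_residuals, num_pixels,
               w_rate, w_task, w_mse):
    """Multi-objective loss of Equation 3.2 (illustrative sketch only)."""
    # Rate loss: expected code length from the entropy model's likelihoods,
    # expressed here as bits per pixel (an assumed convention).
    rate_loss = -torch.log2(likelihoods).sum() / num_pixels

    # Task loss: a cross-entropy classification loss stands in for whatever
    # downstream vision task loss the task network provides.
    task_loss = F.cross_entropy(task_logits, task_targets)

    # Auxiliary MSE loss between the feature residuals f_r and the decoded
    # feature residuals.
    mse_loss = F.mse_loss(decoded_residuals, feat_residuals)

    return w_rate * rate_loss + w_task * task_loss + w_mse * mse_loss
```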

Choosing proper weights for all loss terms requires extensive hyperparameter search and fine-tuning. On one hand, a weighting scheme with larger $w_{task}$ and $w_{MSE}$ would result in a system that achieves increased performance on the machine tasks, at the cost of a higher bitrate. On the other hand, a weighting scheme with an overly large $w_{rate}$ would result in a system that compresses the feature residuals efficiently, but these feature residuals might not sufficiently enhance the features at the decoder side with respect to the machine task performance.

The authors of [50] introduced a dynamic loss weighting strategy for efficient training with multiple loss terms. In their study, a dynamic loss weighting strategy is used because static weights offer little flexibility in reaching a good balance between the objectives. Additionally, it allows obtaining various models with a wide range of desired bitrates and task performances, i.e., rate-distortion trade-off pairs, within a single training process.

Albeit promising, this approach could have downsides. There is an inherent risk in dynamic loss weighting: if the weighting changes too rapidly, the system may only find sub-par rate-distortion trade-off pairs, since it does not have enough iterations to converge on an optimal solution. However, this potential problem can be prevented by making sure the weighting remains stable long enough for the system to converge before the weights change.

The proposed system adopts the aforementioned dynamic loss weighting strategy. The weighting is visualized in Figure 3.5 as a function of the epoch. The training strategy starts by weighting the MSE loss with a constant value, whereas the other losses are weighted with zero weight. While the weight of the MSE loss $w_{MSE}$ remains constant throughout the whole training process, $w_{task}$ and $w_{rate}$ are weighted as a function of the epoch. Therefore, the system trains with only $L_{MSE}$ during the initial epochs, which can be considered a warm-up period. The warm-up period is considered important for a system that includes both randomly initialized and pretrained neural networks, because of the unstable behavior of the pretrained neural networks when working on out-of-domain samples [50, 88, 94].

Figure 3.5. Loss weighting strategy. Dotted lines separate different phases. Note that the y-axis scaling is linear up to 1 and logarithmic thereafter.

As in [50], the training is divided into five phases, with increased weights for the losses relevant to the system objectives, $w_{rate}$ and $w_{task}$, during the latter phases; the rate loss is gradually assigned a dominant weight during the final phase. In this way, the task performance is likely to degrade, but the system finds increasingly compressed representations for the feature residuals, so that a wide range of rate-distortion pairs is covered.
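As a sketch, such a phase-based weighting schedule could be expressed as a simple lookup over epoch ranges; the phase boundaries and weight values below are hypothetical placeholders and do not correspond to the values used in [50] or in the proposed system.

```python
# Illustrative phase-based loss-weight schedule: an MSE-only warm-up phase,
# followed by phases that gradually increase w_task and finally let w_rate
# dominate. Epoch boundaries and weight values are hypothetical placeholders.
PHASES = [
    # (first epoch of phase, w_rate, w_task, w_mse)
    (0,   0.0,  0.0, 1.0),   # phase 1: MSE-only warm-up
    (10,  0.01, 0.1, 1.0),   # phase 2: task loss introduced
    (20,  0.05, 1.0, 1.0),   # phase 3: task loss dominant
    (30,  0.5,  1.0, 1.0),   # phase 4: rate loss ramped up
    (40,  5.0,  1.0, 1.0),   # phase 5: rate loss dominant
]

def loss_weights(epoch):
    """Return (w_rate, w_task, w_mse) for the given training epoch."""
    weights = PHASES[0][1:]
    for start, *phase_weights in PHASES:
        if epoch >= start:
            weights = tuple(phase_weights)
    return weights
```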

4 EXPERIMENTS

In this section, the completed experiments are described in detail. The first part describes the experimental setup, including information about the datasets used for training and testing the proposed system, the task network used, the evaluation method and baseline system, and the refined training strategy. The second part discusses the experimental results, featuring the rate-accuracy curve of the results, an analysis of the impact of the bitrate of the decoded feature residuals $\hat{f}_r$ on the performance of the proposed system, the Bjøntegaard delta (BD) rate gain of the experimental results over the baseline system and, finally, various visualizations with examples that illustrate why the proposed system achieves highly competitive performance.

4.1 Experimental Setup