
Effects of parameters

Training each of the models took between 15 and 25 hours, depending on model complexity. A computer with a GTX 1080 GPU was used for the training. The resulting speeds and accuracies of the trained models are collected in Table 7.1. The mAP column reports the mAP averaged over IoU thresholds from 0.5 to 0.95, while the mAP@.50 column reports the mAP at the single IoU threshold of 0.5. Most of the results from the TFLite evaluations are left out because of their low relevance compared to the other results and to simplify the table.
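The relationship between the two reported metrics can be illustrated with a short sketch. The per-threshold AP values below are hypothetical, chosen to land near the base model's row; the only assumption is the COCO-style convention of averaging AP over ten IoU thresholds from 0.50 to 0.95 in steps of 0.05:

```python
# COCO-style mAP: average AP over the IoU thresholds 0.50, 0.55, ..., 0.95.
thresholds = [round(0.50 + 0.05 * i, 2) for i in range(10)]

# Hypothetical AP at each threshold; detection gets harder as the
# IoU requirement tightens, so AP falls toward the high thresholds.
aps = [0.41, 0.38, 0.33, 0.28, 0.22, 0.16, 0.11, 0.07, 0.03, 0.01]

map_50_95 = sum(aps) / len(aps)  # the averaged "mAP" column
map_50 = aps[0]                  # the "mAP@.50" column uses only IoU 0.5

print(f"mAP (0.5:0.95) = {map_50_95:.2f}")
print(f"mAP@.50        = {map_50:.2f}")
```

Because the averaged metric also counts the strict high-IoU thresholds, it is always at most the mAP@.50 value, which is why the two columns in Table 7.1 differ by roughly a factor of two.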

Input Size   Width Multiplier   SSD Type   Inference Engine   mAP    mAP@.50   FPS
160          0.35               SSD        TensorFlow         0.12   0.28      158
160          0.35               SSD        TensorRT           0.12   0.30      154
160          0.5                SSD        TensorFlow         0.12   0.29      144
160          0.5                SSD        TensorRT           0.12   0.30      140
160          1.0                SSD        TensorFlow         0.16   0.35      129
160          1.0                SSD        TensorRT           0.16   0.36      130
224          0.35               SSD        TensorFlow         0.16   0.34      151
224          0.35               SSDLite    TensorFlow         0.17   0.36      155
224          0.35               SSD        TensorRT           0.16   0.34      151
224          0.35               SSDLite    TensorRT           0.17   0.36      149
224          0.5                SSD        TensorFlow         0.16   0.35      140
224          0.5                SSDLite    TensorFlow         0.15   0.32      147
224          0.5                SSD        TensorRT           0.16   0.35      144
224          0.5                SSDLite    TensorRT           0.15   0.32      149
224          1.0                SSD        TensorFlow         0.20   0.41      121
224          1.0                SSDLite    TensorFlow         0.19   0.39      124
224          1.0                SSD        TensorRT           0.21   0.41      125
224          1.0                SSDLite    TensorRT           0.19   0.39      121
224          1.0                SSD        TFLite             0.16   0.35      6.3
400          0.35               SSD        TensorFlow         0.21   0.41      120
400          0.35               SSD        TensorRT           0.19   0.38      125
400          0.5                SSD        TensorFlow         0.23   0.44      112
400          0.5                SSD        TensorRT           0.22   0.42      119
400          1.0                SSD        TensorFlow         0.26   0.49      94
400          1.0                SSD        TensorRT           0.24   0.45      99

Table 7.1. Results of the detector evaluations.

Figure 7.1. Effects of the input size (mAP@.50 vs. FPS). Blue = 160x160, orange = 224x224, green = 400x400.

Table 7.1 contains a large amount of information, and no meaningful conclusions can be drawn just by glancing at the data. To get a better understanding of the results, the effect of each parameter needs to be analysed separately.

In the rest of this section, the graphs of parameter effects compare each configuration only against the base model. This keeps the graphs simple; other information from the evaluation table is brought in where needed. The accuracy of an object detector is commonly reported and compared at the 0.5 IoU threshold, so that metric is used in the following graphs.

Effects of input size

The input sizes experimented with in this study were 160x160, 224x224 and 400x400. The effects of input size on model accuracy and speed can be seen in Figure 7.1. The input size has a large effect on model performance. This is understandable, because the objects become extremely small in images of only 160x160 pixels. The people in the dataset's images can be quite small even in the full-scale images, which can be over 1000 pixels along one dimension, so radically dropping the input size can make most detections impossible.

The trade-off between speed and accuracy is quite linear within the researched scope. Decreasing the input size by 60% increases the speed by 36% and decreases the accuracy by 27%. However, this relationship does not stay linear over a larger range. As the input size approaches zero, it becomes impossible to detect any objects. On the other hand, the network can already detect objects well from reasonably sized images, so increasing the input size to be extremely large would make the model very slow with only a small gain in accuracy. The accuracy might even get worse due to the large amount of data in each image.
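The stated percentages can be approximately reproduced from Table 7.1 by comparing the 160x160 and 400x400 TensorFlow rows with width multiplier 1.0; the small differences from the figures quoted above presumably come from rounding or from averaging over several configurations. A minimal sketch:

```python
# Relative changes when shrinking the input from 400x400 to 160x160
# (width multiplier 1.0, SSD, TensorFlow rows of Table 7.1).

def rel_change(new, old):
    """Relative change from old to new, as a signed fraction."""
    return (new - old) / old

fps_400, fps_160 = 94, 129          # frames per second from the table
map50_400, map50_160 = 0.49, 0.35   # mAP@.50 from the table

size_drop = rel_change(160, 400)            # input side shrinks by 60 %
speed_gain = rel_change(fps_160, fps_400)   # roughly +37 %
acc_drop = rel_change(map50_160, map50_400) # roughly -29 %

print(f"input size: {size_drop:+.0%}, "
      f"speed: {speed_gain:+.0%}, accuracy: {acc_drop:+.0%}")
```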

Figure 7.2. Effects of the MobileNetV2 width multiplier (mAP@.50 vs. FPS). Blue = 0.35, orange = 0.5, green = 1.0.

Effects of number of channels

The backbone MobileNetV2 width multiplier was also experimented with. The effects on speed and accuracy with the values 0.35, 0.5 and 1.0 can be seen in Figure 7.2. The width multiplier has an effect on the model's performance similar in magnitude to that of the input size. Dropping the number of channels in the backbone network makes the model faster, but the accuracy drops as well.

Dropping the width multiplier to roughly a third of its initial value increases the FPS by 31% but drops the accuracy by 29%. The effects appear linear within the scope of this research, but they are probably not linear if the number of channels is made very large or very small.

Modifying the backbone network's number of channels can thus be used to shrink or grow the model quite accurately, with a predictable effect, to obtain a model that fits the user's needs. The MobileNetV2 architecture is designed with the width multiplier in mind, but other networks could benefit from similar thinning to get a slightly lighter model.
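How the width multiplier thins the network can be sketched with the channel-rounding rule used in the reference MobileNet implementations: each layer's channel count is scaled by the multiplier and rounded to a hardware-friendly multiple of 8. The base channel counts below are those of the first MobileNetV2 layers; this is an illustrative sketch, not the full architecture:

```python
def make_divisible(v, divisor=8, min_value=None):
    """Round v to the nearest multiple of divisor, never dropping
    below 90 % of the original value (the rounding rule used in the
    reference MobileNet implementations)."""
    if min_value is None:
        min_value = divisor
    new_v = max(min_value, int(v + divisor / 2) // divisor * divisor)
    if new_v < 0.9 * v:
        new_v += divisor
    return new_v

def scaled_channels(base_channels, width_multiplier):
    """Channel count of a layer after applying the width multiplier."""
    return make_divisible(base_channels * width_multiplier)

# Output channels of the early MobileNetV2 layers at full width.
base = (32, 16, 24, 32, 64)
for alpha in (0.35, 0.5, 1.0):
    print(alpha, [scaled_channels(c, alpha) for c in base])
```

With alpha = 0.35 the first layers shrink from (32, 16, 24, 32, 64) channels to (16, 8, 8, 16, 24), which is where the roughly linear speed-up comes from: almost every layer in the network carries proportionally fewer channels.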

Effects of detector convolution type

Using normal 3x3 convolutions in the SSD detector versus the combined pointwise and depthwise convolutions of the SSDLite detector has only a small effect. Compared to the base model, using SSDLite instead of SSD offers a 2% speed-up with a 3% decrease in accuracy. This can be used to gain a few extra frames per second if needed, but otherwise SSDLite is not very helpful.

The smallest SSDLite model, with an input size of 160x160 and a width multiplier of 0.35, was both more accurate and faster than the corresponding SSD model. However, this might result from the models learning at different paces. The other models were generally slightly faster and less accurate when using SSDLite.
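The parameter savings behind SSDLite can be illustrated by counting the weights of a standard 3x3 convolution against a depthwise separable one. The channel counts below are example values, not the exact dimensions of the SSD head in this work:

```python
def conv_params(k, c_in, c_out):
    """Weights of a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    """Depthwise k x k convolution followed by a 1x1 pointwise one."""
    return k * k * c_in + c_in * c_out

k, c_in, c_out = 3, 256, 256  # illustrative sizes only
standard = conv_params(k, c_in, c_out)
separable = separable_conv_params(k, c_in, c_out)
print(f"standard: {standard}, separable: {separable}, "
      f"ratio: {standard / separable:.1f}x")
```

The separable variant needs almost 9x fewer weights here, yet the measured speed-up in Table 7.1 is only a few percent, since the detector head is a small part of the whole model's computation.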

Figure 7.3. Effects of inference engines (mAP@.50 vs. FPS; TFLite at 6.3 FPS). Blue = TFLite, orange = TensorRT, green = TensorFlow.

Effects of inference engines

The evaluations were run on the TensorFlow, TensorFlow Lite and TensorRT inference engines. Their results can be seen in Figure 7.3. TensorFlow Lite does not use the GPU during inference and is instead intended to slightly improve CPU performance. However, using a GPU is much faster when one is available, so TensorFlow Lite fares poorly in this comparison. It should only be considered when a GPU is not available, although neural networks might not be a good choice in a situation where only a CPU is available for computation.
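FPS figures such as those in this comparison are typically obtained with a timing loop around the inference call, with a warm-up phase so that one-time graph or engine initialisation does not skew the average. A generic sketch, where `run_inference` is a placeholder for whichever engine's predict call is being benchmarked:

```python
import time

def measure_fps(run_inference, n_runs=100, n_warmup=10):
    """Average frames per second over n_runs calls, after a warm-up
    phase that excludes one-time initialisation costs."""
    for _ in range(n_warmup):
        run_inference()
    start = time.perf_counter()
    for _ in range(n_runs):
        run_inference()
    elapsed = time.perf_counter() - start
    return n_runs / elapsed

# Placeholder standing in for a TensorFlow / TensorRT / TFLite call.
def dummy_inference():
    time.sleep(0.001)  # pretend the model takes about a millisecond

fps = measure_fps(dummy_inference)
print(f"{fps:.0f} FPS")
```

Averaging over many runs matters especially for GPU engines, whose first calls can be an order of magnitude slower than steady-state inference.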

For the base model, TensorRT gives accuracy identical to TensorFlow but is 3% faster. In most of the other comparisons between TensorRT and TensorFlow the benefit is smaller, but TensorRT should in any case bring a slight increase in inference speed. According to the TensorRT documentation, TensorRT should have hardly any effect on a model's accuracy.