Discussion - Efficient Deep Learning for Person Detection

The effect of each training parameter relative to the base model is shown in Figure 7.4. Some parameters had more impact than others. In particular, the input size and the number of channels (width multiplier) can be changed to produce models with very different performance. The optimal model can be found by combining the studied parameters.

The evaluations used in this research are not perfect, because multiple different neural networks were trained for the same duration. The training process of a neural network is partly random, so the best model should be chosen based on its measured performance.

Two almost identical models may need different amounts of time to fit the data simply because of the randomness of the starting conditions. Most likely none of the trained models fits the data perfectly, and each may be either underfitted or overfitted.
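One common way to choose a model by performance rather than by training time is to evaluate checkpoints periodically and keep the one with the best validation score. The following is a minimal sketch of that idea, not the procedure used in this work. The function name `best_checkpoint`, the `patience` parameter and the score lists are hypothetical and serve only as an illustration.

```python
from typing import Optional, Sequence


def best_checkpoint(val_scores: Sequence[float], patience: int = 3) -> Optional[int]:
    """Return the index of the best evaluation seen, stopping early once
    `patience` consecutive evaluations fail to improve on it."""
    best_idx: Optional[int] = None
    since_best = 0
    for idx, score in enumerate(val_scores):
        if best_idx is None or score > val_scores[best_idx]:
            best_idx, since_best = idx, 0
        else:
            since_best += 1
            if since_best >= patience:
                break  # the validation score has stopped improving
    return best_idx


# Hypothetical validation mAP@.50 values for two runs of the same configuration.
# The numbers are made up for illustration and are not results from this work.
run_a = [0.20, 0.31, 0.38, 0.41, 0.40, 0.39, 0.39]
run_b = [0.15, 0.22, 0.30, 0.35, 0.39, 0.42, 0.41]
print(best_checkpoint(run_a), best_checkpoint(run_b))
```

In this made-up example the two runs peak at different evaluation steps, which is the situation described above: stopping both at the same step would compare models at different stages of training.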

[Figure 7.4 plot: mAP@.50 (axis ticks 0.2 to 0.6) against FPS (axis ticks 80 to 160). Legend: Input size, Width multiplier, SSD type, Inference engine, Base model.]

Figure 7.4. Effects of the training parameters compared to the base model

Some of the results from the previous sections do not behave as expected. For example, TensorRT and its quantization methods should always make the model faster, but one of the models is actually slower with TensorRT than when the evaluation is run on TensorFlow. Another example is that a few of the models with a width multiplier of 0.5 perform worse than models with the same parameters and a width multiplier of 0.35.

Comparing multiple trained models is difficult. Training the networks for the same amount of time produces models that are at different stages of training. On the other hand, searching for the best-fitting model for each trained network requires considerably more work.

As a by-product, this research gave some insight into the importance of model selection and the difficulty of predicting how a network will train.

All in all, the results in this work give a rough estimate of how much the different parameters affect the performance of networks. The largest and most relevant effects came from changing the image input size and the number of channels in the network.

Tuning the convolutional operations in the detector resulted in small effects, as did changing the inference engine. However, new approaches for running detections and optimizing the inference of the model might turn up to make the detection pipeline more efficient.

8 CONCLUSIONS

There are different approaches for bringing efficiency to an object detector. Multiple object detectors, such as SSD and YOLO, provide ideas for more efficient object detection.

Additionally, backbone classification networks used by the detectors, such as MobileNet and ShuffleNet, have also provided ideas for different architectures and types of convolutional layers. The architectures have been developed to fit different needs and can be combined to form an efficient detector network.

A part of the efficiency problem is solved with a good classifier and detector combination. To further increase the efficiency, good parameters should be picked for training a detector model. Rough guidelines on how the different researched parameters affect the efficiency of the trained SSD-MobileNetV2 object detector models are:

Parameter                    Effect
Input size                   High
Number of channels           High
Detector convolution type    Low
Inference engine             Low

Based on the evaluations on the SSD-MobileNetV2 model, the best performance is given by having an input size and a width multiplier that are as large as the application’s speed requirements allow. Plain SSD is used for the object detector because SSD-Lite does not give enough of an advantage. TensorRT is used to run inference on the model, because it gives a slight increase in speed with a small effect on accuracy.

On Jetson Nano hardware, the models with a width multiplier of 1.0 ran out of memory at run time. The best combination of parameters that resulted in an accurate model that still runs on the Jetson Nano was an input size of 400, a width multiplier of 0.5, SSD and TensorRT. This model was evaluated on an evaluation computer with a GTX 1080 GPU, where it reached 119 FPS and an accuracy (mAP@.50) of 0.42. This proved to be a good trade-off. On the Jetson Nano the model runs at around 7 FPS when using real-time video from a camera.
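The selection rule described above (take the most accurate model that still fits in memory and meets the speed requirement) can be written as a small filter over measured candidates. This is only a sketch: the `Candidate` class, the `pick_model` function and the 100 FPS threshold are hypothetical. Only the first candidate uses figures reported in the text; the other two entries contain made-up numbers for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    input_size: int
    width_multiplier: float
    fps: float            # measured speed on the evaluation machine
    map50: float          # measured mAP@.50
    fits_in_memory: bool  # whether the model loads on the target device


def pick_model(candidates: Iterable[Candidate], min_fps: float) -> Optional[Candidate]:
    """Most accurate candidate that fits in memory and meets the speed requirement."""
    feasible = [c for c in candidates if c.fits_in_memory and c.fps >= min_fps]
    return max(feasible, key=lambda c: c.map50, default=None)


candidates = [
    # Reported in the text: input size 400, width multiplier 0.5, SSD, TensorRT.
    Candidate(400, 0.5, fps=119.0, map50=0.42, fits_in_memory=True),
    # Hypothetical entries with made-up numbers, for illustration only.
    Candidate(300, 0.35, fps=150.0, map50=0.30, fits_in_memory=True),
    Candidate(400, 1.0, fps=90.0, map50=0.50, fits_in_memory=False),
]
chosen = pick_model(candidates, min_fps=100.0)
print(chosen)
```

The memory flag is kept separate from the speed threshold because a model that does not load on the target device is excluded regardless of how accurate it is.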

The aim of running a detector model on a lightweight Jetson Nano computer would be to perform real-time applications. The model runs at 7 FPS, which is enough for some applications, but not all. For example, 7 FPS would not be enough to track people in sports or to detect pedestrians for a vehicle. The application should not be too quickly changing or need extreme precision. The model could be deployed, for example, to monitor the number of people in a building’s video feed or to calculate queue lengths.

The results should be taken with a grain of salt if other classifier and detector combinations are used. Some other models might have other parameters that could be changed to affect the model’s efficiency. However, changing the input size and number of channels should have a high effect on all deep convolutional networks due to the direct effect on computations.
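The direct effect on computations can be illustrated with the usual cost estimate for a standard convolution layer, where the number of multiply-adds is proportional to the output height and width, the squared kernel size, and the input and output channel counts. The sketch below assumes that feature map sizes scale proportionally with the input size and that the width multiplier scales both the input and output channels of a layer. Both hold only approximately in a real network (the first layer, for example, has a fixed number of input channels). The function names and the base input size of 300 are assumptions made for illustration.

```python
def conv_madds(out_h: int, out_w: int, kernel: int, c_in: int, c_out: int) -> int:
    """Multiply-adds of one standard convolution layer."""
    return out_h * out_w * kernel * kernel * c_in * c_out


def relative_cost(input_size: float, width_multiplier: float,
                  base_input_size: float, base_width: float = 1.0) -> float:
    """Approximate compute relative to a base model, assuming the cost grows
    with the square of the input size and the square of the width multiplier."""
    resolution_ratio = input_size / base_input_size
    width_ratio = width_multiplier / base_width
    return resolution_ratio ** 2 * width_ratio ** 2


# Halving both channel counts of a layer quarters its cost.
full = conv_madds(20, 20, 3, 64, 128)
half = conv_madds(20, 20, 3, 32, 64)
print(half / full)  # 0.25

# Hypothetical comparison against an assumed base model with input size 300
# and width multiplier 1.0.
print(relative_cost(400, 0.5, base_input_size=300))
```

Because both parameters enter the estimate quadratically, small changes to either one shift the compute cost much more than changes that only touch part of the network, such as the convolution type of the detector layers.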

The number of trained models for the research could be higher to more accurately determine the effects in a larger range of parameters. Another problem that makes comparisons complicated is that the networks might not learn the data in the same amount of time. The research gives a rough estimate of the effects of different parameters, but networks used in any real-world applications should be trained for the correct amount of time.

During the work on this thesis, a new object detection network called EfficientDet [40] was released. It promises efficient object detection compared to other state-of-the-art detectors. Studying its performance in mobile applications would be beneficial and it might prove to be better than other lightweight detectors at the moment. The ideas presented in this work could be used to further optimize the EfficientDet model to get extremely efficient detections.

Neural networks have come a long way in recent years. Object detection with neural networks in particular might have been close to impossible only a dozen years ago. These days even small computers have a large computational capacity and can run fairly complicated networks. Still, focusing on efficiency when designing neural network architectures can unlock even more interesting applications if the hardware costs can be brought down.

The trade-off between speed and accuracy needs to be considered carefully for each application. Critical applications concerning the safety of people need special consideration. Self-driving cars require both speed and accuracy to ensure the safety of traffic. Detection of diseases and anomalies in medical imaging requires as much accuracy as possible, but speed is not relevant. Non-critical applications, such as counting people in public spaces, might require real-time speed. However, losing a few detections due to the model’s inaccuracy does not severely affect the total estimates.

The results from this work can be used to make parameter selection faster in future implementations of neural networks. Training multiple detector models takes a lot of time, so knowing the correct parameters beforehand for training a model can save time and resources when developing an object detection solution.

REFERENCES

[1] Li, Y., Chen, Y., Wang, N. and Zhang, Z.-X. Scale-aware trident networks for object detection. Vol. 2019-October. 2019, 6053–6062.

[2] Singh, B., Najibi, M. and Davis, L. Sniper: Efficient multi-scale training. Vol. 2018-December. 2018, 9310–9320.

[3] Bishop, C. M. Pattern recognition and machine learning. Information science and statistics. Softcover published in 2016. New York, NY: Springer, 2006.

[4] Goodfellow, I., Bengio, Y. and Courville, A. Deep Learning. MIT Press, 2016.

[5] Kohavi, R. and Provost, F. Glossary of Terms. Machine Learning 30 (1998), 271–274.

[6] Hui, J. mAP (mean Average Precision) for Object Detection. Medium (2018).

[7] Nielsen, M. A. Neural networks and deep learning. Vol. 2018. Determination Press, San Francisco, CA, USA, 2015.

[8] He, K., Zhang, X., Ren, S. and Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. arXiv (2015).

[9] Smith, S. W. CHAPTER 6 - Convolution. Digital Signal Processing. Ed. by S. W. Smith. Boston: Newnes, 2003, 107–122.

[10] Zeiler, M. D. and Fergus, R. Visualizing and Understanding Convolutional Networks. arXiv (2013).

[11] He, K., Zhang, X., Ren, S. and Sun, J. Deep Residual Learning for Image Recognition. CoRR abs/1512.03385 (2015).

[12] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C. and Fei-Fei, L. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115.3 (2015), 211–252.

[13] Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M. and Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. CoRR abs/1704.04861 (2017).

[14] Sandler, M., Howard, A., Zhu, M., Zhmoginov, A. and Chen, L.-C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). June 2018.

[15] Iandola, F. N., Moskewicz, M. W., Ashraf, K., Han, S., Dally, W. J. and Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. CoRR abs/1602.07360 (2016).

[16] Han, S., Mao, H. and Dally, W. J. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. arXiv (2015).

[17] Zhang, X., Zhou, X., Lin, M. and Sun, J. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. CoRR abs/1707.01083 (2017).

[18] Krizhevsky, A., Sutskever, I. and Hinton, G. E. ImageNet Classification with Deep Convolutional Neural Networks. Communications of the ACM 60.6 (2017), 84–90.

[19] Ma, N., Zhang, X., Zheng, H.-T. and Sun, J. ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. arXiv (2018).

[20] Dalal, N. and Triggs, B. Histograms of oriented gradients for human detection. 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05). Vol. 1. June 2005, 886–893.

[21] Viola, P. and Jones, M. J. Robust Real-Time Face Detection. International Journal of Computer Vision 57.2 (2004), 137–154.

[22] Huang, J., Rathod, V., Sun, C., Zhu, M., Korattikara, A., Fathi, A., Fischer, I., Wojna, Z., Song, Y., Guadarrama, S. and Murphy, K. Speed/Accuracy Trade-Offs for Modern Convolutional Object Detectors. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). July 2017.

[23] Ren, S., He, K., Girshick, R. and Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. Ed. by C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama and R. Garnett. Curran Associates, Inc., 2015, 91–99.

[24] Girshick, R. Fast R-CNN. Proceedings of the IEEE international conference on computer vision. 2015, 1440–1448.

[25] Dai, J., Li, Y., He, K. and Sun, J. R-FCN: Object Detection via Region-based Fully Convolutional Networks. Advances in Neural Information Processing Systems 29. Ed. by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon and R. Garnett. Curran Associates, Inc., 2016, 379–387.

[26] Long, J., Shelhamer, E. and Darrell, T. Fully Convolutional Networks for Semantic Segmentation. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). June 2015.

[27] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y. and Berg, A. C. SSD: Single Shot MultiBox Detector. Lecture Notes in Computer Science (2016), 21–37.

[28] Redmon, J., Divvala, S., Girshick, R. and Farhadi, A. You only look once: Unified, real-time object detection. Proceedings of the IEEE conference on computer vision and pattern recognition. 2016, 779–788.

[29] Redmon, J. and Farhadi, A. YOLO9000: Better, Faster, Stronger. CoRR (2016).

[30] Redmon, J. and Farhadi, A. YOLOv3: An Incremental Improvement. CoRR abs/1804.02767 (2018). URL: http://arxiv.org/abs/1804.02767.

[31] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V. and Rabinovich, A. Going Deeper with Convolutions. Computer Vision and Pattern Recognition (CVPR). 2015. URL: http://arxiv.org/abs/1409.4842.

[32] Law, H. and Deng, J. CornerNet: Detecting Objects as Paired Keypoints. The European Conference on Computer Vision (ECCV). Sept. 2018.

[33] Law, H., Teng, Y., Russakovsky, O. and Deng, J. CornerNet-Lite: Efficient Keypoint Based Object Detection. CoRR abs/1904.08900 (2019). URL: http://arxiv.org/abs/1904.08900.

[34] Bodla, N., Singh, B., Chellappa, R. and Davis, L. S. Soft-NMS – Improving Object Detection With One Line of Code. arXiv (2017).

[35] Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J. and Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. International Journal of Computer Vision 88.2 (June 2010), 303–338.

[36] Lin, T., Maire, M., Belongie, S. J., Bourdev, L. D., Girshick, R. B., Hays, J., Perona, P., Ramanan, D., Dollár, P. and Zitnick, C. L. Microsoft COCO: Common Objects in Context. CoRR abs/1405.0312 (2014).

[37] Geiger, A., Lenz, P., Stiller, C. and Urtasun, R. Vision meets Robotics: The KITTI Dataset. International Journal of Robotics Research (IJRR) (2013).

[38] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Jia, Y., Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu and Xiaoqiang Zheng. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. (2015). URL: https://www.tensorflow.org/.

[39] Speeding Up Deep Learning Inference Using TensorRT. Nvidia Developer Blog (2019).

[40] Tan, M., Pang, R. and Le, Q. V. EfficientDet: Scalable and Efficient Object Detection. arXiv (2019).
