
The results achieved by the proposed system are impressive. The BD-rate gain versus the state-of-the-art video coding standard VVC/H.266, −40.49%, is significant in the low bitrate range [0.012, 0.072]. At higher compression levels, the feature residual codec learned to retain important details while discarding unimportant ones. In addition, the optimal bitrate for the decoded feature residuals (f̂_r) was found to lie in the range [0.01, 0.02].
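For context, the BD-rate figures quoted above follow the Bjøntegaard procedure [6, 7]: fit each rate-quality curve with a cubic polynomial in log-rate, integrate over the overlapping quality interval, and convert the average log-rate difference into a percentage. A minimal sketch (function and argument names are illustrative, not from the thesis):

```python
import numpy as np

def bd_rate(rate_anchor, metric_anchor, rate_test, metric_test):
    """Bjontegaard delta rate: average bitrate difference (%) between two
    rate-metric curves at equal quality, via cubic fits in log-rate."""
    lr_a, lr_t = np.log(rate_anchor), np.log(rate_test)
    # Fit log-rate as a cubic polynomial of the quality metric.
    p_a = np.polyfit(metric_anchor, lr_a, 3)
    p_t = np.polyfit(metric_test, lr_t, 3)
    # Integrate both fits over the overlapping quality interval.
    lo = max(min(metric_anchor), min(metric_test))
    hi = min(max(metric_anchor), max(metric_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_log_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_log_diff) - 1) * 100  # negative => bitrate savings
```

A negative return value means the tested codec needs less bitrate than the anchor for the same quality, which is the sense in which the −40.49% above is a gain.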

Nevertheless, numerous ways to potentially improve upon and extend the proposed system were identified. Most importantly, the next logical step would be to expand the system to multiple machine tasks. In particular, instance segmentation is closely associated with object detection, meaning that the task networks for both tasks could share the same feature extraction stage [31]. Since the cut point of the proposed system is relatively close to the input, it would likely be possible to extend the system with computer vision tasks that are less similar to object detection. In fact, experimenting with different split points for the task network would be interesting; there is no guarantee that the chosen split point of the proposed system optimizes the generality and compressibility traits discussed in Section 3.3.

Another extension of the proposed system would be to test it on other datasets, such as the Open Images Dataset [48], which consists of around 9 million varied images with rich annotations. This would help alleviate potential problems with the Cityscapes dataset used here, or simply provide an additional dataset that makes the obtained results of the proposed method more reliable.

Although the proposed system was implemented for image data, it should in principle be possible to adapt it to operate in the video domain; this can be seen as a natural next step. A less apparent but interesting extension would be to use the proposed system to enhance the human stream, i.e., to target human consumption in an architecture with a feedback connection from the machine stream to the human stream.

The conducted experiments did not cover every possible setting and hyperparameter combination. Settings with α = 1 and α = 2 were only carried out for QP = 48. Only one downscaling parameter was used; it would have been interesting to see how the results look with resolution percentages 25% and 75%. The training strategy, although having a definite rationale behind it, is not exhaustively optimized and could be altered. For example, the choices of the parameters α, β and γ were made on the basis of trends observed over 10 to 20 different experiments. The architecture of the feature residual codec has proven to work in the conducted experiments, but it could probably be optimized with neural architecture search (NAS). Using a deeper entropy model could improve the results further, at the cost of longer training times. Ultimately, there are probably countless other design choices and hyperparameters that could be adjusted to improve the performance of the proposed system; this is not meant, however, to diminish the strong results achieved.
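The parameters α, β and γ weight competing terms of the training objective. The following schematic shows how such a weighted objective might be combined; the term names and the exact form are assumptions for illustration, not the thesis' actual loss:

```python
import numpy as np

def combined_loss(likelihoods, feat, feat_hat, task_loss, num_pixels,
                  alpha=1.0, beta=1.0, gamma=1.0):
    """Hypothetical weighted objective: rate of the coded feature residual
    (estimated from entropy-model likelihoods) plus feature-domain
    distortion and the task loss, each scaled by a tunable weight."""
    bpp = -np.log2(likelihoods).sum() / num_pixels   # rate in bits per pixel
    mse = np.mean((feat - feat_hat) ** 2)            # feature-domain distortion
    return alpha * bpp + beta * mse + gamma * task_loss
```

Sweeping such weights over a handful of runs and inspecting the trends, as was done here for α, β and γ, is a pragmatic alternative to an exhaustive grid search.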

6 CONCLUSIONS

Image compression for machines has emerged only recently, yet the possibility for human supervision is often also required. The objective was to propose a new image coding technique for machines that fulfills this requirement and surpasses traditional image codecs' performance in computer vision applications at a given bitrate. This thesis proposed a method that accomplishes the objective. By combining features of highly compressed images with compressed feature residuals, the implemented end-to-end system outperformed the current state-of-the-art video coding standard VVC/H.266, achieving an average BD-rate reduction of −40.5% in the low bitrate range of [0.01, 0.07].

The proposed system consists of two branches: (1) the human branch, a state-of-the-art traditional video codec that provides the compressed image for human consumption; and (2) the machine branch, an end-to-end learned neural network-based video compression system that provides feature residuals to improve the performance of machine tasks.
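The flow of data through the two branches can be sketched as follows; the function names are placeholders, and the real codecs (VVC and the learned residual codec) are stubbed out as arguments:

```python
import numpy as np

def encode_decode(x, backbone, human_codec, residual_codec):
    """Schematic of the two-branch pipeline: the human branch carries the
    codec-compressed image, while the machine branch carries only the
    residual between features of the original and the compressed image."""
    x_hat = human_codec(x)          # human branch: codec-reconstructed image
    f = backbone(x)                 # features of the original image
    f_base = backbone(x_hat)        # features of the compressed image
    r = f - f_base                  # feature residual to transmit
    r_hat = residual_codec(r)       # learned lossy codec for the residual
    f_hat = f_base + r_hat          # enhanced features for the task head
    return x_hat, f_hat
```

With stand-in functions this makes the switch-off property concrete: if the residual codec is disabled (returns zeros), the task head simply runs on the features of the human-branch image.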

The machine branch, which provides an enhancement to the task performance, can be switched off when the human branch alone yields satisfactory performance. It was observed that the feature residual codec achieves high compression of the feature residuals by discarding textures in background regions while retaining the important information in object regions. The optimal decoded feature residual bitrate was between 0.01 and 0.02 BPP in all conducted experiments.

In the future, the proposed method could be extended to multiple machine tasks. The system performance could probably be further improved by tuning the training hyperparameters, applying a more tailored loss weighting strategy and using a deeper entropy model, among other measures. Furthermore, neural network-based video coding has developed rapidly over the last few years; extending the proposed method to video coding for machines has become technically feasible and promising.

REFERENCES

[1] O. Aftab, P. Cheung, A. Kim, S. Thakkar and N. Yeddanapudi. Information Theory: Information Theory and the Digital Age. Final paper, Project History, Massachusetts Institute of Technology (2001).

[2] S. P. Bacon. Overview of auditory compression. Compression: From Cochlea to Cochlear Implants. Springer, 2004, 1–17.

[3] M. Bai and R. Urtasun. Deep watershed transform for instance segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, 5221–5229.

[4] J. Ballé, D. Minnen, S. Singh, S. J. Hwang and N. Johnston. Variational image compression with a scale hyperprior. International Conference on Learning Representations. 2018.

[5] C. M. Bishop. Pattern recognition and machine learning. Springer, 2006.

[6] G. Bjontegaard. Calculation of average PSNR differences between RD-curves. VCEG-M33 (2001).

[7] G. Bjontegaard. Improvements of the BD-PSNR model, VCEG-AI11. ITU-T Q.6/SG16, 34th VCEG Meeting, Berlin, Germany (July 2008). 2008.

[8] Y. Blau and T. Michaeli. Rethinking lossy compression: The rate-distortion-perception tradeoff. International Conference on Machine Learning. PMLR. 2019, 675–685.

[9] J. Boelaert and É. Ollion. The Great Regression. Revue française de sociologie 59.3 (2018), 475–506.

[10] L. Bottou. Large-scale machine learning with stochastic gradient descent. Proceedings of COMPSTAT'2010. Springer, 2010, 177–186.

[11] T. Boutell. RFC 2083: PNG (Portable Network Graphics) Specification Version 1.0. 1997.

[12] K. Brandenburg. MP3 and AAC explained. Audio Engineering Society Conference: 17th International Conference: High-Quality Audio Coding. Audio Engineering Society. 1999.

[13] B. Bross, J. Chen, S. Liu and Y.-K. Wang. Versatile Video Coding (Draft 8). Joint Video Experts Team (JVET), Document JVET-Q2001 (Jan. 2020).

[14] M. Cadik and P. Slavik. Evaluation of two principal approaches to objective image quality assessment. Proceedings. Eighth International Conference on Information Visualisation, 2004. IV 2004. IEEE. 2004, 513–518.

[15] L. D. Chamain, F. Racapé, J. Bégaint, A. Pushparaja and S. Feltman. End-to-end optimized image compression for machines, a study. arXiv:2011.06409 [cs, eess] (Nov. 2020). URL: http://arxiv.org/abs/2011.06409.

[16] L. D. Chamain, F. Racapé, J. Bégaint, A. Pushparaja and S. Feltman. End-to-end optimized image compression for multiple machine tasks. arXiv:2103.04178 [cs] (Mar. 2021). URL: http://arxiv.org/abs/2103.04178.

[17] Z. Cheng, H. Sun, M. Takeuchi and J. Katto. Learned Image Compression with Discretized Gaussian Mixture Likelihoods and Attention Modules. 2020. arXiv: 2001.01568 [eess.IV].

[18] Y. S. Chong and Y. H. Tay. Abnormal Event Detection in Videos Using Spatiotemporal Autoencoder. Advances in Neural Networks - ISNN 2017. Ed. by F. Cong, A. Leung and Q. Wei. Springer International Publishing, 2017, 189–196. ISBN: 978-3-319-59081-3.

[19] Cisco Annual Internet Report (2018–2023). URL: https://www.cisco.com/c/en/us/solutions/collateral/executive-perspectives/annual-internet-report/white-paper-c11-741490.html.

[20] COCO - Common Objects in Context (Detection Leaderboard). 2021. URL: https://cocodataset.org/#detection-leaderboard (visited on 2021).

[21] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. Proceedings of the IEEE conference on computer vision and pattern recognition. 2016, 3213–3223.

[22] G. Danuser. Computer vision in cell biology. Cell 147.5 (2011), 973–978.

[23] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. 2009 IEEE conference on computer vision and pattern recognition. IEEE. 2009, 248–255.

[24] P. Dollar, C. Wojek, B. Schiele and P. Perona. Pedestrian detection: An evaluation of the state of the art. IEEE transactions on pattern analysis and machine intelligence 34.4 (2011), 743–761.

[25] L.-Y. Duan, J. Liu, W. Yang, T. Huang and W. Gao. Video Coding for Machines: A Paradigm of Collaborative Compression and Intelligent Analytics. 2020. DOI: 10.1109/TIP.2020.3016485.

[26] R. O. Duda, P. E. Hart and D. G. Stork. Pattern classification. John Wiley & Sons, 2012.

[27] K. Fischer, F. Brand, C. Herglotz and A. Kaup. Video Coding for Machines with Feature-Based Rate-Distortion Optimization. 2020 IEEE 22nd International Workshop on Multimedia Signal Processing (MMSP). Sept. 2020, 1–6. DOI: 10.1109/MMSP48831.2020.9287136.

[28] M. Fowler. UML distilled: a brief guide to the standard object modeling language. Addison-Wesley Professional, 2004.

[29] Z. Ghahramani. Unsupervised learning. Summer School on Machine Learning. Springer. 2003, 72–112.

[30] R. Girshick. Fast R-CNN. Proceedings of the IEEE international conference on computer vision. 2015, 1440–1448.

[31] R. Girshick, J. Donahue, T. Darrell and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. Proceedings of the IEEE conference on computer vision and pattern recognition. 2014, 580–587.

[32] X. Glorot, A. Bordes and Y. Bengio. Deep sparse rectifier neural networks. Proceedings of the fourteenth international conference on artificial intelligence and statistics. JMLR Workshop and Conference Proceedings. 2011, 315–323.

[33] T. Gong, T. Lee, C. Stephenson, V. Renduchintala, S. Padhy, A. Ndirango, G. Keskin and O. H. Elibol. A comparison of loss weighting strategies for multi task learning in deep neural networks. IEEE Access 7 (2019), 141627–141632.

[34] I. Goodfellow, Y. Bengio and A. Courville. Deep Learning. http://www.deeplearningbook.org. MIT Press, 2016.

[35] V. K. Goyal. Theoretical foundations of transform coding. IEEE Signal Processing Magazine 18.5 (2001), 9–21.

[36] S. R. Granter, A. H. Beck and D. J. Papke Jr. AlphaGo, deep learning, and the future of the human microscopist. Archives of pathology & laboratory medicine 141.5 (2017), 619–621.

[37] Y. Guo, H. Shi, A. Kumar, K. Grauman, T. Rosing and R. Feris. SpotTune: Transfer Learning through Adaptive Fine-tuning. 2018. arXiv: 1811.08737 [cs.CV].

[38] P. C. Gupta. Data communications and computer networks. PHI Learning Pvt. Ltd., 2013.

[39] K. He, G. Gkioxari, P. Dollár and R. Girshick. Mask R-CNN. Oct. 2017. DOI: 10.1109/ICCV.2017.322.

[40] K. He, X. Zhang, S. Ren and J. Sun. Deep Residual Learning for Image Recognition. ISSN: 1063-6919. June 2016. DOI: 10.1109/CVPR.2016.90.

[41] W. Hu, N. Xie, L. Li, X. Zeng and S. Maybank. A Survey on Visual Content-Based Video Indexing and Retrieval. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 41.6 (2011), 797–819. DOI: 10.1109/TSMCC.2011.2109710.

[42] P. Johnston, E. Elyan and C. Jayne. Spatial effects of video compression on classification in convolutional neural networks. 2018 International Joint Conference on Neural Networks (IJCNN). IEEE. 2018, 1–8.

[43] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).

[44] A. Krizhevsky, I. Sutskever and G. E. Hinton. ImageNet classification with deep convolutional neural networks. Advances in neural information processing systems 25 (2012), 1097–1105.

[45] S. Kullback. Information theory and statistics. Courier Corporation, 1997.

[46] S. N. Kumar, M. V. Bharadwaj and S. Subbarayappa. Performance Comparison of JPEG, JPEG XT, JPEG LS, JPEG 2000, JPEG XR, HEVC, EVC and VVC for Images. 2021 6th International Conference for Convergence in Technology (I2CT). IEEE. 2021, 1–8.

[47] S. K. Kumar. On weight initialization in deep neural networks. arXiv preprint arXiv:1704.08863 (2017).

[48] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, A. Kolesnikov et al. The open images dataset v4. International Journal of Computer Vision (2020), 1–26.

[49] G. G. Langdon. An introduction to arithmetic coding. IBM Journal of Research and Development 28.2 (1984), 135–149.

[50] N. Le, H. Zhang, F. Cricri, R. Ghaznavi-Youvalari and E. Rahtu. Image Coding for Machines: An End-to-End Learned Approach. (in press). 2021.

[51] W.-C. Lee and H.-M. Hang. A hybrid image codec with learned residual coding. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 2020, 138–139.

[52] P. Lennie. The cost of cortical computation. Current biology 13.6 (2003), 493–497.

[53] F.-F. Li, A. Karpathy and J. Johnson. CS231n: Convolutional neural networks for visual recognition 2016. 2016. URL: https://cs231n.github.io/classification/.

[54] J. Liang, C. Tu and T. D. Tran. Optimal pre- and post-processing for JPEG2000 tiling artifact removal. Conference on Information Sciences and Systems, The Johns Hopkins University, March 2003.

[55] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick and P. Dollár. Microsoft COCO: Common Objects in Context. arXiv:1405.0312 [cs] (Feb. 20, 2015). URL: http://arxiv.org/abs/1405.0312 (visited on 10/10/2020).

[56] H. Liu, M. Lu, Z. Ma, F. Wang, Z. Xie, X. Cao and Y. Wang. Neural Video Coding using Multiscale Motion Compensation and Spatiotemporal Context Model. arXiv:2007.04574 [cs, eess] (July 2020). URL: http://arxiv.org/abs/2007.04574.

[57] A. V. Lotov and K. Miettinen. Visualizing the Pareto frontier. Multiobjective optimization. Springer, 2008, 213–243.

[58] S. Ma, X. Zhang, C. Jia, Z. Zhao, S. Wang and S. Wang. Image and Video Compression With Neural Networks: A Review. IEEE Transactions on Circuits and Systems for Video Technology 30.6 (June 2020), 1683–1698. ISSN: 1558-2205. DOI: 10.1109/tcsvt.2019.2910119. URL: http://dx.doi.org/10.1109/TCSVT.2019.2910119.

[59] A. L. Maas, A. Y. Hannun, A. Y. Ng et al. Rectifier nonlinearities improve neural network acoustic models. Proc. ICML. Vol. 30. 1. Citeseer. 2013, 3.

[60] M. W. Marcellin, M. J. Gormish, A. Bilgin and M. P. Boliek. An overview of JPEG-2000. Proceedings DCC 2000. Data Compression Conference. IEEE. 2000, 523–541.

[61] D. Marpe, H. Schwarz and T. Wiegand. Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard. IEEE Transactions on circuits and systems for video technology 13.7 (2003), 620–636.

[62] D. Minnen, J. Ballé and G. D. Toderici. Joint Autoregressive and Hierarchical Priors for Learned Image Compression. 2018. URL: http://papers.nips.cc/paper/8275-joint-autoregressive-and-hierarchical-priors-for-learned-image-compression.pdf.

[63] G. Montavon, W. Samek and K.-R. Müller. Methods for interpreting and understanding deep neural networks. Digital Signal Processing 73 (2018), 1–15. ISSN: 1051-2004. DOI: https://doi.org/10.1016/j.dsp.2017.10.011.

[64] K. P. Murphy. Machine learning: a probabilistic perspective. MIT press, 2012.

[65] A. A. Nashat and N. H. Hassan. Image compression based upon wavelet transform and a statistical threshold. 2016 International Conference on Optoelectronics and Image Processing (ICOIP). IEEE. 2016, 20–24.

[66] A. Neubauer, J. Freudenberger and V. Kuhn. Coding theory: algorithms, architectures and applications. John Wiley & Sons, 2007.

[67] F. Pakdaman, M. A. Adelimanesh, M. Gabbouj and M. R. Hashemi. Complexity analysis of next-generation VVC encoding and decoding. 2020 IEEE International Conference on Image Processing (ICIP). IEEE. 2020, 3134–3138.

[68] O. M. Parkhi, A. Vedaldi, A. Zisserman and C. Jawahar. Cats and dogs. 2012 IEEE conference on computer vision and pattern recognition. IEEE. 2012, 3498–3505.

[69] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai and S. Chintala. PyTorch: An Imperative Style, High-Performance Deep Learning Library. Advances in Neural Information Processing Systems 32. Ed. by H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox and R. Garnett. Curran Associates, Inc., 2019, 8024–8035. URL: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.

[70] N. Patwa, N. Ahuja, S. Somayazulu, O. Tickoo, S. Varadarajan and S. Koolagudi. Semantic-Preserving Image Compression. 2020 IEEE International Conference on Image Processing (ICIP). Oct. 2020, 1281–1285. DOI: 10.1109/ICIP40778.2020.9191247.

[71] W. B. Pennebaker and J. L. Mitchell. JPEG: Still image data compression standard. Springer Science & Business Media, 1992.

[72] A. Puri, X. Chen and A. Luthra. Video coding using the H.264/MPEG-4 AVC compression standard. Signal processing: Image communication 19.9 (2004), 793–849.

[73] S. Ren, K. He, R. Girshick and J. Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. Advances in Neural Information Processing Systems 28 (2015), 91–99. (Visited on 10/19/2020).

[74] F. M. Reza. An introduction to information theory. Courier Corporation, 1994.

[75] S. Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747 (2016).

[76] J. Schmidhuber. Deep learning in neural networks: An overview. Neural networks 61 (2015), 85–117.

[77] C. E. Shannon. A mathematical theory of communication. The Bell system technical journal 27.3 (1948), 379–423.

[78] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert and Z. Wang. Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network. 2016. arXiv: 1609.05158 [cs.CV].

[79] R. Simhambhatla, K. Okiah, S. Kuchkula and R. Slater. Self-driving cars: Evaluation of deep learning techniques for object detection in different driving conditions. SMU Data Science Review 2.1 (2019), 23.

[80] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).

[81] M. Sonka, V. Hlavac and R. Boyle. Image processing, analysis, and machine vision. Cengage Learning, 2014.

[82] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15.1 (2014), 1929–1958.

[83] G. J. Sullivan, J.-R. Ohm, W.-J. Han and T. Wiegand. Overview of the high efficiency video coding (HEVC) standard. IEEE Transactions on circuits and systems for video technology 22.12 (2012), 1649–1668.

[84] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke and A. Rabinovich. Going deeper with convolutions. Proceedings of the IEEE conference on computer vision and pattern recognition. 2015, 1–9.

[85] R. Szeliski. Computer vision: algorithms and applications. Springer Science & Business Media, 2010.

[86] T. K. Tan, R. Weerakkody, M. Mrak, N. Ramzan, V. Baroncini, J.-R. Ohm and G. J. Sullivan. Video quality evaluation methodology and verification testing of HEVC compression performance. IEEE Transactions on Circuits and Systems for Video Technology 26.1 (2015), 76–90.

[87] torchvision — Torchvision master documentation. 2021. URL: https://pytorch.org/vision/stable/ (visited on 2021).

[88] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko and T. Darrell. Deep Domain Confusion: Maximizing for Domain Invariance. 2014. arXiv: 1412.3474 [cs.CV].

[89] J. R. Uijlings, K. E. Van De Sande, T. Gevers and A. W. Smeulders. Selective search for object recognition. International journal of computer vision 104.2 (2013), 154–171.

[90] J. Vanne, M. Viitanen, T. D. Hamalainen and A. Hallapuro. Comparative rate-distortion-complexity analysis of HEVC and AVC video codecs. IEEE Transactions on Circuits and Systems for Video Technology 22.12 (2012), 1885–1898.

[91] Z. Wang, A. C. Bovik and L. Lu. Why is image quality assessment so difficult? 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing. Vol. 4. IEEE. 2002, IV–3313.

[92] Z. Wang, A. C. Bovik, H. R. Sheikh and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13.4 (2004), 600–612.

[93] A. G. Weber. The USC-SIPI image database version 5. USC-SIPI Report 315.1 (1997).

[94] G. Wilson and D. J. Cook. A Survey of Unsupervised Deep Domain Adaptation. 2020. arXiv: 1812.02849 [cs.LG].

[95] Y. Xing, M. Kaaniche, B. Pesquet-Popescu and F. Dufaux. Digital holographic data representation and compression. Academic Press, 2015.

[96] R. Yang, F. Mentzer, L. Van Gool and R. Timofte. Learning for Video Compression with Hierarchical Quality and Recurrent Enhancement. arXiv:2003.01966 [cs, eess] (Apr. 2020). URL: http://arxiv.org/abs/2003.01966.

[97] R. Yang, F. Mentzer, L. Van Gool and R. Timofte. Learning for Video Compression with Recurrent Auto-Encoder and Recurrent Probability Model. arXiv:2006.13560 [cs, eess] (June 2020). URL: http://arxiv.org/abs/2006.13560.

[98] Z. Yang, Y. Wang, C. Xu, P. Du, C. Xu, C. Xu and Q. Tian. Discernible Image Compression. Proceedings of the 28th ACM International Conference on Multimedia. 2020, 1561–1569.

[99] H. Zhang, F. Cricri, H. R. Tavakoli, N. Zou, E. Aksu and M. M. Hannuksela. Lossless Image Compression Using a Multi-Scale Progressive Statistical Model. Proceedings of the Asian Conference on Computer Vision (ACCV). Nov. 2020.

[100] Y. Zhang, M. Rafie and S. Liu. Use cases and requirements for video coding for machines. ISO/IEC JTC1/SC 29/WG 2 (Jan. 2021).

[101] J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Transactions on information theory 23.3 (1977), 337–343.