
ENHANCING IMAGE CODING FOR MACHINES WITH COMPRESSED FEATURE RESIDUALS

Faculty of Information Technology and Communication Sciences
Master's Thesis
July 2021

ABSTRACT

Joni Seppälä: Enhancing Image Coding for Machines with Compressed Feature Residuals
Master's Thesis
Tampere University
Master's Programme in Information Technology
July 2021

Media data, such as images and videos, are produced and consumed in enormous quantities on a daily basis. Their volume has increased by several orders of magnitude over the years, and this trend can be expected to continue in the future. However, the storage and transmission of the media consume resources that are costly and almost always limited.

Traditionally, video and image data have been produced for human consumption. Therefore, conventional image and video codecs have been optimized to save as much storage space as possible through compression while minimizing the human-perceived deterioration of the data. As computer vision based technologies have become increasingly common over the last decade, a new paradigm has emerged, in which the data is compressed to be used primarily by machines.

The codecs designed for machine consumption commonly undertake more aggressive compression than the traditional codecs, saving more storage space. Usually, the employed metric of image quality is the computer vision task performance – in which case the perceptual quality of the images may be ignored, leading to a decline in human-perceivable quality. Although machines are the primary consumers in many use cases, human involvement is often convenient or even mandatory. For example, in some expert systems, the machines must provide visualizations to justify the generated results for human observers.

This thesis proposes a novel image coding technique targeted for machines while maintaining the capability for human consumption. The proposed codec generates two bitstreams: one bitstream from a traditional codec, referred to as the human bitstream, optimized for human consumption; the other bitstream, referred to as the machine bitstream, generated from an end-to-end learned neural network-based codec and optimized for machine tasks. Instead of working on the image domain, the proposed machine bitstream is derived from feature residuals – the differences between the features extracted from the input image and the features extracted from the reconstructed image generated by the traditional codec.

With the help of the machine bitstream, the machine task performance was significantly improved in the low bitrate range. The proposed system beats the state-of-the-art traditional codec, Versatile Video Coding (VVC/H.266), achieving a −40.5% average Bjøntegaard delta bitrate reduction for bitrates up to 0.07 BPP in the object detection task on the Cityscapes dataset.

Keywords: image coding for machines, image compression, residual coding, deep learning, autoencoder, computer vision, loss weighting

The originality of this thesis has been checked using the Turnitin OriginalityCheck service.

TIIVISTELMÄ

Joni Seppälä: Enhancing Image Coding for Machines with Compressed Feature Residuals
Master's Thesis
Tampere University
Master's Programme in Information Technology
July 2021

Media data, such as images and videos, are produced and consumed in enormous quantities every day. The volume has grown by several orders of magnitude over the years, and this is expected to continue in the future. On the other hand, storing and transmitting media consumes resources that are valuable and almost always limited.

Traditionally, video and image data have been produced mainly for human consumption. Established image and video codecs have therefore been optimized to save as much digital storage space, such as disk memory, as possible while keeping the degradation of image quality caused by compression as imperceptible to humans as possible. However, technologies related to computer vision have become considerably more common over the last decades. As a result, a new paradigm has emerged in the field of image and video coding, in which image data is compressed primarily for use by machines. Codecs targeted at machines differ from traditional codecs, among other things, in that they often aim to compress the data more aggressively, saving more digital storage space. The quality metric is most often the performance of computer vision tasks, in which case the quality of the images may be allowed to degrade from the human perspective. In many cases, however, codecs targeted at machines must also produce images fit for human use. This applies, for example, to some expert systems, in which the machines must, in addition to the generated result, be able to produce a human-understandable visualization against which the correctness of the result can be inspected.

This thesis presents a novel image codec for machines that also preserves high image quality for humans. The codec consists of two bitstreams, one optimized for humans and one for machines. The human-optimized bitstream is created with a traditional codec; the machine-optimized bitstream is generated with an end-to-end learned neural network-based codec that is optimized for computer vision tasks. More precisely, the machine-optimized bitstream consists of feature residuals: the differences between the features extracted from the input image and from the image reconstructed by the traditional codec.

With the help of the machine-optimized bitstream, the task performance of the presented system improved considerably at low bitrates. In the object detection task, the system beats the state-of-the-art traditional codec, Versatile Video Coding (VVC/H.266), achieving an average Bjøntegaard delta bitrate reduction of −40.5% on the Cityscapes dataset for bitrates below 0.07 bits per pixel (BPP).

Keywords: image coding for machines, image compression, residual coding, deep learning, autoencoder, computer vision, loss weighting

The originality of this thesis has been checked using the Turnitin OriginalityCheck service.

PREFACE

The work presented in this thesis has been conducted at Nokia Technologies as a completion of the Master's degree programme in Information Technology at Tampere University. It is also based on a submission to the ACMM 2021 conference with the same title.

I would like to express my profound gratitude to all my teachers, supervisors and colleagues who helped me reach this point. In particular, I would like to thank my supervisor from Tampere University, Esa Rahtu, for overseeing the work, giving helpful feedback and maintaining my optimism, despite the demanding schedule I had set for the thesis.

My most sincere gratitude goes to all my colleagues at Nokia. Especially, I would like to thank my colleagues Honglei Zhang, Nam Le, Ramin Ghaznavi-Youvalari, Francesco Cricri and Hamed Rezazadegan Tavakoli for contributing to the conference paper on which this thesis is based. Miska Hannuksela is to be credited for the invention of the original idea which was the starting point of this research. I would like to offer my special thanks to my supervisors at Nokia, Honglei and Francesco, whose support over the past year cannot be overstated. Given their immeasurable wisdom, helpfulness and positive attitude, I could not wish for better supervisors.

Last, but not least, I want to thank my friends and family for endlessly supporting and motivating me throughout my studies. Undeniably, their role has been invaluable in encouraging me to overcome any obstacles and achieve my dreams. As a final note, my brother, Tomi, stated in the preface of his thesis that it was of crucial importance for him that he bested me in the (friendly) race to the MSc. Although I was not able to graduate before him, I proudly think I was still able to outdo myself. It is left open for the reader to decide which one is more important.

Tampere, 5th July 2021 Joni Seppälä


CONTENTS

1 Introduction . . . 1

2 Theoretical background . . . 3

2.1 Information theory and data compression . . . 3

2.1.1 Lossless compression . . . 5

2.1.2 Lossy compression . . . 6

2.2 Artificial neural networks . . . 9

2.2.1 Definitions and terminology . . . 9

2.2.2 History of machine learning . . . 10

2.2.3 Artificial neuron . . . 11

2.2.4 Types of ANNs . . . 14

2.2.5 Generic supervised learning process . . . 14

2.2.6 Convolutional neural networks . . . 20

2.2.7 Faster R-CNN . . . 23

2.2.8 Autoencoders . . . 24

2.3 Computer vision tasks and metrics . . . 25

2.3.1 Image classification . . . 26

2.3.2 Object detection and instance segmentation . . . 28

2.4 Coding techniques on the image domain . . . 29

2.4.1 VVC/H.266 . . . 30

2.4.2 Residual coding . . . 32

2.4.3 Human as the end-user . . . 34

2.4.4 Machines as the end-users . . . 36

3 Proposed method . . . 40

3.1 System Architecture Overview . . . 40

3.2 Feature Residual Codec . . . 42

3.3 Task Network and Feature Extractor . . . 45

3.4 Training Strategy . . . 46

4 Experiments . . . 49

4.1 Experimental Setup . . . 49

4.1.1 Datasets . . . 49

4.1.2 Task network . . . 50

4.1.3 Evaluation method and baseline . . . 50

4.1.4 Loss weighting . . . 51

4.2 Experimental Results . . . 52

4.2.1 Rate-accuracy performance . . . 52

4.2.2 Balance between human stream and machine stream . . . 54


4.2.3 System BD-rate and front BD-rates . . . 55

4.2.4 Visualizations . . . 56

5 Discussion & Future Work . . . 60

6 Conclusions . . . 62

References . . . 63


LIST OF FIGURES

2.1 Surprisal of a message m with a probability of p(x) . . . 4

2.2 A fundamental compression system . . . 5

2.3 Huffman algorithm for binary sources, example . . . 6

2.4 Arithmetic encoding, simplified example . . . 7

2.5 Quantization over temporal and numerical axis, example . . . 8

2.6 Histograms before and after decorrelation, example . . . 8

2.7 Typical image artifact, example . . . 9

2.8 Artificial neuron model . . . 12

2.9 Tanh, ReLU and LReLU . . . 13

2.10 Generic ANN supervised learning process . . . 16

2.11 Gradient descent, example . . . 18

2.12 Different fits, example . . . 20

2.13 Natural image composes of recursive instances, example . . . 21

2.14 Typical CNN layer, with example . . . 22

2.15 Architecture of the Faster R-CNN . . . 24

2.16 Autoencoder structure . . . 25

2.17 Confusion matrix . . . 27

2.18 mAP calculation example . . . 29

2.19 VVC encoding architecture . . . 31

2.20 Extended VVC/H.266 coding pipe . . . 32

2.21 Residual coding on feature space . . . 34

2.22 Pareto front demonstration . . . 38

2.23 BD-rate computation illustration . . . 39

3.1 Architecture of the system . . . 41

3.2 Feature residual codec . . . 43

3.3 Feature residual codec hierarchy . . . 44

3.4 A conceptual task network . . . 45

3.5 Loss weighting strategy . . . 48

4.1 Rate-accuracy curve of the experiments that outperformed the baseline . . 53

4.2 Proportional mAP gain of the proposed system in comparison to VVC/H.266 . . . 54

4.3 Images and features derived at different stages . . . 57

4.4 Decoded features residuals by models on different bitrate targets . . . 58

4.5 Object detection predictions comparison . . . 59

4.6 Ground truth detections of an example image . . . 59


LIST OF TABLES

4.1 Settings of VVC/H.266 (human branch). The numbers in the Setting column denote the resolution and the quantization parameter (QP). . . . 51

4.2 Front range, front BD-rate of models trained with different settings and the system BD-rate. The front range is given as [lower boundary, upper boundary]. . . . 55

LIST OF SYMBOLS AND ABBREVIATIONS

AI artificial intelligence (intelligence demonstrated by machines)
ANN artificial neural network (model for computational processing that is based on connectionism)
AVC/H.264 Advanced Video Coding (video codec)
BD Bjøntegaard delta (metric, represents the average bit rate savings for the same video quality)
BP backpropagation (algorithm for training feedforward neural networks)
BPP bits per pixel (metric, number of bits of information stored per pixel of an image)
CABAC context-based adaptive binary arithmetic coding (form of entropy encoding)
CNN convolutional neural network (class of deep neural network, commonly applied for visual tasks)
COCO Common Objects in Context (image dataset)
CV computer vision (the study of how computers can understand visual data)
DCT discrete cosine transform (transformation technique in signal processing and data compression)
DNN deep neural network (artificial neural network with more than 3 layers)
DST discrete sine transform (transformation technique in signal processing and data compression)
DWT discrete wavelet transform (transformation technique in signal processing and data compression)
fc fully connected layer (layer that connects every neuron to every neuron in another layer)
FCN fully convolutional network (neural network that only performs convolution operations)
FN false negative
FNN feed-forward neural network (artificial neural network in which the connections between nodes are acyclic)
FP false positive
FPN feature pyramid network (top-down architecture with lateral connections for building high-level semantic feature maps at all scales)
GAN generative adversarial network (class of machine learning frameworks in which two neural networks contest with each other in a zero-sum game)
GOP group of pictures (specifies the order in which intra- and inter-frames are arranged in video coding)
GPU graphics processing unit (special stream processor used in computer graphics hardware)
gt ground truth (information that is known to be true)
HEVC/H.265 High Efficiency Video Coding (video codec)
HVS human visual system (visual perception of humans)
IoU intersection over union (evaluation metric that can be used to measure the accuracy of an object detector)
ITU International Telecommunication Union (specialized agency of the United Nations responsible for information and communication technologies)
JPEG Joint Photographic Experts Group (image codec)
JVET Joint Video Experts Team (joint expert team)
LZ Lempel-Ziv (data compression algorithm)
MAE mean absolute error
mAP mean average precision (metric, used to measure the performance on the object detection task)
ML machine learning (application of artificial intelligence in which systems automatically learn and improve from experience without being explicitly programmed)
MLP multi-layer perceptron (class of feedforward artificial neural network)
MOS mean opinion score (subjective vision metric)
MPEG Moving Picture Experts Group (working group developing media coding standards)
MSE mean squared error
NAS neural architecture search (technique for automating the design of artificial neural networks)
NN neural network (model for computational processing that is based on connectionism; the term does not disambiguate between biological and artificial neural networks)
PNG Portable Network Graphics (image codec)
QP quantization parameter (a parameter controlling how aggressively data is quantized)
RDO rate-distortion optimization (method of improving video quality in video compression)
ReLU rectified linear unit (activation function)
ResNet residual neural network (artificial neural network with skip connections)
RGB red, green, blue (color model)
RNN recurrent neural network (class of artificial neural networks with feedback connections)
ROI region of interest (selected subset of samples within data identified for a particular purpose)
RPN region proposal network (region proposal component of Faster R-CNN)
SGD stochastic gradient descent (iterative method for optimizing an objective function)
SVM support vector machine (supervised learning model in machine learning)
TN true negative
TP true positive
UML Unified Modelling Language (general-purpose modeling language in software engineering)
USC-SIPI University of Southern California's Signal and Image Processing Institute (image dataset)
VTM VVC test model software (reference software codebase)
VVC/H.266 Versatile Video Coding (video codec)
XOR exclusive or (logical operation)
YUV color model consisting of luma and two chrominance components

1 INTRODUCTION

In a world of an ever-increasing volume of media, such as audio, images, and video, the storage and transmission of data are growing in magnitude. Storage capacity and bandwidth are resources that are limited and costly, which is why the topic of compressing media has been widely popular for decades as a way to overcome these constraints.

Compression techniques are divided into lossless and lossy compression – lossless compression allows for perfect reconstruction of the original data; lossy compression introduces distortion but also improves the achieved compression rate substantially as a trade-off. Typically, when considering compression, the developed systems have focused primarily on the consumption of the media by people – lossy image and video compression standards such as JPEG [71], AVC/H.264 [72], HEVC/H.265 [83] and VVC/H.266 [13] focus on improving compression with minimal impact on human vision based metrics such as the peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM) [58].

Historically, these human-targeted codecs have been used to compress image and video data used by systems (from now on referred to as machines) that perform computer vision tasks. However, this approach faces some problems. Most importantly, machine vision differs from the human visual system. Lossy codecs discard perceptually insignificant data; while human consumers would not notice a difference, the performance of the machines could decline. If the machines were trained with image or video data compressed with different settings than the testing (inference time) data, sub-par performance could be expected [42, 98]. Additionally, if the consumers of the data are solely machines, then the image and video data likely contain unnecessary information, inflating the overall bitrate and missing out on potential compression gains.

As computer vision techniques have gained tremendous success over the last decades, machines, such as neural networks, are increasingly used as the consumers of the media, and as a consequence, video and image compression for machines has emerged recently [15, 16, 25, 50, 100]. In many applications, although machines are the primary consumers of the media, human involvement is still desirable or even mandatory. For example, this applies to abnormal event detection in videos [18]. In other applications, such as content-based multimedia retrieval [41], machines and humans work together to accomplish a specific task.

The main goal of this thesis is to provide a new image coding technique that targets machine consumption primarily but can also be used for human observation. The goal is to surpass traditional image codecs' performance in computer vision applications (later on referred to as machine tasks) for a given bitrate while retaining the possibility for human consumption (consumption in this context is defined as viewing the image). Other goals are implementing the proposed technique into an end-to-end system, comparing it to the state-of-the-art video codec VVC/H.266 operating in all-intra mode, and reviewing the relevant theory.

The image and video codecs designated for machine consumption differ from the traditional codecs – instead of human vision based metrics, these typically use metrics more suitable for machine tasks. In this thesis, the proposed system is evaluated on the object detection task. Therefore, the objective of the system is to achieve the highest possible performance on this task while minimizing the bitrate. More specifically, the system should achieve a high mean average precision (mAP) over a range of bitrates, measured in bits per pixel (BPP).

The ultimate goal of image and video coding for machines is to achieve a higher compression rate with better performance than traditional codecs, without knowing the actual downstream tasks, replacing the traditional codecs. However, the proposed system and the recent literature on the topic focus on exploring different techniques and paradigms while relying on the consumer machine performing a specific task. Generality is considered an essential trait of the system, but complete task-agnosticism is left open for future studies.

The rest of the thesis is organized as follows: In chapter 2, the theoretical background of various relevant fields is studied. Then, the proposed method and the implemented system are discussed in chapter 3. After that, the experimental setup and achieved results are displayed in chapter 4. Further discussion and considerations on future extensions to this thesis are examined in chapter 5. Finally, the proposed method, the outcome of the experiments and other relevant aspects are reviewed in chapter 6.

2 THEORETICAL BACKGROUND

In this chapter, theoretical background from fields relevant to the proposed system is studied. The theory is by no means exhaustive – rather, it attempts to introduce the necessary basics and to give more detail on the most important concepts. The chapter is divided into four sections. In the first section, 2.1, information theory and the closely related data compression are surveyed. These topics form the theoretical primitives for the proposed system. Then, a large part of the study is focused on section 2.2, where artificial neural networks are described, given their essential role in the proposed system. After that, a few computer vision tasks and metrics are introduced in section 2.3. Finally, after the fundamental basics have been introduced, relevant image codecs and image metrics are presented, with particular emphasis on the residual coding paradigm and the differences between systems with distinct end-users, humans and machines. It should be noted that here the term "machine" is used loosely to refer to any machine learning based system, such as systems composed of artificial neural networks, and in the context of this thesis these systems are usually utilized to solve computer vision tasks.

2.1 Information theory and data compression

Already established in the 1900s, information theory is a theory that studies the quantification, storage and transmission of information in a digital form [1]. It is a branch of probability theory and statistics [45]. Information theory set forth a notable paradigm in which the transmission of signals from different domains, such as text, audio and images, could be thought of as the same thing – the transmission of bits [1].

A bit is the most widely used measurement unit in information theory [74]. One bit of information allows for a selection over two equiprobable alternatives. In the literature, bitrate is usually used to denote the number of bits that are transmitted or processed per unit of time [38]. However, in this thesis bitrate means the volume of used bits per volume of medium. As an example of the given definition, for image data, the bitrate is measured as bits (used) per pixel (BPP).

Another important measure in information theory is self-information or surprisal [34].

When using bits as the measure, the surprisal of a message m is defined as

I(m) = -\log_2 p(m),  (2.1)

where p(m) is defined as the probability of the message. The surprisal is plotted as a function of the probability of a message in Figure 2.1. Informally, surprisal measures how surprising a given message is. Given that the probability distribution of the messages is known, the most surprising message is the most unlikely one. For a simple example, let p(0) = 0.9, p(1) = 0.09 and p(2) = 0.01. In this case, a message 2122 is very surprising as it is highly unlikely to happen, whereas 0000 is the least surprising message possible since it has the highest probability of occurring. Surprisal is also the optimal code length for any given symbol.

Figure 2.1. Surprisal of a message m with a probability of p(x). The more probable a message is, the less surprising it is.
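As a small illustration of equation 2.1, the surprisal of the example symbols and messages above can be computed directly from the definition. The probabilities below are the hypothetical values used in the example, not data from this thesis.

    import math

    # Hypothetical symbol probabilities from the example above.
    p = {"0": 0.9, "1": 0.09, "2": 0.01}

    def surprisal(symbol):
        """Surprisal I(m) = -log2 p(m) in bits (equation 2.1)."""
        return -math.log2(p[symbol])

    for s in p:
        print(f"I({s}) = {surprisal(s):.2f} bits")

    # With independent symbols, a message's surprisal is the sum of its symbols' surprisals.
    print(sum(surprisal(s) for s in "2122"))  # about 23.4 bits: a very surprising message
    print(sum(surprisal(s) for s in "0000"))  # about 0.61 bits: the least surprising message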

A central concept in information theory is Shannon entropy, or entropy in short. It was introduced by Claude Shannon in his paper A Mathematical Theory of Communication [77] in 1948, which is regarded as the most influential paper on information theory of all time [1, 45, 66, 74]. For a discrete random variable X, with possible outcomes x_1, ..., x_n which occur with probabilities p(x_1), ..., p(x_n), the entropy H in bits is defined as

H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i),  (2.2)

meaning it is the surprisal multiplied by the probability of each element in X. The entropy sets the lower limit on how much a message can be compressed without losing information, i.e., with lossless compression [77]. Finding this result had massive implications, as research arose to design systems that attempted to reach this theoretical optimum.

Another important measure is cross-entropy. It is defined as a measure of the disparity between two probability distributions, p and q, for a given random variable. It is also the average number of bits needed to encode source data with distribution p when using distribution q [64]. It is closely related to the relative entropy of p with respect to q. If the variables are discrete and stem from the same set X, the cross-entropy can be given as

H(p, q) = -\sum_{i=1}^{n} p(x_i) \log_2 q(x_i),  (2.3)

taking a similar form to the general entropy in equation 2.2.
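The following sketch evaluates equations 2.2 and 2.3 for the earlier example distribution; the second distribution q is a hypothetical uniform model added here purely for illustration.

    import math

    def entropy(p):
        """Shannon entropy H(p) in bits (equation 2.2)."""
        return -sum(pi * math.log2(pi) for pi in p if pi > 0)

    def cross_entropy(p, q):
        """Cross-entropy H(p, q) in bits (equation 2.3)."""
        return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

    p = [0.9, 0.09, 0.01]        # source distribution from the earlier example
    q = [1 / 3, 1 / 3, 1 / 3]    # hypothetical model that assumes uniform symbols

    print(entropy(p))            # about 0.52 bits per symbol: the lossless coding limit
    print(cross_entropy(p, q))   # about 1.58 bits per symbol: the cost of the wrong model
    print(cross_entropy(p, p))   # equals H(p): the model matches the source exactly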

Coding theory, a major application of information theory, is a study that focuses on finding efficient code representations for messages. It is divided into source coding, channel coding, cryptographic coding and line coding, of which source coding, or data compression, is studied in more detail in this thesis.

To define data compression, one must first define data. For the purposes of this thesis, data is defined to consist of elements that can represent bits – making it a quantitative entity. Data is something that can be stored and transmitted. Data may be design data – data curated for direct use, such as research purposes – or organic data – side-product data that is formed during the normal operation of information systems [9]. While design data is more useful for its intended purpose, organic data usually forms the majority of all available data. Data compression is defined as the study that tries to find ways of minimizing the bitrate of data. This can be done by reducing statistical redundancy – the repetition of elements – of the data. A fundamental compression system is illustrated in Figure 2.2. A system that compresses the data into a bitstream is called an encoder, while a system that reverses this process for the compressed data is called a decoder. After reconstruction, the input data, in this case x, is marked with a hat, i.e., x̂, to indicate that the data is reconstructed and not original. A system that implements the encoder or the decoder is called a codec.

Figure 2.2. A fundamental compression system.

Compression techniques fall loosely into two groups: lossless and lossy compression methods. These are discussed next.

2.1.1 Lossless compression

Lossless compression techniques attempt to minimize bitrate while retaining perfect quality of the data, i.e., x̂ = x. There are numerous methods that achieve lossless compression by exploiting data redundancy. One of the simplest compression techniques is run-length coding. If an element in the data exists multiple times in a row, the run-length encoding registers the element and the number of repeats. For example, a string of "AAAABBBBBC" could then be encoded as "A4B5C", halving the total amount of symbols used. Another group of techniques is called dictionary methods. The Lempel-Ziv (LZ) [101] algorithms break the data into source-length pairs which can efficiently represent the already seen (redundant) occurrences of data and encode them with pointers to the positions of past observations. Repetitive patterns are encoded as blocks instead of single symbols.
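A minimal sketch of the run-length idea applied to the example string; note that this simple variant always writes a count, so a lone symbol becomes "C1" rather than "C" as in the example above.

    from itertools import groupby

    def run_length_encode(data):
        """Encode runs of repeated symbols as <symbol><count>."""
        return "".join(f"{symbol}{len(list(run))}" for symbol, run in groupby(data))

    print(run_length_encode("AAAABBBBBC"))  # -> "A4B5C1"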

Entropy coding represents a widely popular category of lossless coding techniques which are based on the probabilities of the symbols in the given data. It is a straightforward scheme based on the entropy of the data. Huffman coding and arithmetic coding are two notable entropy coding techniques. The Huffman algorithm is illustrated in Figure 2.3. It assigns each symbol a code according to the symbol probabilities such that the average code length is minimized. This is done by arranging the symbol probabilities in increasing order. Then, a conceptual tree-like structure is formed by continuously merging the two smallest probabilities into a single node, until all nodes are merged, resulting in a single top node (root). After that, the routes traversing the tree downwards form the assigned codes (for binary trees, 0 or 1) for each symbol. As can be seen in the figure, the average length L_Huffman gets fairly close to the optimum (entropy, H) in this specific example.

Figure 2.3. Huffman algorithm for binary sources, example. In the top-left corner, symbols and their probabilities are given. In the bottom-left corner, the first step of the merging is provided. On the right side, the final result is given.
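A compact sketch of the Huffman construction described above, using a priority queue to repeatedly merge the two least probable nodes; the symbol probabilities are hypothetical and do not correspond to the figure.

    import heapq
    from itertools import count

    def huffman_code(probabilities):
        """Build a prefix code by repeatedly merging the two least probable nodes."""
        tie = count()  # unique tie-breaker so the heap never compares the code dicts
        heap = [(p, next(tie), {sym: ""}) for sym, p in probabilities.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            p1, _, codes1 = heapq.heappop(heap)
            p2, _, codes2 = heapq.heappop(heap)
            # Prepend one bit to every code in each merged subtree.
            merged = {s: "0" + c for s, c in codes1.items()}
            merged.update({s: "1" + c for s, c in codes2.items()})
            heapq.heappush(heap, (p1 + p2, next(tie), merged))
        return heap[0][2]

    probs = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}   # hypothetical source
    codes = huffman_code(probs)
    avg_len = sum(probs[s] * len(codes[s]) for s in probs)
    print(codes, avg_len)   # the average length is close to, but never below, the entropy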

Huffman codes are always within 1 bit of the optimal code lengths. However, in practical applications this is often not good enough. The knowledge of the symbol probabilities might improve, which would then prompt redesigning the Huffman table again. Furthermore, for extremely small probabilities, the entropy approaches zero, but the average Huffman code length can only approach 1 bit. Arithmetic coding was developed to address these shortcomings of Huffman coding. It is fast and reaches better compression ratios. In arithmetic coding the data is encoded into a single floating-point number q, where 0 ≤ q ≤ 1. A simplified view of the encoding process is presented in Figure 2.4. The coded number gets increasingly more precise with more symbols included in the code. The complete algorithm is introduced in more detail in [49]. An adaptive version of arithmetic coding alters the probability distribution of the symbols in each iteration. One such version is Context-based Adaptive Binary Arithmetic Coding (CABAC) [61].

Figure 2.4. Arithmetic encoding, simplified example. In the top-left corner, symbols and their probabilities are given. In the bottom-left corner, the boundaries for different symbols are assigned. On the right side, the word "bag" is given as an interval between 0.40895 and 0.41195.

2.1.2 Lossy compression

Whereas in lossless compression the data was to be perfectly reconstructed, in lossy compression this constraint is relaxed, i.e., x̂ ≈ x. With this restriction removed, it is usually possible to achieve compression ratios that are better by orders of magnitude. Since the quality of the data may be compromised, a trade-off between compression and quality arises: increasingly compressed representations of the data may be acquired at the cost of degrading the data more. This gives rise to an optimization problem.

The optimization problem is generally referred to as rate-distortion optimization (RDO). For a total objective to be minimized, J, it can be presented as

J = D + \lambda \cdot R,  (2.4)

where D is the distortion, R is the rate or the amount of bits assigned to the data, and λ is a Lagrange multiplier that determines the trade-off between the objectives. Based on this trade-off, reaching a low rate R requires sacrificing quality. Decreasing the value of λ drives the system to compress less with the benefit of better performance with relation to the distortion function. While rate and distortion are the principal objectives in most cases, low complexity and fast speed are usually additional objectives that are set for a compression system [90]. When considering the end-users, an RDO that is optimized for humans might also consider perception as a part of the RDO [8]. For example, the human auditory system cannot detect perceptually inaudible sounds. Therefore, a compression system can discard these sounds without affecting the perceptual quality [2].
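As a toy illustration of equation 2.4, the sketch below selects, from a set of hypothetical candidate encodings, the one that minimizes the Lagrangian cost J for two different values of λ; the rate and distortion numbers are made up for demonstration only.

    # Hypothetical candidate encodings of the same block: (rate in bits, distortion as MSE).
    candidates = [(1000, 80.0), (2500, 25.0), (6000, 6.0), (12000, 1.5)]

    def best_candidate(lmbda):
        """Pick the encoding minimizing the Lagrangian cost J = D + lambda * R (equation 2.4)."""
        return min(candidates, key=lambda c: c[1] + lmbda * c[0])

    print(best_candidate(0.0001))  # small lambda: rate is cheap, the highest-quality candidate wins
    print(best_candidate(0.01))    # large lambda: rate is expensive, a smaller bitstream wins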

The core method to lower the rate but increase the distortion is called quantization. In the most simple terms, it means mapping values from a source of data into another, smaller target data in, to name a few, the temporal, spatial, channel-wise, numerical or frequency-wise realm [81]. An example of temporal and numerical quantization by a factor of 2 is given in Figure 2.5. Not all of the terms used here are established in the literature. For example, spatial quantization is usually referred to as downsampling. Arguably, it is appealing to have different operations on different realms under the umbrella of a single term, since they essentially are about the same matter.
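A small numpy sketch of the factor-of-2 temporal and numerical quantization illustrated in Figure 2.5; the signal values are hypothetical.

    import numpy as np

    # A hypothetical discrete sequential signal.
    x = np.array([3, 5, 6, 9, 12, 14, 13, 10, 7, 4, 2, 1])

    step = 2
    temporally_quantized = x[::step]               # keep every 2nd sample
    numerically_quantized = (x // step) * step     # snap values to a coarser grid
    both = (temporally_quantized // step) * step   # apply both quantizations

    print(temporally_quantized)   # [ 3  6 12 13  7  2]
    print(numerically_quantized)  # [ 2  4  6  8 12 14 12 10  6  4  2  0]
    print(both)                   # [ 2  6 12 12  6  2]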

Figure 2.5. Quantization over the temporal and numerical axis, example. Left: Original example data. Right: Temporally and numerically quantized data.

Decorrelation is an important process in many compression systems. It compresses the energy of the signal (data) into a few significant transform coefficients, increasing its compressibility in entropy coding [95]. This is illustrated with an example histogram pair in Figure 2.6. The left image is the histogram of the original data, the right image is the histogram of the decorrelated data. As can be seen, the decorrelated data has most of its probability mass concentrated on a small range. This behaviour significantly increases the efficiency of a probabilistic entropy coder, which can now assign very short codes for the vast majority of the data.

Figure 2.6. Histograms of unspecified data before and after decorrelation, example. Left: Histogram of original data. Right: Histogram of decorrelated data. H_left > H_right.

Transform coding techniques, which transform data from its original domain to another domain to decorrelate the data, are common in many codecs [35]. The discrete cosine transform (DCT) is a widely used transform coding technique, featured in several codecs in various domains such as JPEG, VVC/H.266 (these are discussed in section 2.4) and MP3 [12]. It transforms the data into its frequency base, expressed with a sum of cosine functions. Contrary to the closely related Fourier transforms, which operate with complex numbers, the transform retains the data in the real number space. Another commonly used transform coding technique is the Discrete Wavelet Transform (DWT) [65], featured in JPEG-2000 [60]. The DWT exploits both spatial and frequency domain redundancies with transforms of a mother wavelet on the input data. It divides the information of the data into an approximation signal and auxiliary detail sub-signals, the count of which can be increased to enhance quality at the cost of reduced compression efficiency.
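The sketch below applies an orthonormal 2D DCT to a hypothetical smooth 8x8 block and shows that most of the signal energy concentrates in a few coefficients, which is the decorrelation property discussed above; the DCT matrix is built by hand so that only numpy is required.

    import numpy as np

    def dct_matrix(n):
        """Orthonormal DCT-II transform matrix of size n x n."""
        k = np.arange(n)[:, None]
        i = np.arange(n)[None, :]
        c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
        c[0, :] = np.sqrt(1.0 / n)
        return c

    n = 8
    C = dct_matrix(n)
    block = 128 + 30 * np.sin(np.arange(n * n).reshape(n, n) / 7.0)  # hypothetical smooth block

    coeffs = C @ block @ C.T          # 2D DCT: decorrelates the block
    reconstructed = C.T @ coeffs @ C  # inverse transform (C is orthonormal)

    # Most of the energy ends up in a handful of low-frequency coefficients.
    energy = coeffs ** 2
    print(np.sort(energy.ravel())[::-1][:5] / energy.sum())  # share of energy in the top-5 coefficients
    print(np.allclose(block, reconstructed))                 # True: the transform itself is lossless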

With lossy compression come distortions to the data. These are usually called artifacts. Different compression techniques generate a wide range of artifacts, though many are shared by multiple techniques. Perhaps the most commonly known artifact in visual codecs is tiling [54], or the block boundary artifact – degradation of the image data which makes it look like it consists of rectangles. An example of this artifact produced by the DCT with a very low quality setting is given in Figure 2.7. Although the compressed image in the figure is still fully recognizable, the artifacts are very apparent to a human observer.

Figure 2.7. Blockiness, example image "peppers". Left: Original image. Right: Highly compressed image. The image is part of the USC-SIPI image database [93].

2.2 Artificial neural networks

Artificial neural networks (ANN) are topical computational models inspired by biological neural networks in human brains. In this section, a thorough view of ANNs is given. First, the definitions and terminology are discussed in subsection 2.2.1. Although ANNs are the primary focus, a wider viewpoint is asserted subsequently in 2.2.2, which considers the history of machine learning (ML). ANNs consist of artificial neurons and the mechanisms in which they are interconnected, thus these building blocks are considered in subsections 2.2.3 and 2.2.4, respectively. Like any machine learning system, ANNs are systems that learn to perform a task without being explicitly programmed to solve the task. There exist three major learning paradigms for ANNs: supervised learning, unsupervised learning and reinforcement learning. Since the system proposed in this work uses the first, a supervised learning process is described in detail in subsection 2.2.5. Convolutional neural networks are the most common architecture type for ANNs dealing with visual data, and are also present in the proposed system. As such, they are discussed in subsection 2.2.6. Finally, as the preliminaries have been discussed, Faster R-CNN and autoencoders are of such crucial significance to this work that they deserve their own subsections, 2.2.7 and 2.2.8.

2.2.1 Definitions and terminology

The field of machine learning and especially the field of deep learning contain countless terms and buzzwords that have considerable similarities and overlap. Therefore, it is reasonable to define the used terms first. Machine learning is the study of computer algorithms that make a system learn with data to perform specific tasks automatically without being explicitly programmed. Artificial intelligence (AI) has a more relaxed definition – it is intelligence demonstrated by machines. Machine learning is a subset of AI.

Moving into more specific terminology, deep learning (DL) is a subset of machine learning, which studies artificial neural network based systems with multiple layers [76]. A congruent term with deep learning, deep neural networks (DNN) are deep artificial neural networks. The threshold is usually defined as at least 3 layers. Nowadays, most ANNs used in applications are DNNs; likewise, such is the case in the proposed system. Thus, the more common term ANN is used in this thesis without the distinction between these two terms.

ANNs consist of artificial neurons. Here, a distinction between an artificial neuron and the sometimes used perceptron is made. A perceptron is defined as a unit with weighted inputs that produces a binary output based on a threshold. An artificial neuron, however, is not restricted to a binary output. Another commonly used term in the literature is the multi-layer perceptron, which is sensibly defined as consisting of perceptrons in multiple layers. Yet, it is often used as a synonym for ANNs in the literature. In this thesis, the term ANN is used in each instance.

2.2.2 History of machine learning

Machine learning is a subset of artificial intelligence. Therefore, the history of machine learning also contains the history of artificial intelligence. Throughout time, there have been numerous different approaches to developing AI and ML algorithms and systems. These can be classified into two paradigms: symbolism and connectionism [9, 34]. The symbolic paradigm solves tasks with a top-down approach, having machines manage symbolic representations of the world with logical rules. Therefore, it is sometimes called feature engineering. Connectionism embraces a bottom-up approach, where intelligence is formed from interconnected computational units and a combination of these units forms functioning systems that are capable of solving tasks. Hence, the term feature learning is used as well.

Both paradigms have had their periods of success, "golden eras", and disinterest, "AI winters". After the formulation of the theory of computation in the 1940s, serious research started to pursue the development of machines that could display intelligent behaviour. In the 1950s, AI attracted large quantities of interest from the scientific community, but also from a wider audience. In the 1960s, serious research interest waned due to the lack of computational power and the poor performance of the developed systems, to a point where intelligence demonstrated by machines was mainly considered in science fiction only. [9]

The invention of multi-layer perceptrons in the 1980s led to a new spark in the interest towards the field, since these systems were able to demonstrate superior performance on multiple tasks that were previously unsolvable. A further increase in computational power, combined with improved algorithms and expanded volumes of data sets [34, 85], allowed deeper models to function in practice. These factors facilitated the rise of artificial neural network based systems in the 2000s. [9]

The early deep models to produce truly convincing results were convolutional neural networks (CNN). This was mostly due to CNNs being more computationally efficient than other types of ANNs, given that CNNs generally consist of only a fraction of the parameters of equally deep fully connected architectures. The rise of CNNs attracted further attention to the field of deep learning, allowing the interest in various other types of ANNs and techniques to blossom later on as well. [34] Nowadays, deep learning has become the most compelling area of research in machine learning.

The important role that commercial demand has played in the rise of machine learning cannot be overlooked. The rate of research in the field has been made possible mainly by economic interest: the deep belief within the funding companies that the investment in the research would pay itself back at some point. Likewise, the AI winters have occurred mainly for this exact reason – lack of demand. Notable AI/ML programs, such as IBM's Deep Blue and Google's AlphaGo, although just the tip of the iceberg, have had an important role in arousing public interest and increasing attention to the field [36]. The current phase of fast-paced developments in machine learning and especially in deep learning is a result of massive investments by giant technology companies, like Facebook, Google and Microsoft, that have concentrated large quantities of research and development budgets on the fields for over a decade now. Although the state-of-the-art systems have undoubtedly progressed for the better during this time, the research is usually conducted primarily to benefit the companies, blurring the line between academia and industry. [9]

2.2.3 Artificial neuron

Artificial neural networks consist of artificial neurons and connections between them.

These are typically stacked in neural layer (or layer for short) structures – structures with possibly thousands of artificial neurons in parallel. An artificial neuron is a computational model, loosely based on neurons in human brains.

Neurons in human brains are cells that communicate with other neurons by transmitting electrochemical signals. A typical neuron consists of a cell body, dendrites and an axon. Neurons receive signals with the dendrites and the cell body, and transmit signals with the axon. Neurons do not transmit signals most of the time – usually, only 1–4 % of them are active at a time [52]. Instead, neurons have electrical threshold potentials: if their voltage changes by a high enough amount over a short interval, the neuron fires an all-or-nothing (binary) electrochemical pulse, the so-called action potential.

Artificial neurons (from now on written only as neuron) have a few similarities to biological neurons, but are purely mathematical constructs. A model of a neuron is depicted in Figure 2.8. A neuron can have one or more inputs. These are multiplied, or weighted, with learnable parameters (or terms) called weights. A neuron also contains a bias term, which is a special weight that does not have an input assigned to it – it is transmitted as a constant. All weighted inputs are then summed together. The output of the summation is further transmitted to an activation function (transfer function in some literature). The used activation function is a design choice in an ANN system – these are explained in more detail at the end of this section. Finally, the output of the activation function is the output of the neuron. The total mathematical model for a neuron is thus computed as

y(X) = \varphi\left( \sum_{i=0}^{n} w_i x_i \right),  (2.5)

where x_i and w_i are the ith input parameter and weight, respectively, \varphi(\cdot) is the activation function and y(X) is the output for the input vector X. Although the output y is singular, it may be used as an input to multiple neurons.

Figure 2.8. Artificial neuron model. Data are written in italics; operations are written in bold.
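A direct numpy implementation of equation 2.5, with the bias handled as an explicit additive term and ReLU chosen as the activation; the input values and weights below are hypothetical.

    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    def neuron(inputs, weights, bias, activation=relu):
        """Single artificial neuron: y = phi(sum_i w_i * x_i + bias), equation 2.5."""
        return activation(np.dot(weights, inputs) + bias)

    x = np.array([0.5, -1.2, 3.0])   # hypothetical inputs
    w = np.array([0.8, 0.1, -0.4])   # hypothetical learned weights
    b = 0.2                          # bias: a weight with a constant input of 1

    print(neuron(x, w, b))           # 0.0: the weighted sum is negative, so ReLU clips it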

Considering a practical application, it is worth noting that per neuron, only the weight terms are saved into computer memory – other information is saved for larger units, e.g., activation function information is saved for each layer. Therefore, as the number of inputs to a neuron is 1:1 to the number of weights, the connectedness of the neurons is a major factor affecting a typical ANN model size. Weight terms are trainable, meaning that when the ANN model is being trained, these parameters are updated via backpropagation, which is covered in more detail in subsection 2.2.5.

Activation functions

The choice of activation function depends on the desired properties of the neuron and the ANN. The identity activation function, i.e. φ(x) = x, effectively does nothing, so the output is simply a linear combination of the weighted input parameters. However, the applied activation functions usually introduce a non-linearity, meaning that a direct mapping from input to output cannot be done without knowing the other input parameters. This property allows ANNs consisting of these neurons to form non-linear outputs, which in turn allows for computing nontrivial problems, such as XOR logic, which are impossible to solve with linear models [26, 34]. Next, three activation functions are introduced: tanh, ReLU and LReLU. They are visualized in Figure 2.9.

A hyperbolic tangent (tanh) function was initially the go-to choice for an activation function in deep learning. It is defined as

\varphi(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}.  (2.6)

Figure 2.9. Tanh, ReLU and LReLU. Left: Hyperbolic tangent (tanh). Middle: Rectified linear unit (ReLU). Right: Leaky rectified linear unit (LReLU) with negative slope c = 0.1.

The tanh function achieves non-linearity. However, ANNs that use tanh as the activation function suffer from the vanishing gradient problem when training with gradient-based learning methods and backpropagation, since the slope (Δy/Δx) is close to zero outside the proximity of the origin [34]. Nowadays, the rectified linear unit (ReLU) is considered the default choice in ANNs [34]. This transition occurred after the authors in [32] published results that demonstrated that models trained with ReLU showed superior performance. ReLU is a simplistic rectifier activation function, defined as

\varphi(x) = \max(0, x).  (2.7)

Since ReLU is very similar to linear functions, it maintains the properties that make linear models easy to optimize with gradient-based methods. Nonetheless, there are some downsides to the ReLU activation function – most notably, the "dying ReLU" problem, which refers to scenarios where a large number of neurons using the ReLU activation function only output values of y = 0. When most of the neurons return an output of zero, the gradients can no longer flow with the backpropagation method and the weights will not get updated. Eventually, a large portion of the ANN stops learning further. Additionally, as the slope of ReLU in the negative input range is also zero, the neuron will not recover. To solve this problem, a leaky rectified linear unit (LReLU) [59] can be used instead. It is defined as

\varphi(x) = \begin{cases} cx, & x < 0 \\ x, & x \geq 0, \end{cases}  (2.8)

where c is the slope term for the negative part of the function. c = 0.01 is a common value, but other values may be used as well. LReLU maintains the upsides of the ReLU activation function, but since the slope is not zero on the negative input range, neurons that are stuck on the negative output range can recover via subsequent weight updates.

Hence, in many cases it is chosen as the activation function in ANN systems.
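For reference, the three activation functions of equations 2.6 to 2.8 can be written in numpy as follows; the test inputs are arbitrary.

    import numpy as np

    def tanh(x):
        return np.tanh(x)                   # equation 2.6

    def relu(x):
        return np.maximum(0.0, x)           # equation 2.7

    def leaky_relu(x, c=0.01):
        return np.where(x < 0, c * x, x)    # equation 2.8

    x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
    print(tanh(x))        # saturates towards -1 and +1, small slope far from the origin
    print(relu(x))        # negative inputs are clipped to zero
    print(leaky_relu(x))  # negative inputs are scaled by c instead of being zeroed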

2.2.4 Types of ANNs

ANNs consist of neurons and connections between the neurons – therefore, ANNs are directed, weighted graphs. The way these connections are organized varies with different ANNs. Yet, the most popular way of organizing them is in layers. Individual layers may have specific properties. The terminology goes as follows: the first layer is called an input layer, the last layer is called an output layer and all layers between the input and the output layers are called hidden layers. These are usually numerous in state-of-the-art systems. The more hidden layers there are, the "deeper" the model is. [34] The starting inputs are external data, such as audio or images, or even the output of another ANN in the case of a system consisting of multiple ANNs in a pipeline. The final outputs can be any data, such as modified images, or a single number – the output may or may not have any similarity with the input data. Once obtained, the output is used to accomplish a predefined task.

There are numerous different approaches to implementing ANNs. The most essential classes of ANNs are feedforward neural networks (FNN) and recurrent neural networks (RNN). In FNNs, the connections between neurons are acyclic, meaning that the connections cannot go backwards; in RNNs, connections to the same or previous layers are allowed. In both cases, it is conventional that neurons in a layer l_i are connected to neurons in layer l_{i+1} (for RNNs, also l_i and l_{i-1}). Therefore, there exists a hierarchy within the layers. [34] Layers close to the input can be called initial layers; layers close to the output can be called final layers. This distinction, however, is arbitrary.

The connection patterns between two layers vary as well. Fully connected (fc) layers have a connection from every neuron in one layer to every neuron in another layer. Full connectedness means that every neuron in a layer can access information of every neuron in another layer, but this can get computationally costly, as there are N · M weights, where N and M are the numbers of neurons in the former and the latter layer. The connections may also be sparse, meaning that neurons in one layer form only few connections to neurons in the other layer. Convolutional neural networks (CNN) are notable for having sparse connections between layers – CNNs are covered in more detail in subsection 2.2.6. A pooling layer is a layer in which a larger body of neurons is mapped into a smaller one, reducing the number of neurons in a layer. [34]
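A quick back-of-the-envelope comparison of the N · M weight count of a fully connected layer against a small shared convolution kernel; the layer and channel sizes are hypothetical and chosen only to make the difference concrete.

    # Weights of a fully connected layer between two layers of N and M neurons.
    N, M = 1024, 512
    fc_weights = N * M              # 524288 weights (plus M bias terms)

    # By comparison, a 3x3 convolution with 64 input and 64 output channels
    # shares its kernel across all spatial positions.
    conv_weights = 3 * 3 * 64 * 64  # 36864 weights (plus 64 bias terms)

    print(fc_weights, conv_weights)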

2.2.5 Generic supervised learning process

Learning paradigms for ANNs are usually divided into supervised, unsupervised and reinforcement learning. The proposed system uses a supervised learning process, but all the paradigms are covered briefly.

In supervised learning the dataset is usually partitioned into non-overlapping training, validation and testing data. When training, the system is provided with desired input-output pairs of the training data, where the outputs are ground truths (gt). With the inputs, the system makes a prediction of the desired output. The error between the prediction and the ground truth is usually measured by a loss function (sometimes called a cost function), which is an essential part of the system – loss functions are discussed in more detail later in this subsection. The training-time performance of the system is measured by the performance on the validation data. The final performance of the system is ultimately tested with the testing data.

Unsupervised learning differs from supervised learning by the absence of the desired output (gt). The objective is usually to have the system self-organize the data in a desirable way; this is the basis for solving tasks such as clustering. Semi-supervised learning is a combination of supervised and unsupervised learning, where only a portion of the input data have a gt associated with them. [29]

Unlike supervised learning, reinforcement learning does not have input-output pairs, but a set of possible actions A in an environment E, where a number of states S have rewards R associated with them. The system (often called an agent in reinforcement learning) tries to find policies that perform optimal actions in a given environment to maximize the cumulative reward. [29]

The generic supervised training process for ANNs is presented in Figure 2.10. The process starts with the initialization of the weights of the connections of the neurons (i.e., for a single neuron, see Figure 2.8). There are many initialization schemes, most of which use randomization over some distribution [47]. Then, the actual iteration starts by feeding input data to the input layer of the ANN. The inputs are usually fed in batches, meaning the input data contains multiple instances, e.g., for image data multiple images belong to one batch. These are propagated through the ANN layers until the output layer is reached. In the output layer, a loss is calculated based on the prediction, the ground truth and the loss function. The loss is usually reported since it is an essential measure of the system's progress during the training.

After calculating the loss, the derivatives of the loss with respect to the weights of all neurons in the ANN are calculated. The differentiated losses are then backpropagated – backpropagation (BP) is considered in more detail later in this subsection. All weights are updated depending on the backpropagation and a learning rate, which determines how much the weights are updated – a higher learning rate results in greater updates. Usually, an optimizer adjusts the learning rate dynamically – this is also explained in more detail later on. After updating the weights, the iteration is repeated until the training is stopped, either internally (the program running the ANN reaches a stopping condition, such as a certain number of iterations) or externally (the user stops the program running the ANN).

A relevant measure of the training amount is called an epoch – one epoch corresponds to one complete processing of the whole dataset by the ANN. For example, if the dataset contains 8000 images and a batch size of 8 is used, then one epoch corresponds to 1000 iterations of the described training process.
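A minimal PyTorch-style sketch of the training loop in Figure 2.10, matching the 8000-sample, batch-size-8 example above; the model, data and hyperparameters are placeholders and not the ones used in this thesis.

    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset

    # Placeholder model, data and hyperparameters for illustration only.
    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

    inputs = torch.randn(8000, 16)          # 8000 samples, as in the example above
    targets = torch.randn(8000, 1)
    loader = DataLoader(TensorDataset(inputs, targets), batch_size=8, shuffle=True)

    for epoch in range(10):                  # one epoch = 1000 iterations with batch size 8
        for batch_inputs, batch_targets in loader:
            predictions = model(batch_inputs)            # forward pass through the layers
            loss = loss_fn(predictions, batch_targets)   # compare prediction against ground truth
            optimizer.zero_grad()
            loss.backward()                              # backpropagate the gradients
            optimizer.step()                             # update the weights with the learning rate
        print(f"epoch {epoch}: last batch loss {loss.item():.4f}")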

Figure 2.10. Generic ANN supervised learning process. The process reports losses and updates the ANN weights every iteration based on the inputs.

The training of ANNs usually consumes plenty of computational resources and time, to the point where powerful GPUs are usually necessary to conduct experiments within a "reasonable time" – typically hours or even days. Even so, the success of the trained system depends on numerous factors, such as the difficulty of the task, the choice and architecture of the system and the selected hyperparameters, such as learning rate and batch size. These choices are usually made on the basis of experience and intuition – although systematic search is theoretically possible, tests are expensive in processor time and thus exhaustive search is impractical or even impossible. [34]

In addition to adding multiple GPUs for parallel computing, batching of the input data also speeds up the training. Larger batches mean fewer iterations, as can be deduced from the example (with 8000 images). In practice, with a reasonable ANN a typical training process would take from tens to thousands of epochs, although the amount greatly depends on numerous factors. However, simply training the ANN for longer does not automatically mean that it will perform better, due to overfitting. This phenomenon is explained in more detail later in this subsection.

Loss function

In ANNs, a loss function maps the difference or error between prediction and ground truth to a numerical value, called loss. Therefore, if the loss function is selected correctly with relation to the task performance (or objective), the system is optimized by minimizing the loss.

(28)

The most straightforward loss function is the mean absolute error (MAE), or $L_1$ loss,

$$L_{\mathrm{MAE}} = \frac{1}{n} \sum_{i=1}^{n} |g_i - y_i|, \qquad (2.9)$$

where $g_i$ and $y_i$ are the ground truth and the prediction for the $i$th sample, respectively, and $n$ is the number of samples. It measures the absolute error of the prediction in relation to the ground truth value. However, many systems benefit from assigning an extra penalty for coarse mistakes. Therefore, the mean squared error (MSE), or $L_2$ loss, is usually the preferred loss function over MAE. MSE is defined as

$$L_{\mathrm{MSE}} = \frac{1}{n} \sum_{i=1}^{n} (g_i - y_i)^2; \qquad (2.10)$$

it measures the squared error of the prediction in relation to the ground truth value.
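For concreteness, the two losses can be transcribed directly from equations (2.9) and (2.10); the PyTorch sketch below is only an illustration, not the loss implementation of the proposed system.

```python
import torch

def mae_loss(g: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Mean absolute error (L1 loss), equation (2.9)."""
    return torch.mean(torch.abs(g - y))

def mse_loss(g: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Mean squared error (L2 loss), equation (2.10); large errors are penalized more heavily."""
    return torch.mean((g - y) ** 2)

# Example: MSE punishes the single coarse mistake (error of 3) more than MAE does.
ground_truth = torch.tensor([1.0, 2.0, 3.0, 4.0])
prediction   = torch.tensor([1.0, 2.0, 3.0, 7.0])
print(mae_loss(ground_truth, prediction))  # tensor(0.7500)
print(mse_loss(ground_truth, prediction))  # tensor(2.2500)
```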

Loss functions more specific to the proposed system are introduced in the subsection 2.3.2.

In some systems, the loss function may consist of multiple loss terms. For example, a system dealing with the rate-distortion optimization problem (discussed in the subsection 2.1.2) would need to have rate loss and distortion loss terms, possibly weighted by a Lagrange multiplier, $\lambda$, to reach a satisfactory balance between the losses. Learning with multiple loss terms is called multi-task learning [33]; the combined loss function is called a multi-objective loss function.
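Such a multi-objective loss is typically combined as a weighted sum; the sketch below is a generic illustration with hypothetical rate and distortion terms, not the exact loss of the proposed system.

```python
import torch

def rate_distortion_loss(rate: torch.Tensor, distortion: torch.Tensor,
                         lmbda: float = 0.01) -> torch.Tensor:
    """Weighted sum of a rate term and a distortion term; lambda sets the trade-off."""
    return rate + lmbda * distortion

# Example with dummy scalar losses; sweeping lambda traces out different
# operating points on the rate-distortion curve.
rate, distortion = torch.tensor(0.35), torch.tensor(12.0)
print(rate_distortion_loss(rate, distortion, lmbda=0.01))  # tensor(0.4700)
```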

Backpropagation

Backpropagation (BP) is used to relay information about the prediction error to all neurons in the ANN, and to update all weights in a manner which ideally reduces the error for subsequent iterations. In practice, BP computes the gradient of the cost function with respect to the weights by the chain rule, one layer at a time, in a backward order. There are various ways to perform the weight update; one popular method is via stochastic gradient descent (SGD) [10].

Gradient descent is an iterative optimization algorithm for finding the minimum of a differentiable function. All possible neuron weights can be thought to form an n-dimensional space where certain weight combinations are more advantageous than others, e.g., reduce the loss function score. The Figure 2.11 (a) illustrates this for two weights with a contour plot, for an unspecified loss function that can be minimized to zero with certain weight values.

Figure 2.11. Gradient descent, example. (a) Gradient descent for two weights in a convex environment. (b) Demonstration of a non-convex problem, where multiple local minima exist for a single weight $w_i$. Weight update term directions are marked for each slope with an arrow. Portions with different local minima are separated with dotted lines.

The algorithm iterates steps in the opposite direction of the gradient of the loss function at a given point, resulting in weights that yield a smaller loss. The initial weight combination is marked with $W^{(0)}$; the subsequent $i$th iterations' weights are marked with $W^{(i)}$. Mathematically, this update is stated as

$$W^{(i+1)} = W^{(i)} - \gamma \nabla L(W^{(i)}), \qquad (2.11)$$

where $\gamma$ is the step size, or learning rate, and $\nabla L(\cdot)$ marks the gradient of the loss function. It needs to be noted that for the algorithm to work properly, the step size needs to be small enough – otherwise, the weight updates would be too great, and the obtained weights would strongly oscillate around the optimal values. However, if the step size is too small, the weights might take unnecessarily many iterations to converge on the optimal solution.
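The update rule of equation (2.11) can be demonstrated on a simple convex loss; the quadratic loss in the NumPy sketch below is purely illustrative.

```python
import numpy as np

def loss(w):                      # a simple convex loss with minimum at w = (2, -1)
    return (w[0] - 2.0) ** 2 + (w[1] + 1.0) ** 2

def grad(w):                      # analytic gradient of the loss
    return np.array([2.0 * (w[0] - 2.0), 2.0 * (w[1] + 1.0)])

w = np.array([0.0, 0.0])          # initial weights W(0)
gamma = 0.1                       # learning rate / step size
for i in range(100):              # iterate W(i+1) = W(i) - gamma * grad L(W(i))
    w = w - gamma * grad(w)

print(w, loss(w))                 # w converges close to (2, -1), loss close to 0
```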

It should also be emphasized that many practical optimization problems are not perfectly convex, i.e., there are many local minima – this is demonstrated in the Figure 2.11 (b). In this case, the gradient descent algorithm could get stuck in a local minimum, which could be problematic.

There exist many gradient descent based algorithms that increase the chance of escaping local minima [75]. Stochastic gradient descent is the same algorithm as the formerly described gradient descent algorithm, except that instead of calculating the gradient for the whole batch, the algorithm selects only one sample from the batch and evaluates the gradient based on it. This makes the algorithm computationally more attractive and introduces more randomness to the training process, making escaping local minima more likely.

The proposed system uses the Adam optimizer (adaptive moment estimation) [43], which is a momentum based extension of SGD. In momentum (running-average) based methods, the learning rate is adapted for each parameter with a momentum, which allows the weight adjustment to depend on the previous change to some degree – a momentum close to 0 emphasizes the gradient, while a momentum close to 1 underlines the last change. In Adam, momenta for both the gradients and the second moments of the gradients are used. Adam's weight update for a single weight is given by

$$w^{(i+1)} = w^{(i)} - \gamma \frac{\hat{m}_w}{\sqrt{\hat{v}_w} + \epsilon}, \qquad (2.12)$$

where $\gamma$ is the learning rate, $\hat{m}_w$ is the running average of the moment of the gradients, $\hat{v}_w$ is the running average of the second moment of the gradients, and $\epsilon$ is a small positive scalar to avoid division by zero. Furthermore, the moment of the gradients is defined as

$$\hat{m}_w = \frac{1}{1-\beta_1^{i+1}} \left( \beta_1 m_w^{(i)} + (1-\beta_1) \nabla_w L^{(i)} \right) \qquad (2.13)$$

and the second moment of the gradients is defined as

$$\hat{v}_w = \frac{1}{1-\beta_2^{i+1}} \left( \beta_2 v_w^{(i)} + (1-\beta_2) (\nabla_w L^{(i)})^2 \right), \qquad (2.14)$$

where $\beta_1$ and $\beta_2$ are forgetting scalars for the gradients and the second moments of the gradients, respectively. Note that the absence of brackets in $\beta_j^{i+1}$ means that the scalar is raised to the power of $i+1$, not that it is the value during the $i$th iteration – the forgetting scalars are constants. Typical choices for these parameters are $\epsilon = 10^{-8}$, $\beta_1 = 0.9$ and $\beta_2 = 0.999$.

The Adam algorithm is computationally efficient and has been empirically proven to work well with a wide range of systems. [43]
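In practice, the update of equations (2.12)–(2.14) is rarely implemented by hand; for example, PyTorch provides it as torch.optim.Adam with the typical default parameters mentioned above. The snippet below is a generic usage sketch with a hypothetical model.

```python
import torch

model = torch.nn.Linear(128, 10)                # hypothetical model
optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-3,            # learning rate gamma
                             betas=(0.9, 0.999), # forgetting scalars beta1 and beta2
                             eps=1e-8)           # epsilon avoiding division by zero

# One optimization step: compute loss, backpropagate, let Adam update the weights.
x, target = torch.randn(4, 128), torch.randn(4, 10)
loss = torch.nn.functional.mse_loss(model(x), target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```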

Overfitting and regularization

A fundamental challenge in machine learning is that the trained system needs to achieve high performance on unobserved data. A system that cannot generalize well to data other than the training data is in most cases useless. [34] This is the reason why the data is split into training, validation and testing sets, as previously mentioned. The training phase performance is measured with the validation data, and the final performance is measured with the testing data. This arrangement ensures that the system's generalization capability is taken into account when evaluating the performance.

Generalization errors are divided into two categories. Underfitted models have learned a representation of the data that is too simple; overfitted models have learned a representation that is too complex. An example is provided in the Figure 2.12. When fitting a model to the data, Occam's razor principle can be applied. The principle states that of competing hypotheses that explain a phenomenon, or data in this case, the simplest explanation should be chosen. In the example, both the well-generalizing and the overfitting models explain the data well, i.e., the errors between the samples and the model are fairly small. However, the well-generalizing model is simpler and thus should be preferred over the overfitting model. [34]

Figure 2.12. Different fits, example. Left: A linear model is not capable of representing the true properties of the data. Middle: A 4th order polynomial is a good fit for the data. Right: A 10th order polynomial is too complicated a model for the data; thus, the model did not capture its true characteristics.

Typical ANNs could contain millions of weights, meaning that they are capable of producing really complicated models. Therefore, overfitting becomes a characteristic problem. The methods that combat overfitting are called regularization methods.

Apart from the number of weights in the model, there are also other factors that cause overfitting. At the start of the training, the model performs poorly on both the training and validation data. As the training proceeds, a usual pattern is that performance on both datasets increases. However, while the performance on the training data normally keeps improving, the performance on the validation data usually starts declining at some point. This is empirically found to be the point in training that indicates that the system has started to overfit. Hence, the simplest way to combat this type of overfitting is to stop the training when the performance on the validation data starts declining.

Another factor that plays a huge role in overfitting is the amount of data. The larger the training dataset is, the less likely the trained system is to overfit. This is due to the larger dataset representing a larger population of all the possible data of the domain, reducing the proportion of sampling noise in the total information available. There are also many other regularizing methods, such as the introduction of a penalty for large weights (l2-regularization) or for excessive amounts of non-zero weights (l1-regularization), and, specifically for ANNs, dropout methods where some portion of the neurons is deactivated randomly during training. [82]
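A few of the mentioned regularization techniques can be sketched compactly in PyTorch; the model, the hyperparameter values and the early-stopping helper below are illustrative assumptions rather than the configuration of the proposed system.

```python
import torch
import torch.nn as nn

# Dropout randomly deactivates a portion of the neurons during training (here 50 %).
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(64, 10))

# The weight_decay argument adds an L2 penalty that discourages large weights.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

def early_stopping(val_losses, patience=5):
    """Return True when the validation loss has not improved for `patience` epochs."""
    if len(val_losses) <= patience:
        return False
    best_so_far = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_so_far

# Example: the validation loss stops improving after the fourth epoch, so training should stop.
print(early_stopping([1.0, 0.8, 0.7, 0.65, 0.66, 0.7, 0.71, 0.72, 0.74, 0.75]))  # True
```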

2.2.6 Convolutional neural networks

Convolutional neural networks (CNN) are specialized ANNs for data with a grid-like topology, such as images. CNN layers employ a convolution operation instead of general matrix multiplication. The operation differs from the mathematical definition of convolution; in addition, there exist multiple implementations that perform the convolution in principle but vary slightly. The typical implementation is actually a cross-correlation operation, but here it is called convolution in accordance with the literature. For 2-dimensional data, the operation can be defined as

$$Y(i, j) = (K * X)(i, j) = \sum_{m} \sum_{n} X(i+m, j+n) K(m, n) \qquad (2.15)$$

where $X(i, j)$ and $Y(i, j)$ are the input and output data in the $i$th row and $j$th column, respectively, and $K$ is an $M \times N$ kernel (sometimes called a filter). The output data of a CNN layer is usually called a feature map. The kernel contains weights that are learned during the training of the network. Typically, the kernel dimensions are much smaller than the data dimensions. [34]
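The operation of equation (2.15) can be written out explicitly; the NumPy sketch below implements the unpadded, unit-stride cross-correlation that CNN frameworks commonly call convolution.

```python
import numpy as np

def conv2d(X: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Cross-correlation of a 2-D input X with an M x N kernel K (equation 2.15)."""
    M, N = K.shape
    H, W = X.shape
    Y = np.zeros((H - M + 1, W - N + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = np.sum(X[i:i + M, j:j + N] * K)   # sum over X(i+m, j+n) K(m, n)
    return Y

# Example: a 3x3 kernel applied to a 5x5 input yields a 3x3 feature map.
X = np.arange(25, dtype=float).reshape(5, 5)
K = np.array([[-1.0, 0.0, 1.0]] * 3)
print(conv2d(X, K).shape)   # (3, 3)
```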

CNN is a widely popular architecture choice, especially for systems with natural image input data. There are several reasons for this, the most important of which is sparsity. A typical image consists of thousands or even millions of parameters (pixels). This large volume of parameters would be computationally extremely expensive with fully connected layers. However, natural images are spatially contiguous at least to some extent [81], and consist of instances at certain locations that can be thought of as consisting of smaller instances within those locations, and so on. For example, a portrait image consists of a person; within the person, a face is located; within the face, eyes are located; within the eyes, various shapes are located; within the shapes, various edges and other primitives are located. This is visualized in the Figure 2.13.

Figure 2.13. A natural image is composed of recursive instances, example. The original image, "Lena", is from the USC-SIPI image database [93].

Convolution operates only on the region of the kernel – therefore, only local pixels are taken into account when feeding forward to the next layer. This reduces the number of weights dramatically and achieves sparsity. The neurons in the initial layers are able to "see" only the most primitive details. However, with further layers, the total receptive field increases, and the final layers obtain details from the whole image, therefore allowing the observation of higher-level features. This is also illustrated in the Figure 2.14 (right), where the input $x_3$ has a sparse interaction with $w_{21}$ and $w_{22}$ in the next layer.

The figure represents a typical CNN layer. After the convolution phase, an activation function is applied (these were discussed in the subsection 2.2.3). Finally, the CNN layer may include a pooling layer. Pooling layers reduce the resolution of the feature maps by aggregating values over local neighborhoods, for example by taking the maximum (max pooling) or the average (average pooling) within each window.
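A layer of this kind (convolution, activation, pooling) can be expressed in PyTorch, for example, as follows; the channel counts and kernel sizes are arbitrary illustrations.

```python
import torch
import torch.nn as nn

# Convolution -> activation -> pooling, as described above.
cnn_layer = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1),  # learnable kernels
    nn.ReLU(),                                                            # activation function
    nn.MaxPool2d(kernel_size=2)                                           # halves the resolution
)

x = torch.randn(1, 3, 64, 64)     # a single RGB image of size 64x64
feature_map = cnn_layer(x)
print(feature_map.shape)          # torch.Size([1, 16, 32, 32])
```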
