
Color Constancy with Small Dataset via Pruning of CNN Filters

Examiners: Professor Moncef Gabbouj
Faculty of Information Technology and Communication Sciences (ITC)
Master's thesis
December 2020


Sahar Husseini: Color Constancy with Small Dataset via Pruning of CNN Filters
Master's thesis
Tampere University
Master's Degree Programme in Advanced Studies in Machine Learning
Examiners: Professor Moncef Gabbouj
December 2020

Color constancy is an essential part of the Image Signal Processor (ISP) pipeline, which removes the color bias of the captured image generated by scene illumination.

Recently, several supervised algorithms, including Convolutional Neural Network (CNN) based methods, have been shown to perform well on this problem. However, they usually require a large amount of annotated data that reflects the complexity of realistic photometric effects and illumination in real scenes. It is time-consuming and costly to collect many raw images of various scenes with different lighting conditions and to measure the corresponding illumination values.

Transfer learning, which focuses on collecting the knowledge gained when solving one problem and applying it to different but related problems, answers this need for large data: we transfer features from a CNN trained on a source task with a large-scale dataset to the color constancy task.

Most modern state-of-the-art CNN models are designed primarily for the image classification task and are focused on training with deeper structures. One of the main disadvantages of deep convolutional neural networks is that they suffer from vanishing and exploding gradients. These deep structures also tend to overfit the data.

Moreover, having too many convolutional filters in these CNN models is not always beneficial to the network and can negatively influence accuracy because of the many useless features. When the target task has a small dataset, this redundancy is even more harmful.

To reduce the dependence on a large-scale labeled dataset and to take advantage of standard, well-known CNN architectures, we propose an approach to create an efficient color constancy algorithm. First, we utilize a structured channel pruning method named network slimming to thin our baseline model. It directly imposes sparsity-inducing regularization on the scaling factors in the batch normalization layers, so less important channels are automatically identified during training and then pruned. In this way, we iteratively pruned 75% of the channels of a MobileNet variant used as our model's backbone, trained on a large-scale classification dataset.

This means the backbone with its classification head is used for the network pruning stage.

The resulting compact model was then transferred to the color constancy task and trained on a small dataset. During this training we applied the DSD technique, which regularizes the network iteratively: connection importance is learned during an initial dense training phase and unimportant connections are pruned; the pruned connections are then recovered and the whole network is retrained. Experimental results show that the proposed method reaches performance comparable to other state-of-the-art models, produces fewer MACs, and significantly decreases computational cost.

Keywords: Color constancy, Transfer learning, Deep learning, Convolutional neural network, Network pruning, Optimization

The originality of this thesis has been checked using the Turnitin Originality Check service.

AI Artificial Intelligence
ANN Artificial Neural Network
AO Average Overlap
APoZ Average Percentage of Zeros
BoCF Bag of Color Features
BoF Bag of Features
CC Color Constancy
CCC Computational Color Constancy
CIE International Commission on Illumination
CNN Convolutional Neural Network
CWP Confidence Weighted Pooling
DNN Deep Neural Network
DSD Dense Sparse Dense
EAO Expected Average Overlap
GD Gradient Descent
GOP Generalized Operational Perceptron
GPU Graphics Processing Unit
HVS Human Visual System
HypNet Hypotheses Network
IFM Input Feature Map
IR InfraRed
ISP Image Signal Processor
L Long
LC Log-Chroma
M Middle
MAC Multiply Accumulate Computations
MCC MacBeth Color Checker
MCDE Monte Carlo Dropout Ensembles
OFM Output Feature Map
PCA Principal Component Analysis
S Short
SD Semi-Dense
SelNet Selection Network
SGD Stochastic Gradient Descent
SPD Spectral Power Distribution

This Master of Science thesis was carried out in 2019–2020 at Tampere University's Computing Sciences unit as part of a project with Intel Corporation.

First of all, I would like to express my sincere gratitude to my supervisor, Prof. Moncef Gabbouj, whose expertise was invaluable in formulating the research questions and methodology. His insightful feedback pushed me to sharpen my thinking and brought my work to a higher level.

I want to thank the multimedia research group members, especially Firas Laakom, for their support. I also want to acknowledge my colleagues Antti Stenhäll, Tapio Finnilä, and Lasse Lampinen at Intel Corporation for their support during my internship.

I would especially like to express my deepest appreciation to my family for all the unconditional support, patience, and encouragement. I could not have done this without them. I also want to thank my dear brother Kurosh, for his kindness and sympathetic ear during these years.

Finally, I want to extend my sincere thanks to my dear friends: Riikka Oksanen, who always comforts me with the knowledge that she is there for me, Yasmin Zhinaka, who keeps me laughing, and Marjo Karnaatti, for all the great days we spent together at Tampere University.


1 Introduction
2 Theoretical background
  2.1 Color theory and image formation
    2.1.1 Illumination
    2.1.2 Reflectance
    2.1.3 Visual system response
  2.2 Color space
    2.2.1 RGB color space
    2.2.2 XYZ color space
    2.2.3 Chromaticity diagram
    2.2.4 XYZ-RGB color space conversion
    2.2.5 Color chart
    2.2.6 Gamma correction
  2.3 Color constancy problem formulation
  2.4 Image correction
  2.5 Color constancy methods
    2.5.1 Static methods
    2.5.2 Gamut based methods
    2.5.3 Learning-based methods
3 Related work
4 Optimization of deep learning models
  4.1 Network architectures
  4.2 Optimization
    4.2.1 L1, L2-Regularization
    4.2.2 Dropout and DSD
    4.2.3 Batch normalization
  4.3 Fundamentals of network pruning
    4.3.1 Pruning levels and criteria
    4.3.2 Filter pruning
    4.3.3 Filter pruning of residual blocks
    4.3.4 Network slimming
5 Experimental result
  5.1 Color constancy dataset
  5.2 Angular loss function
  5.3 Implementing details


1 Introduction

Color is an essential part of visual information. In the Human Visual System (HVS), color formation proceeds as follows: 1) visible light interacts with objects in the scene; 2) the objects reflect a part of the light; 3) the reflected light reaches the human eyes, and the object's color is constructed in the human brain [14, 44]. The processing pipeline of a digital camera transforms the raw data captured by the sensor into a representation of the original scene that should be as close as possible to what a human observer would have perceived in the original scene. However, digital camera sensors are not accurate and do not encode colors the way the human visual system does. The colors visible in images are determined by the intrinsic characteristics of objects' surfaces and by the color of the illuminant. Hence, the colors in an image change easily with the illuminant and might appear "blueish" under daylight and "yellowish" under indoor light. Since most computer vision applications, such as tracking, object detection, and segmentation, rely only on the objects' intrinsic features, they expect the input images to be free of this color bias. The ability to perceive approximately constant colors when objects are lit by different illuminants is called color constancy [44, 60].

The HVS resolves this task through a complex mechanism that includes color adaptation, color memory, and other characteristics of human vision. However, modern camera module systems do not automatically compensate for the illuminant color. To address this problem, different Computational Color Constancy (CCC) methods [19, 94] have been proposed to emulate the dynamic adaptation of the cones in the HVS. CCC eliminates the color cast from the image in two steps: estimation of the light source color and color correction. After illumination estimation, which is the main focus of this thesis, color correction algorithms transform the image pixels such that the corrected image appears to be taken under a canonical light source [44].

Illuminant estimation for a single image captured with a regular digital camera is an ill-posed problem, since both the intrinsic characteristics of a surface and the color of the light source have to be estimated while only the actual image pixels are known. Therefore, additional simplifying assumptions, such as assumptions on the color distribution within an image or restricted gamuts, are needed to solve this problem. Despite the high importance of this problem, a universal solution has not yet been found. CCC methods can be separated into three broad categories: static methods, gamut-based methods, and machine learning-based methods.

In recent years, the growth of machine learning techniques, and particularly Convolutional Neural Networks (CNNs), has facilitated more accurate CC methods. Considering that most of the original CNNs designed specifically for color constancy are quite simple and composed of only a few layers, we propose improving their efficiency with a more powerful network and optimization techniques widely used in deep learning. Deep learning-based algorithms have reached state-of-the-art performance in many computer vision tasks, e.g., image classification [4, 110], object detection [3], and semantic segmentation [31, 109]. Large-scale datasets, new target platforms, improved network architectures, and deeper neural networks allow the development of unprecedentedly large CNN models. However, the deployment of CNNs in real-world computer vision applications is often constrained by:

1. Model size: CNNs' strong representation power comes from their millions of trainable parameters. Together with the network structure information, these parameters need to be stored on disk and loaded into memory at inference time.

2. Run-time memory: During inference, the intermediate activations of CNN layers, even with a batch size of one, can take more memory than storing the model parameters. This is not a severe difficulty on high-end GPUs, but it is too costly for many applications running on hardware with low computational power.

3. Number of computing operations: Convolution operations are computationally intensive on high-resolution images. A deep CNN may need several minutes to process a single input on a mobile device, making it unrealistic to adopt in real applications. As networks become deeper, the number of parameters and the computational cost of the convolutional layers keep growing, and consequently the inference time increases.

4. Large-scale dataset: These CNNs usually must be trained on large-scale datasets. It is a generally accepted notion that larger datasets result in better deep learning models [47]. However, assembling enormous datasets can be a very daunting task due to the manual effort needed to collect and label the data.

To build a useful CC deep learning model, we present a simple yet effective network training scheme that addresses all the aforementioned challenges of deploying large CNNs under limited resources and with little training data. We propose a network compression algorithm with five main steps:

1. The network architecture (a modified MobileNet version in our experiments) is defined and then trained on a large classification dataset.

2. The network is thinned by a structured pruning method that iteratively removes redundant channels using the same classification dataset (a sketch of this idea is given after this list).

3. The knowledge learned on the classification task is transferred to color constancy. In transfer learning, a model is first trained on a large volume of data, learning its weights and biases. The model is then embedded into a new model for the target task, which can be initialized with the pre-trained weights and fine-tuned with the target dataset. In our proposed pipeline, the pruned compact model is transferred and trained to do color constancy.

4. The thinned model is retrained on a small CC dataset.

5. Finally, the Dense-Sparse-Dense (DSD) training flow is used to regularize the network and achieve better optimization performance.

The proposed method simultaneously reduces the model size, run-time memory, and number of computing operations while introducing minimal overhead to the training process, and the resulting model requires no special libraries or hardware for efficient inference.
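To make the pruning step (step 2 above) concrete, the fragment below sketches the core idea of network slimming as described in this thesis: an L1 penalty on the batch-normalization scaling factors is added to the training loss, and channels whose scaling factors end up small are selected for removal. It is a schematic PyTorch illustration with assumed hyperparameter values, not the exact training code used in this work.

```python
import torch
import torch.nn as nn

def bn_l1_penalty(model, strength=1e-4):
    """Sparsity-inducing L1 regularization on BN scale factors (network-slimming style)."""
    penalty = 0.0
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            penalty = penalty + m.weight.abs().sum()   # gamma of each channel
    return strength * penalty

def pruning_threshold(model, prune_ratio=0.75):
    """Global threshold: channels whose |gamma| falls below it are candidates for pruning."""
    gammas = torch.cat([m.weight.detach().abs().flatten()
                        for m in model.modules() if isinstance(m, nn.BatchNorm2d)])
    k = int(prune_ratio * gammas.numel())
    return torch.sort(gammas).values[k]

# During training, the penalty is simply added to the task loss:
#   loss = task_loss + bn_l1_penalty(model)
# After training, channels with gamma below pruning_threshold(model) are removed
# and the thinned network is fine-tuned on the target task.
```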

The rest of this thesis is organized as follows. Chapter 2 presents the basics of color theory and the theoretical background. Chapter 3 presents related work. Chapter 4 presents different deep neural network architectures and optimization techniques such as regularization and batch normalization, as well as the fundamentals of network pruning. Chapter 5 presents the implementation details and our experimental results. Finally, the thesis is concluded in Chapter 6.


2 Theoretical background

This chapter presents the basic knowledge and concepts of color constancy that will be used throughout the thesis. First, section 2.1 presents the main components contributing to color formation. This is followed by section 2.2, where the color chart, gamma correction, and different color spaces and their conversions are discussed. Section 2.3 presents the color constancy problem formulation using mathematical models. Finally, section 2.5 covers the different color constancy methods used for illumination estimation.

2.1 Color theory and image formation

Color is an important intrinsic attribute of an object and is formed by the interaction of three main components:

• Illumination

• Surface reflectance

• Visual system response

In the next subsections, we explain these three components in more detail.

2.1.1 Illumination

The existence of a light source is necessary to observe the colors of objects in a scene. An illuminant is a mathematical description of a light source and consists of multiple wavelengths. In the 17th century, Sir Isaac Newton discovered that sunlight (white light) passing through a glass prism splits up into a color spectrum with wavelengths in the interval of 400-700 nm [46]. Electromagnetic radiation in this range of wavelengths is called visible light. Figure 2.1 illustrates the electromagnetic spectrum.


Figure 2.1 The electromagnetic spectrum.

The visible light that we can see is just one small portion of the electromagnetic spectrum [93]. Each color corresponds to a specific electromagnetic wavelength in the range of visible light. Wavelength is usually denoted by λ, and its unit of measurement is the nanometer (nm).

Human eyes decode each visible wavelength into a different color. For instance, the wavelengths 435.8 nm, 546.1 nm, and 700 nm give the sensation of blue, green, and red, respectively. Wavelengths below 400 nm are called ultraviolet (UV), and wavelengths greater than 700 nm are referred to as infrared (IR) [46, 30]. Analysis of UV and IR wavelengths is important in medical imaging and remote sensing; in this thesis our focus is on digital camera systems, so we deal with the visible spectrum.

The illuminant (light source) can be described by a normalized curve known as the Spectral Power Distribution (SPD). This curve captures the characteristics of the light: it shows the light's power at different wavelengths in the visible spectrum [84].

Figure 2.2 illustrates the spectral power distribution of different illuminants.


Figure 2.2 Spectral power distribution of different illuminants [84]. (a) sunlight, (b) tungsten light, (c) fluorescent light and (d) Light-Emitting Diode (LED)

2.1.2 Reflectance

In addition to the illuminant, the surface properties of the objects in the scene are involved in color formation. When radiant energy interacts with an object's surface, the object partly or wholly absorbs, reflects, or transmits the light. The reflected wavelengths define the color of the object. For instance, when our brain perceives the color of an object as red, it means that the object reflects the wavelengths belonging to red (635-700 nm) and absorbs or transmits all other wavelengths; similarly, when our brain perceives white, the object reflects all wavelengths uniformly [12, 30].

How much of the illuminant an object reflects or transmits depends on the object's material; it is an intrinsic feature of the object, independent of the illumination. Thus we can define an object's spectral reflectance or spectral transmittance as a function of wavelength [84]. Figure 2.3 illustrates the spectral reflectance of different color patches of the GretagMacbeth color checker.


Figure 2.3 Spectral reflectance of different color patches of GretagMacbeth color checker [84]. (a) red patch, (b) light blue patch, (c) yellow patch and (d) gray patch

2.1.3 Visual system response

The third component involved in color formation is the visual system (i.e., the human visual system or a camera module system). The human visual system consists mainly of the eyes, optic nerves, and brain. Figure 2.4 illustrates the human eye and the elements of visual perception. Each of these elements can be compared with a part of a digital imaging system: for instance, the eye can be compared with the camera sensor, and the optic nerve with the transmission path.


Figure 2.4 Human visual system.

The color signal formation procedure in the human brain is as follows: the illuminant reflected from the object surface enters the human visual system through the cornea. The light then reaches the pupil and is refracted by the lens. The cornea and lens act jointly as a compound lens to project an inverted image of the object onto the retina. This image is sent to the brain by the optic nerves for interpretation. The retina has two types of photoreceptors, known as rods and cones [46]. The rods are responsible for scotopic (dim-light) vision and give an overall picture of the field of view, while the cones give us the sensation of color vision. Cones are also responsible for perceiving bright light and details in the scene. There are three types of cones, known as short (S), middle (M), and long (L). The sensitivity peaks of these cones lie in the ranges 420–440 nm (S), 530–540 nm (M), and 560–580 nm (L) [30]. These responses can also be denoted as s(λ), m(λ), and l(λ).

Figure 2.5 presents the normalized spectral sensitivity response of the cone cells.

Figure 2.5 The normalized spectral sensitivity response of human cone cells for the short (S), middle (M), and long (L) wavelength types.

As discussed in subsections 2.1.1 and 2.1.2, the SPD of the illuminant and the surface reflectance are also involved in color formation. The product of the illuminant's SPD and the surface reflectance determines the spectral power distribution of an object, which is known as the color stimulus [84]. The color stimulus is given by Equation 2.1:

f(λ) = E(λ) S(λ)    (2.1)

where f(λ) is the color stimulus of the object, E(λ) is the SPD of the illumination, and S(λ) is the surface reflectance. When a color stimulus f(λ) is perceived, each of the three types of cones reacts to the stimulus by summing up the response over all wavelengths. From these three independent values (Equations 2.2-2.4), color information can be perceived.

X = ∫_λ l(λ) E(λ) S(λ) dλ    (2.2)

Y = ∫_λ m(λ) E(λ) S(λ) dλ    (2.3)

Z = ∫_λ s(λ) E(λ) S(λ) dλ    (2.4)

Here X, Y, and Z correspond to the responses of the long, middle, and short cones, respectively, and the triplet (X, Y, Z) is called the trichromatic response. The lowercase s(λ) in the equations is the short-wavelength cone response, while the capital S(λ) is the object reflectance. E(λ) denotes the SPD of the illuminant, and the integration is performed over the range of visible light λ. A camera sensor is very similar to the human visual system in that it uses three color filters; these filters attempt to mimic the human cone cells and generate a trichromatic response. An important consequence of the digital camera's trichromatic response is that only three numbers per pixel are needed to capture the color information. The trichromatic response can be expressed in different color spaces such as XYZ or RGB, which are discussed in more detail in section 2.2.
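To make Equations 2.2-2.4 concrete, the short sketch below approximates the tristimulus integrals numerically from sampled spectra. The 10 nm sampling grid and the spectral curves are placeholder assumptions for illustration only, not data from this thesis.

```python
import numpy as np

# Hypothetical spectral samples on a 10 nm grid over the visible range (assumed data).
wavelengths = np.arange(400, 701, 10)            # lambda in nm
step = 10.0                                      # integration step d(lambda)
E = np.ones_like(wavelengths, dtype=float)       # illuminant SPD E(lambda), flat here
S = np.linspace(0.2, 0.8, wavelengths.size)      # surface reflectance S(lambda)
# Cone (or camera) sensitivities l(lambda), m(lambda), s(lambda); placeholder Gaussians.
l = np.exp(-((wavelengths - 570) / 50.0) ** 2)
m = np.exp(-((wavelengths - 540) / 45.0) ** 2)
s = np.exp(-((wavelengths - 440) / 30.0) ** 2)

# Equations 2.2-2.4: integrate the color stimulus E*S against each sensitivity curve.
stimulus = E * S                                  # f(lambda) = E(lambda) S(lambda), Eq. 2.1
X = np.sum(l * stimulus) * step
Y = np.sum(m * stimulus) * step
Z = np.sum(s * stimulus) * step
print(X, Y, Z)                                    # trichromatic response (X, Y, Z)
```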

2.2 Color space

As discussed in the earlier sections, the color sensation can be described with three parameters called the color tristimulus values; the amounts of these three primaries define the target color. A color space is a method that associates colors with these tristimulus values. It is therefore described by three primaries and their corresponding matching functions, so that each color can be represented as a point in a 3D plot. In this section, the most common color spaces are explained.


2.2.1 RGB color space

In 1931, the International Commission on Illumination (CIE, for its French name) [27] proposed two sets of color matching functions for a standard observer, known as CIE XYZ and CIE RGB. These standards enable a consistent international method for measuring color differences and establishing color tolerances. The CIE RGB standard is still one of the most used today and is based on the idea that any color can be represented by a combination of the three primary colors red, green, and blue (additive color mixture) [12]. The CIE RGB color space is defined by three color matching functions (Equations 2.5-2.7), which are based on the tristimulus values of the HVS and represent the tristimulus values of the complete spectrum [12, 30].

R = ∫_λ E(λ) S(λ) r(λ) dλ    (2.5)

G = ∫_λ E(λ) S(λ) g(λ) dλ    (2.6)

B = ∫_λ E(λ) S(λ) b(λ) dλ    (2.7)

where E(λ) and S(λ) correspond to the SPD of the illuminant and the surface reflectance, and r(λ), g(λ), b(λ) are the color matching functions. Figure 2.6 illustrates the spectral tristimulus values for the CIE RGB system of colorimetry, whose monochromatic primaries peak at 435.8 nm (blue), 546.1 nm (green), and 700.0 nm (red).

Figure 2.6 Color matching functions for CIE RGB standard observer.

The RGB color space can be represented as a triangle inside the tongue-shaped chromaticity diagram. Each vertex of the triangle represents one primary color: red, green, or blue. All the colors within this triangle can be created using different RGB values [116]. There exist different RGB working spaces based on the same principle, such as Adobe RGB, Apple RGB, sRGB, etc. Subsection 2.2.3 illustrates the chromaticity diagram and the region corresponding to the RGB color space within it in more detail.

2.2.2 XYZ color space

In the early 1930s the CIE established another color space, XYZ, which relies on the same principles as the RGB color space. This color space encompasses all the colors that can be perceived by the human visual system. The motivation for creating it was to remove the negative values in the color matching functions and to force one of the matching functions to correspond to brightness, or luminance [30]. This was achieved by choosing two of the primaries, X and Z, to be independent of brightness and correspond to chromaticity, with the Y value corresponding to the luminance response. The XYZ tristimulus values for colored stimuli are computed in the same fashion as the RGB tristimulus values:

X = (k/N) ∫_λ E(λ) S(λ) x(λ) dλ    (2.8)

Y = (k/N) ∫_λ E(λ) S(λ) y(λ) dλ    (2.9)

Z = (k/N) ∫_λ E(λ) S(λ) z(λ) dλ    (2.10)

where λ is defined in the range of 380 nm to 780 nm, and x(λ), y(λ), and z(λ) are the color matching functions for red, green, and blue. N = ∫_λ E(λ) y(λ) dλ, and k is a scaling factor, usually set to 100. The k/N normalization results in tristimulus values that are scaled from zero to approximately 100 for different materials. It is helpful to note that if relative colorimetry is used to calculate the tristimulus values of a light source, the Y tristimulus value is always equal to 100.


Figure 2.7 Color matching functions for the CIE XYZ standard observer.

2.2.3 Chromaticity diagram

The concept of color can be split into two parts: brightness and chromaticity. The XYZ color space is three-dimensional; however, to provide a suitable two-dimensional representation of a color's quality regardless of its brightness, the chromaticity coordinates can be expressed on a 2D chromaticity diagram, as shown in Figure 2.8 [12]. The transformation from three-dimensional to two-dimensional space is performed through a normalization step, which removes the dimension corresponding to the Y coordinate. As mentioned earlier, the Y coordinate corresponds to luminance, while X and Z correspond to chromaticity. The normalization is performed through Equations 2.11-2.13, where each value is divided by the sum of the XYZ values. Since the normalized X, Y, and Z components sum to 1 and there are only two dimensions of information in the chromaticity coordinates, the third chromaticity coordinate can always be recovered from the other two [30]; for instance, z can be computed from x and y using Equation 2.13.

x = X / (X + Y + Z)    (2.11)

y = Y / (X + Y + Z)    (2.12)

z = Z / (X + Y + Z) = 1 − x − y    (2.13)
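As a quick illustration of Equations 2.11-2.13, the following helper projects an XYZ triplet onto the chromaticity plane; the numeric input is an arbitrary example.

```python
def xyz_to_xy(X, Y, Z):
    """Normalize XYZ to chromaticity coordinates (Equations 2.11-2.13)."""
    total = X + Y + Z
    x = X / total
    y = Y / total
    z = 1.0 - x - y   # recoverable from x and y (Equation 2.13)
    return x, y, z

# Example with an arbitrary XYZ triplet (illustrative values only).
print(xyz_to_xy(41.24, 21.26, 1.93))
```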

Figure 2.8 shows all the chromaticities visible to the human visual system (i.e., the gamut of human vision) in the chromaticity diagram. The curve around the tongue-shaped diagram is called the spectral locus and corresponds to monochromatic light, measured in nanometers. The colors near the edge of the diagram are fully saturated, and the saturation decreases toward the center of the diagram. As can be seen, the colors in the center of the diagram approach white because the fractions of the primary colors are equal (chromaticity coordinates x = 0.3333 and y = 0.3333). This color is called the CIE standard white point. Note that the white point is related to color and not to intensity or illuminant: when we want to know the white point of an illuminant, we want to know the chromaticity of a white object under that illuminant [116, 30].

The curved line in the middle of the diagram, starting at approximately 1500 kelvins, is the Planckian locus. This curve shows the chromaticities of a special type of theoretical light source parameterized by its absolute temperature, referred to as a black-body or Planckian radiator. For any light source that is approximately white, we can define a "color temperature". For instance, CIE illuminants A and D65 represent two different Planckian radiators with color temperatures of 2856 K and 6504 K, respectively [93].

The triangle inside the diagram shows the RGB color space; the coordinates of the RGB primaries are at the vertices of the triangle.

Figure 2.8 CIE 1931 chromaticity diagram with the Planckian locus.

2.2.4 XYZ-RGB color space conversion

The previous subsections introduced the most important color spaces. This subsection discusses the color conversion from RGB to XYZ and vice versa. Before applying a color space conversion, three important points must be considered:


• The RGB values must be in a nominal range of [0.0, 1.0].

• The reference white point must be the same in both color spaces.

• The RGB values must be linear. For instance, if we have gamma-corrected RGB values (see 2.2.6), we need to apply inverse gamma companding to get a linear RGB image.

The conversion of RGB values to the CIE XYZ tristimulus values of a color can be achieved using Equation 2.14,

| X |       | R |
| Y | = M   | G |    (2.14)
| Z |       | B |

where M is a 3×3 matrix computed as follows:

    | S_r X_r   S_g X_g   S_b X_b |
M = | S_r Y_r   S_g Y_g   S_b Y_b |    (2.15)
    | S_r Z_r   S_g Z_g   S_b Z_b |

Each element of the matrix M can be calculated with Equations 2.16-2.25.

X_r = x_r / y_r    (2.16)

Y_r = 1    (2.17)

Z_r = (1 − x_r − y_r) / y_r    (2.18)

X_g = x_g / y_g    (2.19)

Y_g = 1    (2.20)

Z_g = (1 − x_g − y_g) / y_g    (2.21)

X_b = x_b / y_b    (2.22)

Y_b = 1    (2.23)

Z_b = (1 − x_b − y_b) / y_b    (2.24)

| S_r |   | X_r  X_g  X_b |⁻¹ | X_W |
| S_g | = | Y_r  Y_g  Y_b |   | Y_W |    (2.25)
| S_b |   | Z_r  Z_g  Z_b |   | Z_W |

where (x_r, y_r), (x_g, y_g), and (x_b, y_b) are the chromaticity coordinates of the RGB system and (X_W, Y_W, Z_W) is the reference white point of the color space [78].

The conversion from XYZ to linear RGB can be done with the inverse of the matrix M:

| R |        | X |
| G | = M⁻¹  | Y |    (2.26)
| B |        | Z |

The result will be in the range 0.0-1.0, and the reference white must again be the same for both the source and target color spaces.
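The following sketch assembles the matrix M of Equations 2.15-2.25 from the primaries' chromaticity coordinates and the reference white, and then applies Equations 2.14 and 2.26. The sRGB primaries and D65 white point used in the example are common reference values chosen only for illustration.

```python
import numpy as np

def rgb_to_xyz_matrix(xy_r, xy_g, xy_b, white_XYZ):
    """Build M from primary chromaticities and the reference white (Eqs. 2.15-2.25)."""
    def column(xy):
        x, y = xy
        return np.array([x / y, 1.0, (1.0 - x - y) / y])   # (X, Y, Z) per Eqs. 2.16-2.24
    P = np.column_stack([column(xy_r), column(xy_g), column(xy_b)])
    S = np.linalg.solve(P, np.asarray(white_XYZ))           # (S_r, S_g, S_b), Eq. 2.25
    return P * S                                            # scale each column, Eq. 2.15

# Example: sRGB primaries with a D65 white point (assumed reference values).
M = rgb_to_xyz_matrix((0.64, 0.33), (0.30, 0.60), (0.15, 0.06), (0.9505, 1.0, 1.0890))

rgb_linear = np.array([0.2, 0.5, 0.8])       # linear RGB in [0, 1]
xyz = M @ rgb_linear                          # Equation 2.14
rgb_back = np.linalg.inv(M) @ xyz             # Equation 2.26
print(xyz, rgb_back)
```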

2.2.5 Color chart

The colors rendered in photography, print, television, computer monitors, etc. are not the same as the object colors seen in reality. Color charts can be used to compare and evaluate reproduced colors quantitatively. The color rendition chart called the Macbeth chart includes 24 different natural color patches, containing a wide range of chromatic and achromatic colors ranging from white to black. Figure 2.9 shows an image of the color chart. The color chart values in a captured image can be compared to reference values (e.g., ground truth) to detect a possible color bias in the image caused by the illuminant color. Using the color chart values and the reference values, we can estimate the illuminant in the image and apply a color correction to the image pixels so that the image appears to have been captured under the canonical illuminant [86, 87]. Color correction and color cast are discussed in more detail in section 2.3.


Figure 2.9 Image containing the Macbeth color chart. The colors of the chart can be compared with a reference chart and corrected if a color cast is observed in the image.

Figure 2.10 shows the values corresponding to each patch in the color checker. For instance, the CIE-x, CIE-y, and CIE-Y values corresponding to the dark skin patch are 0.400, 0.350, and 10.1, respectively.

Figure 2.10 Values corresponding to each patch in the Macbeth chart [87].


2.2.6 Gamma correction

The brightness changes perceived by the human visual system are not the same as the physical brightness changes of the light. For instance, when the number of photons hitting the camera sensor doubles, the camera acquires twice the original signal (a linear relationship). Humans, however, have a non-linear visual system and perceive twice the light as only a fraction brighter [111, 40]. Because the human eye and the camera do not perceive brightness in the same way, gamma correction is used to compensate for the difference between how a camera captures content and how our visual system processes light. Figure 2.11 illustrates the difference between linearly and gamma encoded intensities. In the gamma encoded state (Figure 2.11(c)), the changes between two neighboring intensities are very smooth, whereas in Figure 2.11(b) the dark tones are insufficiently represented when linear encoding is used.

Figure 2.11 Comparison between linearly encoded and gamma encoded intensities: (a) original intensity levels, (b) linearly encoded intensity, (c) gamma encoded intensity.

Gamma correction is applied as V_o = V_i^g, where g is the gamma value and V_i and V_o denote the input and output luminance, respectively. When gamma < 1 is applied to the input image, a narrow range of dark input values is mapped onto a wider range of output values, whereas gamma > 1 does the opposite. Figure 2.12 shows the transfer characteristics for three gamma values: 0.45, 1, and 2.2. When we capture an image, the camera or raw development software usually applies an image gamma of 0.45 to the raw image (e.g., for images in the sRGB and Adobe RGB-1998 color spaces), so that the brightness can be coded efficiently. Note that the input to color constancy algorithms is usually a linear raw image.


Figure 2.12 Gamma correction curves [12].

Figure 2.13 illustrates two different images captured from the same scene with different gamma values. Applying gamma correction (gamma < 1) to the image results in a non-linear and brighter image.

Figure 2.13 Linear vs. gamma-corrected image: (a) linear raw image with no gamma correction (gamma = 1); (b) gamma-encoded image, which has brighter pixels than the linear raw image.

2.3 Color constancy problem formulation

Under the general assumption that a single light source illuminates the scene, the observed pixel of an image located at coordinates (x, y) and captured by a trichromatic camera sensor can be modeled using the physical model of Lambertian image formation [55], expressed as follows:

f(x, y) = ∫_λ E(x, y, λ) S(x, y, λ) R(λ) dλ    (2.27)


where E(x, y, λ) is the illuminant spectral power distribution, S(x, y, λ) denotes the surface spectral reflectance, and R(λ) denotes the sensor spectral sensitivities. R and f are three-component vectors containing the r, g, b channels. λ is the wavelength of the light, and the integration is performed over the range of visible light [14, 55]. Assuming that the recorded color of the illuminant depends on the color of the light source E(λ) as well as on the camera sensitivity function R(λ), the color of the illuminant is defined as follows:

E = (E_R, E_G, E_B)^T = ∫_λ E(x, y, λ) R(λ) dλ    (2.28)

A color constancy method aims to estimate the color of the scene illuminant. Since both E(x, y, λ) and R(λ) are unknown, the estimation of the illumination is an under-constrained problem that cannot be solved without further assumptions. Hence, in practice, color constancy algorithms are based on various simplifying assumptions, such as restricted gamuts or assumptions on the color distribution within an image. Section 2.5 discusses different CCC assumptions and methods.

2.4 Image correction

Usually, a color constant image is generated in two steps. In the first step, the illuminant’s chromaticity is estimated, and in the second step, the input image is color corrected based on the estimated illuminant chromaticity.

The input to a CCC algorithm is a three-band RGB digital color image captured under an unknown light source. After illuminant estimation, the image can be color corrected using the global diagonal model of von Kries [21, 42], as shown in Equation 2.29. In this way, the camera responses in the R, G, and B channels are scaled independently by the coefficients E = {E_R, E_G, E_B} to transform the colors of the input image pixels such that the corrected image appears to be taken under a canonical light source. These coefficients are applied to all image pixels [82, 100].

            | 1/E_R    0       0     |
I_c(x, y) = |   0     1/E_G    0     | I(x, y)    (2.29)
            |   0       0     1/E_B  |

where I is the image taken under the unknown light source, I_c is the corrected image under the canonical light source, and {E_R, E_G, E_B} = ∫_λ E(x, y, λ) R(λ) dλ is the estimated color of the light source. The colors under the canonical illuminant then produce a color-constant representation of the scene colors.


2.5 Color constancy methods

A wide range of methods has been proposed to mimic the human color constancy ability in camera systems. These methods can be classified into three main categories: gamut-based, learning-based, and static methods. The first two classes have learnable parameters and require a training phase. In these methods, the rg-chromaticity (2D) space is commonly used instead of the RGB (3D) space, as follows:

r = R / (R + G + B),    g = G / (R + G + B)    (2.30)

If necessary, the implicit blue chromaticity component can easily be recovered:

b = 1 − r − g    (2.31)

The rg-chromaticity space has the benefit that its values are limited between 0 and 1. It also reduces the number of learnable parameters of CNN models and the training time, which can be counted as advantages of this approach.

Static algorithms do not rely on training data; instead, they are based on assumed statistical or physical properties of image formation under a canonical light source [44]. In the next subsections, we explain the key components of these three classes of CC methods in detail.

2.5.1 Static methods

Static methods assume that an image taken under a canonical light source follows certain properties. They rely on reflection models or statistics of low-level image features (e.g., the color distribution). The best-known static algorithm is Grey-World [44], which is based on the assumption that the average color in an image captured under the canonical light source is achromatic (i.e., gray). Consequently, any deviation from achromatic in the average scene color is produced by the effect of the illuminant. This means the color of the illuminant can be estimated by computing the average color of the image [14, 20].

Land et al. [74] proposed the White-Patch method, which is based on the assumption that any image contains at least one white patch and that the maximum responses in the RGB channels are produced by a perfect reflectance of that white patch. Since a white patch with perfect reflectance reflects the full range of the captured illuminant, the color of this perfect reflectance is exactly the color of the illuminant [22, 44].

Another well-known color constancy algorithm relies on the Grey-Edge hypothesis, which is based on the derivative structure of the image [112]. Specifically, this method assumes that the average color of the edges in an image is gray, and that any deviation from gray in the average edge color is produced by the illuminant color. This implies that the illuminant color can be estimated from the average color of the edges in the image.

The aforementioned statistics-based methods were unified into a single framework for given image values f(x, y) by Van der Weijer et al. [112]:

e(n, p, σ) = (1/k) ( ∫∫ |∇^n f_σ(x, y)|^p dx dy )^(1/p)    (2.32)

where x and y are the spatial coordinates in the image, and the integration is performed over all pixel coordinates. k is a multiplicative constant chosen such that the illuminant color e = (R_e, G_e, B_e)^T has unit length (using the L2 norm), n is the order of the derivative, and p is the Minkowski norm. f_σ(x, y) = f(x, y) ∗ G_σ(x, y) is the convolution of the image with a Gaussian filter G_σ(x, y) with scale parameter σ.

The assumptions of the statistics-based methods are mainly based on the distribution of the colors in the image, and each parameter in Equation 2.32 is tuned for the color distribution of the dataset. Using different instantiations of the (n, p, σ) values in Equation 2.32, we obtain different statistics-based estimators of the illuminant. For example, the (n, p, σ) instantiations for the Grey-World and White-Patch algorithms are (0, 1, 0) and (0, ∞, 0), respectively. Some of the methods mentioned above need a preprocessing step to improve performance. For example, the White-Patch algorithm applies some smoothing to the image, and the Grey-World algorithm first segments non-uniform areas within the image and then computes the average color of all segments [112, 44].

Although more elaborate algorithms exist, methods like Grey-World, White-Patch, and their extended versions (such as Max-Edge) are still widely used because of their low computational cost. Table 2.1 gives an overview of the statistics-based methods with their corresponding parameter values and hypotheses.

Algorithm            n  p  σ  Hypothesis
Grey-World           0  1  0  The average reflectance in a scene is achromatic.
White-Patch          0  ∞  0  The maximum reflectance in a scene is achromatic.
Shades of Grey       0  p  0  The pth Minkowski norm of a scene is achromatic.
General Grey-World   0  p  σ  The pth Minkowski norm of a scene is achromatic after local smoothing.
Grey-Edge            1  p  σ  The pth Minkowski norm of the image derivative in a scene is achromatic.
Max-Edge             1  ∞  σ  The maximum reflectance difference in a scene is achromatic.
2nd order Grey-Edge  2  p  σ  The pth Minkowski norm of the second-order derivative in a scene is achromatic.

Table 2.1 Overview of the statistics-based methods with corresponding parameter values and hypotheses.
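The unified framework of Equation 2.32 can be instantiated directly. The sketch below covers the n = 0 settings from Table 2.1 (Grey-World, White-Patch, Shades of Grey, and General Grey-World via optional smoothing); it is a simplified illustration under these assumptions, not the implementation used in this thesis.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def estimate_illuminant(image, p=1, sigma=0):
    """Statistics-based estimate with n = 0 (Eq. 2.32): Minkowski norm of the pixels.

    p = 1 -> Grey-World, p = inf -> White-Patch, other p -> Shades of Grey;
    sigma > 0 adds the local smoothing of General Grey-World.
    """
    img = image.astype(float)
    if sigma > 0:
        img = gaussian_filter(img, sigma=(sigma, sigma, 0))
    pixels = img.reshape(-1, 3)
    if np.isinf(p):
        e = pixels.max(axis=0)
    else:
        e = (pixels ** p).mean(axis=0) ** (1.0 / p)
    return e / np.linalg.norm(e)        # unit-length illuminant color (R_e, G_e, B_e)

img = np.random.rand(64, 64, 3)                  # stand-in for a linear image
print(estimate_illuminant(img, p=1))             # Grey-World
print(estimate_illuminant(img, p=np.inf))        # White-Patch
print(estimate_illuminant(img, p=6, sigma=2))    # General Grey-World-style estimate
```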

Most of these methods assume that the illumination color is one of the scene pixel colors. The Probabilistic Color Constancy (PCC) method [73], in contrast, assumes that the illumination color belongs to the convex hull of the pixel colors present in the scene and can thus be obtained as a convex combination of the scene colors. PCC estimates the scene's illumination by weighting the contributions of different image regions using a graph-based representation of the image. To estimate each superpixel's weight, PCC relies on two assumptions: superpixels with similar colors contribute similarly, and darker superpixels contribute less. The PCC algorithm achieves competitive performance compared to the unsupervised state of the art on the INTEL-TAU dataset [71, 1].

2.5.2 Gamut based methods

The gamut is the range of colors that a device can reproduce, as determined in some suitable color space of three or more dimensions (e.g., sRGB or Adobe RGB) [30]. To evaluate a color against a gamut, the chromaticity diagram is commonly used. However, using the chromaticity diagram may lead to a wrong estimation, since it discards the luminance information [78] (for more details see section 2.2). In reality, color is three-dimensional, and the corresponding set is referred to as a color volume.

Figure 2.14 illustrates the approximate gamut of an image, extracted as the set of all observed RGB values under an unknown light source. This set is convex and is represented by its convex hull.

Figure 2.14 Approximation of the gamut by a convex hull. The gamut of the image is the range of RGBs represented in that image.

Color constancy via gamut mapping uses the gamut of the image to estimate the illuminant color. The gamut mapping approach, first introduced by Forsyth et al. [74], is based on the assumption that, under a given illuminant, an observer can perceive only a limited set of colors in real-world images. Consequently, any deviation from this set of colors is caused by a deviation in the illuminant color. In other words, observing a color outside the canonical gamut shows that the light was something other than the canonical one.

In this method, a training phase is first used to find the canonical gamut C by observing as many images as possible under the canonical illuminant. As a result, the set of all possible surface RGBs under a known, canonical illuminant is formed and represented by its convex hull. The set of all possible RGBs for the unknown light source can be represented in the same way by its convex hull. Under the diagonal assumption of illumination change (e.g., the von Kries model), these two gamuts are related by a unique diagonal mapping. Figure 2.15 illustrates the canonical gamut obtained in the training phase.

Figure 2.15 The canonical gamut (convex hull) formed in the training phase by observing as many images as possible under the canonical light source.

After the set of all possible RGBs for the canonical light source (i.e., the canonical gamut) has been found, a unique diagonal mapping is sought. This diagonal mapping aims to transform the gamut of the input image, taken under an unknown light source, such that it lies completely within the canonical gamut. However, since it is impossible to observe all possible RGBs for the unknown light source, its gamut is estimated from the observed input image pixel values. The resulting gamut for the input image is therefore only a subset of the possible RGBs, and several diagonal mappings can map the input image gamut into the canonical gamut.

Figure 2.16 Color constancy by gamut mapping

Forsyth [74] provides a method for efficiently computing the set of feasible mappings and selecting one among them. Figure 2.17 represents gamut mapping in a two-dimensional space: the x-axis corresponds to r = R/B and the y-axis to g = G/B. Note that in the original work the gamut was represented in a three-dimensional space (RGB), but here we use a two-dimensional space for better visualization.

Figure 2.17(a) represents the canonical gamut, and Figure 2.17(b) consists of three chromaticities labeled a, b, and c. These chromaticities do not correspond to a single surface; each of them represents a different surface. The goal is to find the diagonal mapping that transforms these surfaces' chromaticities to chromaticities under the canonical light.

Chromaticity a can be mapped to any point inside the canonical gamut by a diagonal transformation, so there exists a set of feasible mappings that take pixel a into the canonical gamut (see the thin dark blue polygon in the diagonal mapping space). Chromaticities b and c can similarly be mapped to the canonical gamut, and each of them also forms a different set of possible mappings. The intersection of these sets defines the final feasible set of mappings; the region filled with light blue in Figure 2.17(c) shows this intersection. By iterating this process over all pixels in an image, we obtain a set of mappings consistent with the image data.


Figure 2.17 Diagonal mapping. (a) The canonical gamut. (b) An image consisting of three chromaticities labelled a, b, and c; gamut mapping tries to find the chromaticities under the canonical light corresponding to a, b, and c. (c) The set of diagonal mappings that can be applied to the input gamut to transform it into the canonical gamut.

Finally, the algorithm selects one mapping among all feasible ones and applies it to the canonical illuminant to estimate the illuminant color of the input image. The choice of the final mapping is a point of discussion in the literature; for instance, the original approach [74] used the diagonal matrix with the largest trace, while Barnard et al. [9] used the average or a weighted average of the feasible set of mappings.


The original gamut mapping algorithm has two problems: 1) the diagonal model may fail if the algorithm cannot find a feasible set of mappings, which leads to a null solution, and 2) the algorithm is computationally expensive [44]. To solve these problems, several extensions have been proposed in [33, 35, 89]. Finlayson et al. [35] proposed using the 2D (chromaticity) space instead of the 3D (RGB) space to reduce the complexity of the algorithm. Barnard et al. [9] addressed the failure of the diagonal model by extending the canonical gamut: in addition to the gamut of the canonical light source, the model learns the gamuts of different illuminants mapped to the canonical illuminant using the diagonal model. (See [38, 8, 37, 35, 32] for more details on gamut mapping algorithms.)

2.5.3 Learning-based methods

Most of the classical CCC methods presented in subsections 2.5.1 and 2.5.2 do not generalize well and cannot guarantee sufficient accuracy for all types of images, because they are based on simplifications and error-prone assumptions. Another class of color constancy algorithms is the learning-based methods, which use different machine learning algorithms to estimate the scene's illuminant.

Learning-based algorithms can be classified into two categories according to what they learn: combinatorial and direct methods. The former comprises methods that discover the most suitable combination of statistics-based methods for an input image based on the scene content. In this approach, different scene properties, including scene object semantics [113], 3D objects and scene geometry [83], low-level visual properties [16], and natural image features [43], are extracted from the image content and used to find the best combination for illuminant estimation [44]. Gijsenij et al. [42] analyzed the low-level properties of the images and selected the appropriate method accordingly; for instance, if an image contains few edges, the White-Patch method is preferred over the Grey-Edge method. Bianco et al. [17] presented a classifier that classifies images into indoor, outdoor, or uncertain classes, and a suitable color constancy method is then learned for each of these three groups.

The latter comprises methods that aim to learn the illumination directly from the training data. This group includes probabilistic algorithms [36, 18], selection-based algorithms, SVR-based algorithms [39], exemplar-based algorithms [64], and numerous CNN-based algorithms [21]. These methods can rely on low-level or high-level visual features [44]. Low-level features refer to edges and pixels in the image, while high-level features are hierarchical properties such as object parts and object models.

One of the essential parts of learning-based methods is feature selection. In most of the learning-based approaches mentioned here, feature selection is performed manually. The development of machine learning techniques, especially neural networks, enables CCC algorithms to perform feature selection automatically and to learn the low-level and high-level features of the image simultaneously [100]. Most recent color constancy methods using CNNs formulate the problem as a regression problem [14, 119, 82]. In chapter 3, we review the state-of-the-art CNN methods used for illuminant estimation.


3 Related work

In recent years, following the massive success of deep learning techniques, and especially convolutional neural networks, several accurate and fast CCC algorithms have been proposed. These CNN-based algorithms learn the mapping between the input image and the ground-truth illuminant label while requiring minimal domain knowledge. Several CNN-based methods for illumination estimation operate either on full input images or on small local patches [10, 99, 14, 15, 59, 70].

Barron et al. [10] proposed an algorithm based on the observation that scaling the color channels of an image causes a translation in the Log-Chroma (LC) histogram of that image. Accordingly, they formulated the CCC problem as the localization of a template in LC space, where a detection algorithm is used for illumination estimation. In their method, the input image is first transformed into a set of scale-preserving augmented images using simple image processing operations (e.g., median filters). Each augmented image extracts a different feature of the input image, such as edges, textures, or highlights. These images are then turned into a set of LC histograms, for which convolutional filters are learned to discriminatively estimate a feasible illumination color in the chroma plane.

Lou et al. [82] reformulated the CCC problem as a regression task. In their work, a convolutional network with eight layers is used for illuminant estimation. The network is trained in three steps to deal with the lack of data. In the first step, the model is trained on the ImageNet [29] dataset to learn generic feature hierarchies; ImageNet is useful for many computer vision applications such as object recognition and image classification, but it does not contain illuminant color ground truth. The CNN model is then fine-tuned using CCC labels obtained by running existing state-of-the-art CCC algorithms on the same ImageNet dataset. Finally, the model is trained on a real color constancy dataset to estimate the illuminant color.

The aforementioned CNN algorithms use the full image for illuminant estimation. The first work using CNNs to estimate the illuminant from small image patches was developed by Bianco et al. [14]. In their work, small patches are sampled from the input image and contrast-normalized using a histogram stretching technique. These patches are fed to a CNN consisting of five layers. The network extracts the image's local features and then passes them to a support vector regressor to estimate the illumination color of the given patch [15]. The patch results are then combined to estimate the final illumination of the full image. Using small patches solves the lack-of-data problem, but at the cost of losing semantic information. Ignoring the semantic features may lead to estimation ambiguity, where a patch (such as a textureless wall) does not contain sufficient information for illuminant estimation [15].

In [69], it is argued that spatial information is not essential in the color constancy context, and it is therefore discarded using a Bag of Features (BoF) layer [90]. The resulting Bag of Color Features (BoCF) network comprises three blocks: feature extraction, Bag of Features, and illumination estimation. The first block produces a nonlinear transformation of the scene, the second compiles a histogram representation of this transformation, and the third uses this histogram to approximate the illumination. For extreme samples, the BoCF method fails and leads to high errors. The Monte Carlo Dropout Ensembles (MCDE) model [72] addresses this limitation with a novel scheme to aggregate different CNN-based models based on their confidence in the illumination estimate. The MCDE model estimates the relative uncertainty of each model for a test sample using Monte Carlo dropout; the final illumination estimate is the sum of the different models' estimates weighted by the log-inverse of their corresponding uncertainties.

Shi et al. [99] proposed a network architecture to handle ambiguities in the image and improved the performance of patch-based CNN algorithms for illumination estimation. Their architecture comprises two interacting sub-networks called the Hypotheses Network (HypNet) and the Selection Network (SelNet). The former creates multiple illuminant hypotheses that innately capture several modes of illuminants with its unique two-branch structure. The latter sub-network then selects the final illuminant hypothesis created by one of the branches of HypNet. The global illuminant of the full input image is estimated by median pooling of the local estimates.

Hu et al. [59] proposed FC4, a fully convolutional neural network architecture in which patches within an image can have varying confidence weights according to the value they provide for color constancy estimation. The network's backbone (the first part of the network up to the fifth layer) is an AlexNet pre-trained on ImageNet; this part of the network extracts the semantic features of the image patches. Two relatively large convolutional layers with randomly initialized weights are used as the network's neck and trained together with the pre-trained backbone to output the Semi-Dense (SD) feature maps. The size of the SD feature maps is (w/32 × h/32 × 4), where the first three channels correspond to the color triplet of the illuminant and the fourth channel outputs the confidence weights of the different image regions. Finally, the SD feature maps are fed to a Confidence Weighted Pooling (CWP) layer that combines the local estimates into a global one; the final estimate is simply a weighted-average pooling of all local estimates. An advantage of the FC4 architecture is that it allows images of arbitrary size. Additionally, the CWP layer helps the algorithm distinguish between semantically ambiguous regions (such as textureless walls) and valuable local regions (e.g., with rich texture, faces, or achromatic objects); these semantically valuable, informative patches are used to estimate the global illuminant of the image.

The network comprises several components and custom operations, which makes it poorly suited for mobile and embedded devices. Deploying such a complex algorithm on embedded devices, which requires multiple conversions (to TensorRT [114] or TensorFlow Lite [107]), has proved to be a challenging task.

We utilized an existing state-of-the-art CNN and explored multiple solutions to make the network run at higher speed. Guided by an analysis of the neural network's computational demands, we pruned our baseline model and made it lightweight, with lower computational and storage budgets.
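As an example of such a computational-demand analysis, a rough multiply-accumulate (MAC) count of a standard convolutional layer can be estimated as below. This is a back-of-the-envelope sketch with illustrative numbers, not the exact procedure used in this work.

def conv2d_macs(in_channels, out_channels, kernel_size, out_h, out_w):
    # MACs of a standard 2-D convolution; bias and activation costs are ignored
    return in_channels * out_channels * kernel_size * kernel_size * out_h * out_w

# e.g. a 3x3 convolution mapping 64 -> 128 channels on a 56x56 feature map
macs = conv2d_macs(64, 128, 3, 56, 56)   # about 231 million MACs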


4 Optimization of deep learning models

Artificial Intelligence (AI) and machine learning have had incredible success in all fields of technology thanks to the development of hardware technology and the increase in available big data, which are essential to allow a good training phase for machine learning algorithms. An Artificial Neural Network (ANN) is a computing system that aspires to function as a human brain does and therefore tries to mimic the processes within it. What has prompted us to study these models in depth is their ability to model very complex nonlinear functions. The elementary components of these models are neurons, which are organized in layers. In a nutshell, a neuron receives inputs x_j, computes a weighted sum of them, compares it with a threshold b, and produces an output y which is then transmitted to other neurons:

y =
\begin{cases}
0 & \text{if } \sum_j w_j x_j + b \le 0 \\
1 & \text{if } \sum_j w_j x_j + b > 0
\end{cases}
\qquad (4.1)

The cost or loss function, fundamental in the ANN training phase, measures how close the output is to the desired result. There are many methods to minimize the loss, the most widely used of which is Gradient Descent (GD). The GD optimizer computes the gradient of the loss function with respect to the parameters and updates the parameters in the opposite direction:

v_i = \gamma v_{i-1} + \eta \nabla_\theta L_{all}(\theta) \qquad (4.2)

\theta = \theta - v_i \qquad (4.3)

with \theta being the parameters, \nabla_\theta L_{all}(\theta) the gradient of the loss with respect to the parameters, \eta the learning rate, i the iteration index, and \gamma the momentum hyperparameter. Note that the momentum term adds a fraction of the previous time step's update vector, helping to reduce oscillations and increase the convergence speed.
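A single update following Eqs. (4.2)-(4.3) can be written as follows; this is a minimal sketch with illustrative variable names and default values.

def momentum_step(theta, velocity, grad, lr=0.01, gamma=0.9):
    # grad is the gradient of the loss with respect to theta at the current iteration
    velocity = gamma * velocity + lr * grad   # Eq. (4.2): accumulate a fraction of the past update
    theta = theta - velocity                  # Eq. (4.3): move against the gradient direction
    return theta, velocity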

Deep learning refers to a particular kind of ANN. The main feature of such a network is that the number of hidden layers is greater than one. In Deep Neural Networks (DNN) applied to image processing, convolutions have replaced simple matrix multiplications. The convolution operator in ANNs makes it possible to considerably reduce the number of variables in a layer and transforms the images into feature maps, decreasing the computational complexity and the memory resources needed. The layer in which a convolution is computed is called the convolutional layer, and in it, the input is convolved with multiple matrices, called filters or kernels.

This chapter presents the key theoretical concepts behind deep learning models and the optimization techniques for convolutional networks. Section 4.1 presents general network architectures. Section 4.2 introduces different optimization methods such as regularization, dropout, and batch normalization, and finally, Section 4.3 introduces the methods and main components of network pruning.

4.1 Network architectures

The convolutional neural network is the most widely used deep learning model for feature learning in large-scale computer vision tasks. A CNN generally consists of three types of layers, i.e., the convolutional layer, the sub-sampling (pooling) layer, and the fully-connected layer. The convolutional layer uses the convolution operation to achieve weight sharing, while sub-sampling is used to decrease the dimension. The sub-sampling can typically be performed by an average pooling or a max-pooling operation. Afterward, multiple fully-connected layers and a softmax layer are typically placed on top for classification, recognition, or other computer vision tasks. The deep learning algorithm consists of a hierarchical architecture with several layers, each of which constitutes a non-linear information processing unit.
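A minimal PyTorch sketch of this convolution / pooling / fully-connected / softmax layout is given below; the layer sizes are arbitrary and chosen only for illustration, assuming a 224×224×3 input.

import torch.nn as nn

tiny_cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # convolutional layer (weight sharing)
    nn.ReLU(),
    nn.MaxPool2d(2),                              # sub-sampling by max pooling
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                              # 224 -> 112 -> 56 spatial resolution
    nn.Flatten(),
    nn.Linear(32 * 56 * 56, 1000),                # fully-connected classifier head
    nn.Softmax(dim=1),                            # class probabilities
)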

With the rapid development of computation techniques, CNN models employ deep architectures and represent functions of higher complexity. Krizhevsky et al. [67] introduced an architecture named AlexNet and won the ILSVRC-2012 [96] competition. This architecture was designed to classify the ImageNet dataset [29] (containing over one million images of size 227×227×3) into 1000 different classes. The main advantage of AlexNet compared to previous works was the increased depth of the model. Using multiple GPUs to have enough memory to train the network improved performance and simultaneously reduced training time. AlexNet consists of eight layers: five convolutional layers (C1, C2, C3, C4, C5), followed by Rectified Linear Units (ReLU) as activation functions, three max-pooling layers (after C1, C2, and C5), and three fully-connected layers, where the last one is a softmax layer that gives the probability of the image belonging to a certain class.

The VGG model [101], which came after AlexNet, is deeper. What makes it particular is the use of much simpler hyper-parameters for the convolutional layers. The convolution filters' size is mostly 3×3 with a stride of 1, and the spatial padding of the convolutional layer input is such that the spatial resolution is preserved after convolution. The number of channels is doubled after each pooling layer, except for the last convolutional block. The network's input size is 224×224×3 and, as in AlexNet, the last fully-connected layer is fed to a 1000-way softmax for the classification task.
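The repeated pattern described above can be sketched as a reusable block; this is an illustrative PyTorch sketch, not the original VGG code.

import torch.nn as nn

def vgg_block(in_channels, out_channels, num_convs=2):
    # stacked 3x3 convolutions with stride 1 and padding 1 (spatial resolution preserved),
    # followed by 2x2 max pooling, after which the channel count is typically doubled
    layers = []
    for i in range(num_convs):
        layers.append(nn.Conv2d(in_channels if i == 0 else out_channels,
                                out_channels, kernel_size=3, stride=1, padding=1))
        layers.append(nn.ReLU(inplace=True))
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# e.g. the first two stages of a VGG-like backbone for a 224x224x3 input
stage1 = vgg_block(3, 64)     # 224 -> 112 after pooling
stage2 = vgg_block(64, 128)   # 112 -> 56, channels doubled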

Many computer vision tasks do not achieve good accuracy when we use shallow networks.
