Deep Neural Networks for Image Denoising


Master’s Degree Programme in Information Technology Computing and Electrical Engineering Supervisor: Prof. Karen Egiazarian (Eguiazarian) 29th March 2020


ABSTRACT

Pham Huu Thanh Binh: Deep Neural Networks for Image Denoising Master’s Degree Programme in Information Technology

Tampere University

Major: Data Engineering and Machine Learning March 2020

This master's thesis introduces non-local and learning-based denoising methods and proposes a new method, called FlashLight CNN, for denoising gray-scale images corrupted by additive white Gaussian noise (AWGN). The proposed method combines deep convolutional and inception networks, which improves the learning capacity of deep neural networks by addressing typical problems of training them. The proposed method demonstrates state-of-the-art performance in both quantitative and visual evaluations.

Keywords: neural network, denoising, deep learning, machine learning, image processing, learning-based

The originality of this thesis has been checked using the Turnitin OriginalityCheck service.


PREFACE

This thesis work was conducted at Noiseless Imaging Oy. The thesis was mostly written in the office on the 8th floor of the Kampusareena building.

I would like to thank my supervisors, Prof. Karen Egiazarian and Prof. Alessandro Foi, for the opportunities and support they have given me. Special thanks go to Cristovao Cruz for his guidance and support during the thesis work. I really appreciate his help.

I also want to thank my colleagues from Noiseless Imaging Oy and my friends for their support and for spending time with me during my thesis work and studies.

I would like to thank my parents, my brother and Janina for their motivation and encouragement.

, Pham Huu Thanh Binh


LIST OF FIGURES

2.1 A simple perceptron model
2.2 A simple Artificial Neural Network
2.3 A typical working process of Artificial neural networks
2.4 The first CNN by LeCun
2.5 The concept of the subsampling methods
2.6 The fundamental unit of VGGNet
2.7 A simple VGGNet block
2.8 Networks in networks comparison
2.9 Typical plain and residual network structures
2.10 Inception naive module
2.11 The comparison of Inception-V4 with other popular Neural Networks [7]
2.12 The comparison of the accuracy of popular Neural Networks
2.13 Typical Res-Inception structures
2.14 The basic Resnet block
2.15 The basic Dropout Resnet block
2.16 The selected state-of-the-art denoising methods’ structure
2.17 An illustration of the non-local mean algorithm
2.18 An illustration of the block-matching algorithm [14]
2.19 A process flow diagram of BM3D-SAPCA [16]
2.20 Early proposed CNN architecture
2.21 DnCNN architecture
2.22 An overview of the GAN-CNN Based Blind Denoiser (GCBD) architecture; X, Y are the paired training dataset [8]
2.23 The detailed architecture of Multi-level Wavelet-CNN (MWCNN) [43]
2.24 The flow of BMCNN
2.25 The general architecture of BMCNN
2.26 The flow of NN3D
2.27 The general architecture of BM3D-Net
3.1 The warm-up phase
3.2 Inception layer used in the proposed architecture
3.3 The proposed FlashLight CNN (FLCNN) architecture
3.4 Validation performance vs number of parameters, when the number of parameters of the Neural Network (NN) increases from one to about two million. The performance of the DnCNN-like network decreases drastically, while FLCNN sees increased performance
3.5 A sample image from the DIV2K dataset [1]
3.6 Separating blocks from Figure 3.5
4.1 Set12 [48]
4.2 Denoising results for house in Set12 [48], with σ = 25
4.3 Denoising results for boat in Set12 [48], with σ = 25
4.4 Denoising results for airplane image in Set12 [48], with σ = 50


LIST OF TABLES

3.1 The configuration of the proposed network is selected through different experiments based on Peak signal-to-noise ratio (PSNR) and the number of trainable parameters
4.1 Performance comparison in terms of PSNR on Set12, BSD68 and Urban100 with noise levels of 15, 25, 50
4.2 Performance comparison in terms of PSNR and SSIM for Set12 with σ = 15
4.3 Performance comparison in terms of PSNR and SSIM for Set12 with σ = 25
4.4 Performance comparison in terms of PSNR and SSIM for Set12 with σ = 50


LIST OF SYMBOLS AND ABBREVIATIONS

Adam         Adaptive Moment Estimation
ANN          Artificial Neural Networks
AWGN         additive white Gaussian noise
BM3D         Block-matching and 3D filtering
BM3D-Net     Network inspired by the BM3D method
BM3D-SAPCA   BM3D Image Denoising with Shape-Adaptive Principal Component Analysis
BMCNN        Block-Matching Convolutional Neural Network
BN           Batch Normalization
CNN          Convolutional Neural Network
CNNF         CNN-based filter
DCNN         Deep convolutional neural networks
DiCNN        Deep Inception convolutional neural network
DnCNN        Denoising convolutional neural network
FLCNN        FlashLight CNN
GAN          Generative Adversarial Networks
GCBD         GAN-CNN Based Blind Denoiser
GPUs         Graphics Processing Units
ML           Machine Learning
MLP          Multilayer Perceptron
MWCNN        Multi-level Wavelet-CNN
N2N          Noise2Noise
NIN          Network In Network
NLF          nonlocal filter
NLM          Non-local mean
NLS          Non-local self similarity
NN           Neural Network
NN3D         Neural Network and 3D collaborative filter
PSNR         Peak signal-to-noise ratio
ReLU         Rectified Linear Unit
SGD          Stochastic Gradient Descent
SSIM         structural similarity index measure
WNNM         Weighted Nuclear Norm Minimization


1 INTRODUCTION

One of the fundamental problems in image processing is image denoising, which aims to recover an image from a noisy observation. Denoising is involved in different stages of image processing, such as pre-processing, post-processing and external filters. Implementing denoising techniques in the pre-processing stage improves the quality of the given images and the performance of the subsequent processing. This stage is normally implemented in image demosaicing, sharpening, compression and machine learning tasks related to images. In the post-processing stage, image denoising helps prevent artifacts such as contouring and ringing. As an external filter, denoising plays an essential role in optimization and inverse-imaging problems [11]. Different denoising methods have been proposed over the past decades [19], [17], [44]. Some of these algorithms, such as BM3D [13] and WNNM [25], are considered standard, as they produce excellent results. Recently, with the rapid advancement of machine learning, and deep learning in particular, image denoising techniques have made great progress and achieve competitive or even better performance than model-based denoising methods [40].

Deep learning is a branch of NN algorithms in which the networks are composed of an input layer, multiple hidden layers and an output layer. Deep NNs have recently achieved outstanding performance in the computer vision field. Among deep learning approaches to image denoising, numerous methods, referred to as learning-based methods, have attained excellent denoising results [72], [11].

In this thesis, a new method for image denoising based on deep neural networks, called FlashLight CNN, is proposed. It is designed to address the typical problems of training deep neural networks and to improve on the denoising performance of state-of-the-art deep learning methods. Quantitative and visual evaluations of the proposed method are also presented.

This thesis is organized in five chapters. The first chapter is the introduction. The second chapter is the theoretical background, which introduces the fundamental concepts and advanced techniques of neural networks and the essential state-of-the-art image denoising methods based on non-local and learning-based strategies. Chapters 3 and 4 present the proposed image denoising method and its evaluation results. The last chapter is the conclusion, which summarizes the main content of the thesis and suggests several strategies to improve the proposed implementation in practice and in future work.


2 THEORETICAL BACKGROUND

This chapter introduces the theoretical background of the thesis topic. Firstly, the essential and advanced concepts of neural networks are presented. CNN-based architectures and advanced training techniques are introduced and discussed, since the proposed denoising method is learning-based.

Secondly, the chapter continues with image denoising. In this part, state-of-the-art denoising methods based on non-local methods, learning-based methods and the combination of the two are introduced. Finally, the advantages and disadvantages of non-local and learning-based denoising approaches are discussed.

2.1 Neural Networks

An Artificial Neural Network (ANN) is a computational framework that is strongly inspired by the human brain [49] and is formed from stacked layers of interconnected nodes. Neural networks are widely used in various scientific fields such as computer vision [37], speech recognition [24] and natural language processing [10]. This section shows the basic structure of a simple NN and different modern NN architectures, in particular CNN architectures. Additionally, several state-of-the-art methods for training neural networks are reviewed.

2.1.1 Simple Neural Network Model

The fundamental unit that allows neural nodes to learn is known as the perceptron. Each neural node receives external signals and applies an activation function to their weighted sum, using its learned weights and an optional bias. This process is described in Figure 2.1, where the output of the model is calculated as

$$y = f\left(\sum_{i=1}^{n} w_i x_i + b\right),$$

where $x_1, x_2, \ldots, x_n$ are the input signals, $w_1, w_2, \ldots, w_n$ are the learned weights, $b$ is the optional bias and $f(\cdot)$ is an activation function (such as sigmoid [46] or ReLU [46]). An artificial neural network is built from a trainable Multilayer Perceptron (MLP). A simple ANN with 3 layers is illustrated in Figure 2.2.
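As a minimal sketch of the perceptron output defined above, the following Python snippet evaluates the weighted sum plus bias and applies an activation. The function names and the numeric inputs are illustrative assumptions, not taken from the thesis implementation.

```python
import math


def sigmoid(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    """Rectified linear unit activation."""
    return max(0.0, z)


def perceptron(x, w, b, activation=sigmoid):
    """Compute y = f(sum_i w_i * x_i + b) for a single node."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return activation(z)


# Illustrative values (assumed): the weighted sum plus bias is exactly 0.
x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 0.25]
b = 0.75
y_sigmoid = perceptron(x, w, b)                  # sigmoid(0) = 0.5
y_relu = perceptron(x, w, b, activation=relu)    # relu(0) = 0.0
```

With the same weighted sum, the choice of activation alone determines the output, which is why the activation function is treated as a separate design choice.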

The perceptron’s weights are randomly initialized. They connect the values of the previous layer to the next layer or to the network’s output. The weight values are adaptively estimated through multiple forward and back propagation passes until they reach values that allow the neural network to produce acceptable outputs.

The ANN algorithm is illustrated in Figure 2.3 and summarized in the following 5 steps:

• Step 1: The weights of the neural nodes ($\theta$) are randomly initialized by a specific method, such as a zero-mean Gaussian distribution [37] or orthonormal matrix initialization [55]. A suitable initialization of the neural weights improves the accuracy and training speed of the model [66].

• Step 2: Given the current weights of the network, forward propagation calculates the outputs for the given inputs.

• Step 3: A loss function $J(\theta)$ estimates the distance between the outputs calculated in Step 2 and the expected outputs.

• Step 4: The gradient $\nabla_\theta J(\theta)$ of the loss with respect to the weights of the network is estimated through a back propagation process.

• Step 5: The weights are adjusted at each iteration step. There are several methods to update the neural weights. The most popular one is gradient descent [52], where the values of $\theta$ are moved against the gradient after each iteration step:

$$\theta = \theta - \eta \, \nabla_\theta J(\theta),$$

where $\theta$, $\eta$ and $\nabla_\theta J(\theta)$ are the weight values, the learning rate and the gradient with respect to $\theta$, respectively.

The neural weights are updated through multiple iterations until the model has converged. The number of iterations depends on how the learning rate and the optimization method are set.
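The five steps can be sketched on the smallest possible model, a single linear neuron trained with a mean squared error loss. The toy data, the learning rate and the function name below are assumptions chosen for illustration only.

```python
import random


def train_linear_neuron(data, lr=0.05, epochs=2000, seed=0):
    """Fit y = w * x + b by plain gradient descent on the mean squared error."""
    rng = random.Random(seed)
    # Step 1: random initialization (small zero-mean Gaussian weight).
    w = rng.gauss(0.0, 0.1)
    b = 0.0
    n = len(data)
    for _ in range(epochs):
        # Step 2: forward propagation.
        preds = [w * x + b for x, _ in data]
        # Step 3: the loss is J = mean((pred - target)^2); its value is only
        # needed for monitoring, so it is computed once after training.
        # Step 4: gradients of J with respect to w and b.
        grad_w = sum(2.0 * (p - t) * x for p, (x, t) in zip(preds, data)) / n
        grad_b = sum(2.0 * (p - t) for p, (_, t) in zip(preds, data)) / n
        # Step 5: gradient descent update, theta = theta - eta * gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    loss = sum((w * x + b - t) ** 2 for x, t in data) / n
    return w, b, loss


# Toy targets generated from y = 2x + 1 (assumed example data).
data = [(x, 2.0 * x + 1.0) for x in (-1.0, 0.0, 1.0, 2.0)]
w, b, loss = train_linear_neuron(data)
```

In a real network, Step 4 is carried out by back propagation through all layers; here the gradients are written out by hand because the model has only two parameters.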

Figure 2.1. A simple perceptron model.

Figure 2.2. A simple Artificial Neural Network composed of three layers. The input layer has four nodes, and each node of a layer is connected to all nodes of the next layer.

Figure 2.3. A typical working process of ANNs: (1) random initialization of the neural weights, (2) forward propagation, (3) calculation of the loss function, (4) back propagation, (5) update of the neural weights.

2.1.2 Convolutional Neural Networks

CNNs gained popularity in the 1990s when LeCun introduced the first CNN version, trained by a gradient-based learning algorithm, to perform handwritten digit recognition [38]. CNN architectures combine three ideas: local receptive fields, shared weights and spatial subsampling. These ideas are realized by two types of layers, convolutional and subsampling ones. The units of the convolutional and subsampling layers are organized into multiple planes called feature maps. All units located in a feature map perform the same operation on different local receptive fields of the image. The feature extraction used by CNNs is inspired by the human visual system and relies on the concept of "local receptive fields". With local receptive fields, neurons can extract elementary visual features from images, such as oriented edges and corners. The features extracted by one layer are combined by subsequent layers to obtain higher-level features. Each convolutional layer is formed from a stack of different feature maps, which allows multiple different features to be extracted from a specific local receptive field. A typical CNN architecture is shown in Figure 2.4.

Figure 2.4. The first CNN by LeCun et al. (LeNet-5). The dimensions of the input data are 32x32 pixels. There are three convolutional layers, C1, C3 and C5, that are formed from 6, 16 and 120 feature maps. There are two subsampling layers, S2 and S4, that are formed from 6 and 16 feature maps. The last layer is a fully connected layer, F6, which is responsible for processing the classification task [38].

In convolutional layers, each convolutional unit is connected to a small input region in the previous layer, called the local receptive field. The local receptive fields of the previous layer generate the corresponding feature maps through convolution operations. All convolutional units in the same feature map extract the same feature from different parts of the whole image [38].

In the subsampling or pooling layers, the feature maps are downsampled, which reduces their resolution and provides a degree of spatial invariance. For example, in Figure 2.4, the subsampling layer S2 downsamples the feature maps of C1 from 28x28 to 14x14, and the subsampling layer S4 downsamples the feature maps of the preceding convolutional layer to 5x5. Two typical subsampling methods are average pooling and max pooling. The concepts of these subsampling methods are shown in Figure 2.5.
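The two pooling operations can be sketched in a few lines of Python. The 4x4 feature map below is a made-up example rather than the one drawn in Figure 2.5, and `pool2d` is a hypothetical helper name.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Downsample a 2-D feature map (list of rows) with max or average pooling."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, stride):
        pooled_row = []
        for c in range(0, cols - size + 1, stride):
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            if mode == "max":
                pooled_row.append(max(window))
            elif mode == "average":
                pooled_row.append(sum(window) / len(window))
            else:
                raise ValueError("mode must be 'max' or 'average'")
        pooled.append(pooled_row)
    return pooled


# Assumed example feature map.
fmap = [
    [1, 3, 2, 4],
    [5, 7, 6, 8],
    [4, 2, 0, 1],
    [3, 1, 2, 5],
]
max_pooled = pool2d(fmap, mode="max")       # [[7, 8], [4, 5]]
avg_pooled = pool2d(fmap, mode="average")   # [[4.0, 5.0], [2.5, 2.0]]
```

With a 2x2 kernel and stride 2, each spatial dimension is halved, which matches the reduction from 28x28 to 14x14 performed by S2.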

After a sequence of convolutional and subsampling layers, the last layer is the classification layer, which performs the classification task.

All the weights of the CNN are trained by back-propagation, so a CNN can be considered a model that synthesizes its own feature extractor [38]. Building on the mechanism of the first CNN architecture, several state-of-the-art deep CNN architectures are discussed in the next part.

Figure 2.5. The concept of the subsampling methods: average pooling and max pooling, each with a 2x2 kernel and stride = 2.

2.1.3 CNN network architectures

This section introduces state-of-the-art CNN architectures that are mostly used for image denoising as well as the proposed CNN architectures implemented for image denoising.

2.1.3.a VGGNet Network

VGGNet [56] is a deep neural network that achieved top positions in the ImageNet Large Scale Visual Recognition Challenge 2014 (ILSVRC2014) [53]. It demonstrates a positive relationship between the depth of a network and its performance, as increasing the depth of the neural model yields better accuracy.

The fundamental unit of VGGNet consists of one or multiple stacked convolutional layers followed by an activation function such as ReLU, as shown in Figure 2.6.

A sequence of such fundamental units followed by a max pooling layer forms a VGG block, as shown in Figure 2.7.

Finally, a basic VGGNet architecture is built from VGGNet blocks and ends with dense layers followed by a soft-max function [56]. The VGGNet design concepts have inspired further deep CNN architectures such as ResNet [28] and DnCNN [72].

Figure 2.6. The fundamental unit of VGGNet.

Figure 2.7. A simple VGGNet block.

2.1.3.b Network in Network

Network In Network (NIN) is a neural structure that allows discriminative models to extract more features in local receptive fields and helps prevent overfitting [42]. NIN can be considered as a micro-network, constructed from multilayer perceptrons, between the local patches of the previous neural layer and the corresponding convolutional output. As shown in Figure 2.8.a, the traditional convolutional layer is located in the middle and directly takes features from the previous layer to pass to the next layer, whereas in Figure 2.8.b there is a micro-network between the previous layer and the next convolutional output. The idea of inception networks, which is introduced later in this section, is strongly inspired by the NIN architecture.


2.1.3.c Residual Learning Network

The purpose of the residual learning network is to tackle the main problems of training very deep neural networks, such as vanishing and exploding gradients. Such problems make gradient values extremely small or extremely large, which negatively affects the training accuracy [28].

A residual learning network allows the model to scale up by stacking a large number of network layers while maintaining the learning capacity of the model. Instead of directly stacking multiple layers on top of each other, the residual approach uses a shortcut connection that adds the input of one or multiple stacked layers to their output. Denoting the input as $x$, the layers as $F$ and the output as $y$, the output of a residual block is $y = F(x) + x$ (Figure 2.9.b) instead of $y = F(x)$ (Figure 2.9.a). It is also important to ensure that the output and the input of the residual layers have the same dimensions, so that the shortcut connection can perform the element-wise addition [28].

The residual learning concept contributes significantly to very deep CNN architectures such as DenseNet [31], DnCNN [72] and Inception CNN [60].
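The difference between a plain mapping $y = F(x)$ and a residual mapping $y = F(x) + x$ can be sketched as follows. The `toy_layer` function is a hypothetical stand-in for the stacked weight layers $F$; it is not a layer from the cited architectures.

```python
def toy_layer(x, weight=0.1):
    """Hypothetical stand-in for F: scale every element, then apply ReLU."""
    return [max(0.0, weight * xi) for xi in x]


def plain_block(x, layer):
    """Plain stacking: y = F(x)."""
    return layer(x)


def residual_block(x, layer):
    """Residual learning: y = F(x) + x, which needs matching dimensions."""
    fx = layer(x)
    if len(fx) != len(x):
        raise ValueError("F(x) and x must have the same dimensions")
    return [f + xi for f, xi in zip(fx, x)]


x = [1.0, -2.0, 3.0]
y_plain = plain_block(x, toy_layer)        # approximately [0.1, 0.0, 0.3]
y_residual = residual_block(x, toy_layer)  # approximately [1.1, -2.0, 3.3]

# If F outputs zeros, the residual block reduces to the identity mapping.
y_identity = residual_block(x, lambda v: toy_layer(v, weight=0.0))
```

The last line illustrates why very deep residual stacks remain trainable: a block whose layers contribute nothing still passes its input through unchanged.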

2.1.3.d Inception networks

Inception networks have gained popularity since 2014, when the deep convolutional network called "Inception" achieved the top position in the ILSVRC 2014 competition [53]. Inception networks allow the models to significantly reduce the number of training parameters while maintaining the depth of the networks [60]. The main purpose of the Inception modules is to find an optimal local sparse structure that can be approximated by readily available dense components and repeated across the network [60]. The naive inception module in Figure 2.10 exploits the learning capacity by making the layers wider: it concatenates the outputs of convolutions with kernel sizes 1x1, 3x3 and 5x5 and of max pooling with a 3x3 kernel. Inception networks have been developed through several versions. Currently, the latest ones are Inception-v4 and Inception-ResNet [59].

Figure 2.8. The traditional convolutional network (a) and the convolutional network in network (b) [42].

In these versions, the size of each Inception block is uniformly set, and the updated versions produce better performance and efficiency, as shown in Figure 2.11 and Figure 2.12 [7]. Figure 2.11 compares the accuracies of the most relevant entries submitted to the ImageNet challenge with those of other popular architectures. Figure 2.12 shows the computational costs along with the accuracy values of different model architectures.

As shown in these figures, Inception-V4 achieves high accuracy while consuming an acceptable amount of computing resources [7].

For Inception-V4, we focus on Inception-ResNet, as this neural structure is utilized in the proposed denoising network. Two typical Inception-ResNet structures are shown in Figure 2.13. In both, a residual shortcut connection links the input from the previous layer with the output aggregated from the inception branches and combines them. Each branch performs different operations, composed of stacked convolutions with different kernel sizes.

Figure 2.9. Typical plain (a) and residual (b) network structures. In (b), the shortcut connection adds the input x to the output f(x) of the weight layer, giving y = f(x) + x.

Figure 2.10. Inception naive module: 1x1 convolutions, 3x3 convolutions, 5x5 convolutions and 3x3 max pooling are applied to the previous layer, and their outputs are joined by filter concatenation.

Figure 2.11. The comparison of Inception-V4 with other popular Neural Networks [7]

Since the res-inception module implements a residual shortcut to improve training performance [59], the input and the output of the operating block must have the same dimensions, which can be ensured by a 1x1 convolution. The 1x1 convolution is also used to reduce the dimension (the number of channels) of the input before the next convolutions are performed. In Figure 2.13.a, two 3x3 convolutions stacked on top of each other act as a 5x5 convolution. This technique improves the learning capacity of the module, since kernels of different sizes are aggregated, and prevents the loss of learned features caused by dimension reduction [61]. In Figure 2.13.b, a 1 x n kernel following an n x 1 kernel acts as an n x n convolution; with this method, the number of parameters is considerably reduced.
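The parameter savings of these two factorizations can be checked with simple arithmetic. The sketch below counts convolution weights without biases; the channel count of 64 and the kernel size n = 7 are assumptions picked for illustration, not values from the thesis.

```python
def conv_params(kernel_h, kernel_w, in_channels, out_channels, bias=False):
    """Number of trainable parameters of one 2-D convolutional layer."""
    params = kernel_h * kernel_w * in_channels * out_channels
    if bias:
        params += out_channels
    return params


channels = 64  # assumed number of input and output feature maps

# Two stacked 3x3 convolutions cover the same 5x5 receptive field.
single_5x5 = conv_params(5, 5, channels, channels)        # 102400
stacked_3x3 = 2 * conv_params(3, 3, channels, channels)   # 73728

# An n x 1 kernel followed by a 1 x n kernel covers an n x n receptive field.
n = 7
full_nxn = conv_params(n, n, channels, channels)          # 200704
factorized = (conv_params(n, 1, channels, channels)
              + conv_params(1, n, channels, channels))    # 57344
```

Under these assumptions the stacked 3x3 pair needs 18/25 of the weights of a 5x5 layer, and the 1 x n and n x 1 pair needs 2/n of the weights of an n x n layer, so the saving grows with the kernel size.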

2.1.3.e Wide Residual Networks

Diminishing feature reuse, which refers to the decreasing ability of a network to learn useful features as its depth grows, is a typical problem of deep residual networks. Zagoruyko and Komodakis [70] propose to address diminishing feature reuse by widening the ResNet blocks of the residual network. Widening the residual network provides more benefit than increasing its depth, since it reduces the number of layers, training parameters and training time. The method builds on increasing the representational power of the basic ResNet block [29], which typically uses 3x3 filters and is illustrated in Figure 2.14.

To enhance the learning capacity of the ResNet block, three strategies are implemented: adding more convolutional layers, increasing the number of feature maps of the convolutional layers, and increasing the kernel sizes.

The method also suggests a new approach to prevent overfitting by applying dropout inside the ResNet block. The resulting architecture is shown in Figure 2.15.

Figure 2.12. The comparison of the accuracy of popular Neural Networks along with the number of computing operations [7]

2.1.4 Advanced techniques of training neural networks

2.1.4.a Data Augmentation

One of the efficient methods to reduce overfitting during training is to enlarge the training dataset by image transformation techniques. During the training process, the data is slightly modified by real-time data augmentation to increase the amount of learning data. For example, new training images are created by horizontal and vertical flipping, rotating and cropping. This technique gives the model more data to learn from, but does not require additional physical computational resources.

Figure 2.13. Typical Res-Inception structures: (a) with stacked 3x3 convolutions, (b) with 1 x n and n x 1 convolutions.

Figure 2.14. The basic Resnet block; $x_l$ and $x_{l+1}$ are the input and the output of the l-th unit in the network [70].

Figure 2.15. The basic Dropout Resnet block; $x_l$ and $x_{l+1}$ are the input and the output of the l-th unit in the network [70].
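The augmentation operations named in Section 2.1.4.a (flipping, rotating and cropping) can be sketched on images stored as nested lists. The helper names and the 50% flip probability are assumptions for illustration; they do not describe the augmentation pipeline actually used for the proposed network.

```python
import random


def flip_horizontal(img):
    """Mirror every row (left-right flip)."""
    return [row[::-1] for row in img]


def flip_vertical(img):
    """Reverse the order of the rows (up-down flip)."""
    return [row[:] for row in img[::-1]]


def rotate90(img):
    """Rotate the image by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def random_crop(img, size, rng):
    """Cut a random size x size patch out of the image."""
    top = rng.randint(0, len(img) - size)
    left = rng.randint(0, len(img[0]) - size)
    return [row[left:left + size] for row in img[top:top + size]]


def augment(img, crop_size, rng):
    """Create one randomly transformed training patch on the fly."""
    patch = random_crop(img, crop_size, rng)
    if rng.random() < 0.5:
        patch = flip_horizontal(patch)
    if rng.random() < 0.5:
        patch = flip_vertical(patch)
    for _ in range(rng.randrange(4)):  # 0 to 3 quarter turns
        patch = rotate90(patch)
    return patch


rng = random.Random(0)
image = [[r * 4 + c for c in range(4)] for r in range(4)]  # toy 4x4 image
patch = augment(image, crop_size=2, rng=rng)
```

Because a new patch is generated each time `augment` is called, the enlarged dataset never has to be stored; only the original images are kept.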

2.1.4.b Deep Neural Networks initialization

The initialization of the neural weights is the first and a crucial step for the proper training of a neural network [68]. Lacking a proper network initialization causes vanishing or exploding gradients or negatively affects training convergence [36]. The general method for initializing neural weights is a small random initialization [26]. However, very deep networks require more sophisticated initialization strategies to address the typical difficulties of training deep models [28, 55]. In this section, several efficient initialization methods, namely Xavier [21], He [27], Orthogonal [45] and LSUV initialization [67], are discussed. Xavier initialization [21] protects the models from the vanishing and exploding gradient problems, since it keeps the initial values of the weights from being too small or too large across the layers. Given $W_i$, $n_{in}$ and $n_{out}$ as the weights and the numbers of inputs and outputs of the layer, respectively, the neural weights are drawn from a normal distribution with mean 0 and variance

$$\mathrm{Var}(W_i) = \frac{2}{n_{in} + n_{out}},$$

which is designed for the sigmoid activation function [46]. By modifying the variance of the weight distribution of Xavier initialization, He [27] developed a specific initialization for networks with the ReLU activation function [46]:

$$\mathrm{Var}(W_i) = \frac{2}{n_{in}}.$$

Orthogonal initialization [45] addresses the typical problems of deep learning networks and allows the models to learn more features from the training data, thanks to its gradient norm-preserving property and to the de-correlation of the neural layers [45]. Later, an extended version of orthogonal initialization, called Layer-sequential unit-variance (LSUV) initialization, was introduced in [67]; it normalizes each layer to unit variance after performing orthogonal initialization.
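The Xavier and He variances above translate directly into the standard deviation of the normal distribution from which the weights are drawn. The following sketch uses only the Python standard library; the layer size of 128 and the helper names are illustrative assumptions.

```python
import math
import random


def xavier_std(n_in, n_out):
    """Standard deviation for Var(W) = 2 / (n_in + n_out)."""
    return math.sqrt(2.0 / (n_in + n_out))


def he_std(n_in):
    """Standard deviation for Var(W) = 2 / n_in."""
    return math.sqrt(2.0 / n_in)


def init_weights(n_in, n_out, std, rng):
    """Draw an n_out x n_in weight matrix from a zero-mean normal distribution."""
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]


def sample_variance(weights):
    """Empirical variance of all entries of a weight matrix."""
    flat = [w for row in weights for w in row]
    mean = sum(flat) / len(flat)
    return sum((w - mean) ** 2 for w in flat) / len(flat)


rng = random.Random(0)
n_in, n_out = 128, 128  # assumed layer size
xavier_weights = init_weights(n_in, n_out, xavier_std(n_in, n_out), rng)
he_weights = init_weights(n_in, n_out, he_std(n_in), rng)
xavier_var = sample_variance(xavier_weights)  # close to 2 / 256
he_var = sample_variance(he_weights)          # close to 2 / 128
```

For a square layer the He variance is exactly twice the Xavier variance, which compensates for ReLU zeroing roughly half of the activations.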

2.1.4.c Learning Rates

The learning rate is an essential training parameter that determines the size of the adjustment added to or subtracted from the weights at each training step [4]. Setting a suitable learning rate can significantly improve training performance: a learning rate that is too large may cause the model to fail to converge or even to diverge, while a learning rate that is too small makes the training process time-consuming [57]. Furthermore, a small learning rate can cause the training process to remain stuck in local minima [3]. Two typical ways of setting the learning rate are learning rate schedules and adaptive learning rates.

There are three typical learning rate schedules: constant, step decay and exponential decay. For the constant schedule, the learning rate is set manually and kept fixed. For step decay, the learning rate is reduced by a certain amount every few epochs. For example, the learning rate is halved every 5 epochs, or multiplied by 0.1 every 20 epochs [12].

The exponential decay learning rate follows the formula $a = a_0 e^{-kt}$, where $a_0$ and $k$ are hyper-parameters and $t$ is the iteration step [12]. A typical drawback of learning rate schedules is that they use the same learning rate for all parameters; adaptive learning rate methods provide potential solutions to this problem.
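The three schedules can be written as small functions of the epoch or iteration index. The default values below (halving every 5 epochs, an initial rate of 0.1) mirror the example in the text; the function names are assumptions.

```python
import math


def constant_lr(a0, epoch):
    """Constant schedule: the manually chosen rate never changes."""
    return a0


def step_decay_lr(a0, epoch, drop=0.5, epochs_per_drop=5):
    """Step decay: multiply the rate by `drop` every `epochs_per_drop` epochs."""
    return a0 * drop ** (epoch // epochs_per_drop)


def exponential_decay_lr(a0, t, k):
    """Exponential decay: a = a0 * exp(-k * t)."""
    return a0 * math.exp(-k * t)


a0 = 0.1
step_rates = [step_decay_lr(a0, epoch) for epoch in (0, 4, 5, 12)]
exp_rates = [exponential_decay_lr(a0, t, k=0.1) for t in (0, 10, 20)]
```

Here `step_rates` stays at 0.1 for epochs 0 to 4, drops to 0.05 at epoch 5 and reaches 0.025 by epoch 12, while `exp_rates` shrinks smoothly by a factor of e every 10 steps.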

The popular adaptive learning rate methods discussed in this part are Adagrad [18], Adadelta [71], RMSprop, and Adam.

The Adagrad [18] adaptive learning rate algorithm sets different learning rates for different parameters. The method works well for sparse data, since it adapts the learning rate to how frequently each feature occurs. The algorithm sets low learning rates for frequently occurring features; conversely, it sets high learning rates for infrequent features. Let $G_t$ be the diagonal matrix in which each diagonal element contains the sum of the squares of the past gradients with respect to the corresponding parameter in $\theta$ up to step $t$, $g_t$ the gradient at step $t$, $\epsilon$ a very small value that ensures the denominator is not equal to zero, and $\odot$ the product between the scaling term built from $G_t$ and $g_t$, which amounts to an element-wise multiplication because $G_t$ is diagonal. The values of the weights are updated at each step as [18]

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \odot g_t.$$
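A minimal sketch of this update stores only the diagonal of $G_t$ as a list with one entry per weight. The toy loss $J(\theta) = \sum_i \theta_i^2$, the learning rate of 0.1 and the function names are assumptions for illustration.

```python
import math


def adagrad_step(theta, grad, accum, lr=0.1, eps=1e-8):
    """One Adagrad update; `accum` holds the diagonal of G (sums of squared gradients)."""
    new_accum = [a + g * g for a, g in zip(accum, grad)]
    new_theta = [
        t - lr / math.sqrt(a + eps) * g
        for t, a, g in zip(theta, new_accum, grad)
    ]
    return new_theta, new_accum


def gradient(theta):
    """Gradient of the toy loss J(theta) = sum(theta_i ** 2)."""
    return [2.0 * t for t in theta]


theta = [1.0, -2.0]
accum = [0.0, 0.0]
effective_lrs = []  # per-weight rate lr / sqrt(G + eps) after each step
for _ in range(3):
    grad = gradient(theta)
    theta, accum = adagrad_step(theta, grad, accum)
    effective_lrs.append([0.1 / math.sqrt(a + 1e-8) for a in accum])
```

The recorded `effective_lrs` shrink from step to step for every weight, because the accumulated squared gradients can only grow.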

The accumulation of the squared gradients in $G_t$ is also the main drawback of this method: because every added term is non-negative, the accumulated sum keeps growing during training, so the learning rates become very small and eventually vanish, and the training process gets stuck. This disadvantage of Adagrad [18] is addressed by the Adadelta [71] method, which allows the learning process to continue to make progress even when the number of epochs grows significantly. Given that $E[g^2]_t$ is the running average of the squared gradients at time $t$ and $\rho$ is a decay constant, $E[g^2]_t$ is estimated by

$$E[g^2]_t = \rho E[g^2]_{t-1} + (1 - \rho) g_t^2.$$

Adadelta replaces $G_t$ by $E[g^2]_t$ in the denominator, so that the update $\Delta\theta_t$ applied to the weights becomes

$$\Delta\theta_t = -\frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t.$$

The next step of the Adadelta method is to make the units of the update match the units of the weights. This is handled by a second running average, this time of the squared updates,

$$E[\Delta\theta^2]_t = \rho E[\Delta\theta^2]_{t-1} + (1 - \rho) \Delta\theta_t^2.$$

Denote $RMS[\Delta\theta]_t = \sqrt{E[\Delta\theta^2]_t + \epsilon}$ and $RMS[g]_t = \sqrt{E[g^2]_t + \epsilon}$. Adadelta uses $RMS[\Delta\theta]_{t-1}$ in place of the learning rate $\eta$, and finally the weights are updated at each step as [71]

$$\theta_{t+1} = \theta_t - \frac{RMS[\Delta\theta]_{t-1}}{RMS[g]_t} g_t.$$

The next popular adaptive learning rate method is RMSprop [62]. It is an unpublished method proposed by Geoff Hinton [47] in his online course [62]. This method addresses the diminishing learning rate problem of the Adagrad method. The running average $E[g^2]_t$ and the weight update $\Delta\theta_t$ are estimated in the same way as in the first step of Adadelta, with the decay constant set to 0.9:

$$E[g^2]_t = 0.9 E[g^2]_{t-1} + 0.1 g_t^2$$

and

$$\Delta\theta_t = -\frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t.$$

Adaptive Moment Estimation (Adam) [35] estimates an individual learning rate for each parameter based on estimates of the first moment $m_t$ and the second moment $v_t$ of the gradient. Given that $\beta_1$ and $\beta_2$ are hyper-parameters and $g_t$ is the gradient at time $t$, $m_t$ and $v_t$ are estimated as

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2.$$

Because $m_t$ and $v_t$ are initialized at zero, they are biased towards zero during the first steps. Their values are therefore modified to counteract these biases by the bias corrections

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}.$$

In the final step, the values of the parameters are updated by

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t.$$

In practice, the Adam method [35] has gained popularity for optimizing deep learning training and demonstrates excellent results.
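The Adam update can be sketched as a single function that carries the moment estimates from step to step. The default hyper-parameters below are the commonly quoted ones and, like the toy loss $J(\theta) = \theta^2$, are assumptions for illustration rather than the settings used in this thesis.

```python
import math


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (t starts at 1); returns new theta, m and v."""
    # Running estimates of the first and second moments of the gradient.
    m = [beta1 * mi + (1.0 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1.0 - beta2) * g * g for vi, g in zip(v, grad)]
    # Bias correction for the zero initialization of m and v.
    m_hat = [mi / (1.0 - beta1 ** t) for mi in m]
    v_hat = [vi / (1.0 - beta2 ** t) for vi in v]
    theta = [
        th - lr * mh / (math.sqrt(vh) + eps)
        for th, mh, vh in zip(theta, m_hat, v_hat)
    ]
    return theta, m, v


# Minimize the toy loss J(theta) = theta^2, whose gradient is 2 * theta.
theta, m, v = [1.0], [0.0], [0.0]
for t in range(1, 1001):
    grad = [2.0 * th for th in theta]
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.01)
```

On the very first step the bias-corrected ratio equals the sign of the gradient, so the weight moves by almost exactly the learning rate regardless of the gradient's magnitude.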

2.1.4.d Batch Normalization

Batch Normalization (BN) is a training technique that accelerates the training process and maintains the stability of deep neural networks, since it smooths the optimization landscape [54]. For each training mini-batch, the output of a layer is normalized with the mini-batch mean and variance and then scaled and shifted [33] as

$$BN_{\gamma,\beta}(x_i) = \gamma \frac{x_i - \mu_x}{\sqrt{\sigma_x^2 + \epsilon}} + \beta,$$

where $x_i$ is the value of each element in the mini-batch $x$, $\mu_x$ and $\sigma_x^2$ are the mean and variance of the mini-batch $x$, $\epsilon$ is a small constant that is added to the variance to avoid dividing by zero, and $\gamma$, $\beta$ are learnable scale and shift parameters.
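The normalization formula can be sketched for a one-dimensional mini-batch as follows. This covers only the training-time forward computation; the running statistics used at inference time and the learning of $\gamma$ and $\beta$ are left out, and the numeric values are assumed examples.

```python
import math


def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a mini-batch of scalars, then scale by gamma and shift by beta."""
    n = len(x)
    mu = sum(x) / n
    var = sum((xi - mu) ** 2 for xi in x) / n
    return [gamma * (xi - mu) / math.sqrt(var + eps) + beta for xi in x]


def mean_and_variance(values):
    """Helper for inspecting the statistics of a list of numbers."""
    mu = sum(values) / len(values)
    return mu, sum((v - mu) ** 2 for v in values) / len(values)


batch = [1.0, 2.0, 3.0, 4.0]
normalized = batch_norm(batch)                       # mean ~0, variance ~1
rescaled = batch_norm(batch, gamma=2.0, beta=3.0)    # mean ~3, variance ~4
```

After normalization the batch has approximately zero mean and unit variance, and the learnable parameters then set the mean to $\beta$ and the standard deviation to $\gamma$.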

Implementing BN allows much higher learning rates to be set for the training process, reduces the sensitivity to the initial neural parameters and also has a regularizing effect similar to "dropout" layers; therefore, it reduces the training time, makes training more stable and helps prevent overfitting [33]. For denoising with neural networks, BN and residual learning support each other, which boosts the performance of the neural models and enhances the denoising quality [72].

(26)

2.1.5 Typical training problems of deep neural networks

Deep neural networks are challenging to train because typical problems such as vanishing/exploding gradients [5] and diminishing feature reuse [58] prevent the networks from converging [32]. This part presents those problems.

2.1.5.a Vanishing gradient

This problem prevents the weights of networks with a large number of layers from being updated in the backward propagation step, because the training gradients become very small. It results in longer training times and degrades the performance of the whole network.

2.1.5.b Exploding gradient

Exploding gradients prevent the network from learning from the data when the weights are updated by very large gradients. In the worst case, the weights become so large that they cause numerical overflow.
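Both gradient problems can be illustrated with a deliberately idealized sketch: if backpropagation multiplies the gradient by roughly the same factor at every layer, the gradient reaching the first layer is that factor raised to the number of layers. The single constant per-layer factor and the function name are assumptions made for illustration; in a real network the factor depends on the weights and the activation derivatives of each layer.

```python
def backprop_gradient_scale(per_layer_factor, num_layers):
    """Scale of the gradient after passing backwards through num_layers layers."""
    scale = 1.0
    for _ in range(num_layers):
        scale *= per_layer_factor   # chain rule: one multiplicative factor per layer
    return scale

# 0.25 is the largest possible derivative of the sigmoid activation.
vanishing = backprop_gradient_scale(0.25, 30)   # shrinks towards zero
exploding = backprop_gradient_scale(1.5, 30)    # grows without bound as depth increases
```

A factor below one makes the gradient vanish exponentially with depth, while a factor above one makes it explode; only factors close to one keep it usable.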

2.1.5.c Diminishing feature reuse

Diminishing feature reuse happens in the forward propagation step: the learning capacity of the later layers of the network is gradually reduced, so these layers contribute less to the network.

(27)

2.2 Image denoising

In the scope of this thesis, the selected state-of-the-art denoising methods are grouped into three categories: non-local methods, learning-based methods, and methods that combine non-local and learning-based strategies. The structure of the selected denoising categories is shown in Figure 2.16.

(2.3) State-of-the-Art Denoising

    (2.3.1) Non-local methods: (2.3.1.a) NLM, (2.3.1.b) BM3D, (2.3.1.c) BM3D-SAPCA, (2.3.1.d) WNNM

    (2.3.2) Learning-based methods: (2.3.2.a) CNN early approach, (2.3.2.b) DnCNN, (2.3.2.c) GCBD, (2.3.2.d) Noise2Noise (N2N), (2.3.2.e) MWCNN

    (2.3.3) Non-local and learning-based methods: (2.3.3.a) BMCNN, (2.3.3.b) NN3D, (2.3.3.c) BM3D-Net

Figure 2.16. The structure of the selected state-of-the-art denoising methods.

2.3 State-of-the-Art Denoising Algorithms

The two main denoising strategies to date are non-local self-similarity (NLS) and CNN based methods. NLS removes noise by harnessing the redundant features in natural images and estimating the correlation between different blocks of the same image [44]. The approach has been prominent for decades due to its high efficiency in denoising tasks. NLM [6] is considered the first denoising method based on NLS, while BM3D [13] and WNNM [25] are the most successful methods derived from this strategy.

Learning based methods recover degraded images with trainable models, normally trained on noisy images paired with the expected output images. In recent years, these methods have been gaining popularity in denoising tasks, as the problems of training very deep neural networks have been addressed by advanced training techniques such as residual learning [28] and batch normalization [33], and by taking advantage of Graphics Processing Units (GPUs). The results of learning-based methods are highly competitive with NLS based methods [6, 13]. The most popular and significant learning-based denoising methods are DnCNN [72], GCBD [8], N2N [41] and MWCNN [43].

To combine the advantages and address the limitations of both state-of-the-art non-local and learning based methods, some denoising methods [2, 11, 39, 64] integrate the core idea of non-local methods into neural networks.

2.3.1 Non-local denoising methods

This section discusses the pioneering non-local method (NLM) and other powerful denoising methods based on non-local ideas. The selected methods are NLM [6], BM3D [13], BM3D-SAPCA [16] and WNNM [25], as they show excellent performance in image denoising.

2.3.1.a NLM

The NLM algorithm exploits the redundancy and similarity of different square blocks in natural images. The correlation between two pixels at the centers of different square blocks in the noisy image is estimated with a weighted Euclidean distance, calculated as

$$\|v(N_i) - v(N_j)\|_{2,a}^2,$$

where $i$, $j$ are the center pixels of the square neighborhoods $N_i$, $N_j$ in the image $v$, and $a > 0$ is the standard deviation of the Gaussian kernel.

Target pixels that are highly correlated with pixel $i$ receive weights higher than the average weight. For example, in Figure 2.17, at pixel $p$ the highly correlated pixels $q_1$, $q_2$ produce larger weights $w(p, q_1)$, $w(p, q_2)$ than the weight $w(p, q_3)$ of the less similar pixel $q_3$.

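The weights $w(i, j) < 1$ are estimated by

$$w(i, j) = \frac{1}{Z(i)} \exp\left(-\frac{\|v(N_i) - v(N_j)\|_{2,a}^2}{h^2}\right),$$

where $h$ controls the decay of the exponential function and $Z(i)$ is the normalizing constant

$$Z(i) = \sum_j \exp\left(-\frac{\|v(N_i) - v(N_j)\|_{2,a}^2}{h^2}\right).$$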

Figure 2.17. An illustration of the NLM algorithm [6] with three target pixels $q_1$, $q_2$, $q_3$ for the pixel $p$. Since the blocks at pixels $q_1$, $q_2$ have a higher correlation with the block at $p$ than the block at $q_3$, the weights satisfy $w(p, q_1), w(p, q_2) > w(p, q_3)$.

Given the noisy image $v = \{v(i) \mid i \in I\}$, each pixel $i$ is recovered as the weighted average of the pixels in the image, in which the pixels with the most similar neighborhoods receive the largest weights:

$$NL[v](i) = \sum_{j \in I} w(i, j)\,v(j).$$

As an early attempt to exploit the non-local strategy for recovering degraded images, NLM has made a significant contribution by inspiring more advanced denoising algorithms based on the non-local approach.
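The NLM weighting and averaging steps can be sketched for a small gray-scale image. This is a simplified sketch under stated assumptions: every pixel is compared with every other pixel (no restricted search window), borders are handled by reflection, and the parameter names `patch_radius`, `h` and `a` and their default values are chosen for illustration only.

```python
import numpy as np

def nlm_denoise(v, patch_radius=1, h=0.3, a=1.0):
    """Non-local means on a small 2D image using an exhaustive search."""
    r = patch_radius
    size = 2 * r + 1
    rows, cols = v.shape
    padded = np.pad(v, r, mode="reflect")
    # Gaussian kernel with standard deviation a that weights the patch distance
    ax = np.arange(-r, r + 1)
    kernel = np.exp(-(ax[:, None] ** 2 + ax[None, :] ** 2) / (2 * a ** 2))
    k = (kernel / kernel.sum()).ravel()
    # One flattened neighborhood v(N_i) per pixel i
    patches = np.empty((rows * cols, size * size))
    for i in range(rows):
        for j in range(cols):
            patches[i * cols + j] = padded[i:i + size, j:j + size].ravel()
    flat = v.ravel()
    out = np.empty(rows * cols)
    for idx in range(rows * cols):
        d2 = ((patches - patches[idx]) ** 2 * k).sum(axis=1)  # weighted squared distance
        w = np.exp(-d2 / h ** 2)
        w /= w.sum()                                          # division by Z(i)
        out[idx] = (w * flat).sum()                           # weighted average of pixels
    return out.reshape(rows, cols)
```

The exhaustive search costs a quadratic number of patch comparisons in the number of pixels, which is why the sketch is only suitable for very small images.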

2.3.1.b BM3D

BM3D [14] is one of the most effective denoising methods. It exploits the correlation between sliding blocks of the degraded image, found by "block-matching", and recovers the image by applying a denoising filter to 3D groups of matched blocks.

The first step of BM3D is block-matching; this step is responsible for finding similar blocks in the processed image. For example, in Figure 2.18, the similar square blue blocks match the reference block marked with R. The matching blocks are stacked together into a 3D array, based on a threshold $T$ and a chosen distance function $d$ in the transform domain. The 3D array of dimensions $(N_1, N_1, S_n)$ is sorted by the level of block correlation, where $n$ is the index position and $N_1 \times N_1$ is the size of the stacked blocks $S_n$ extracted from the noisy image.

Figure 2.18. An illustration of the block-matching algorithm [14].

A block is matched and stacked into the 3D array if the distance $d$ (normally the $L^2$-norm) is at most the threshold $T$. Each block can be assigned to one or several 3D groups.

The next step, called "collaborative filtering", is responsible for denoising in the 3D transform domain: each group of stacked blocks is transformed by a 3D transform to obtain a sparse representation, and the noise is removed by hard thresholding. Since overlapping blocks are assigned to different 3D groups, an adaptive weight is defined for each group based on the number of non-zero transform coefficients.

The recovered image is composed of the blocks obtained after the inverse 3D transform, averaged with the weights created in the previous step. The performance of BM3D can be improved by adjusting the distance function and by replacing the hard-threshold operation with a Wiener filter.
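The block-matching and collaborative hard-thresholding steps can be sketched as follows. This is a simplified illustration, not the BM3D implementation: the distance is computed in the spatial domain, a 3D FFT stands in for the 3D transform, the weight rule is a plain reciprocal of the number of retained coefficients, and the aggregation step is omitted. All function and parameter names are assumptions.

```python
import numpy as np

def match_blocks(image, ref_yx, block=4, threshold=0.5, max_blocks=8):
    """Top-left corners of blocks whose distance to the reference block is <= threshold."""
    ry, rx = ref_yx
    ref = image[ry:ry + block, rx:rx + block]
    candidates = []
    for y in range(image.shape[0] - block + 1):
        for x in range(image.shape[1] - block + 1):
            cand = image[y:y + block, x:x + block]
            d = float(np.sum((ref - cand) ** 2)) / block ** 2   # normalized squared L2 distance
            if d <= threshold:
                candidates.append((d, y, x))
    candidates.sort()                                           # most similar blocks first
    return [(y, x) for _, y, x in candidates[:max_blocks]]

def collaborative_hard_threshold(image, coords, block=4, tau=1.0):
    """Stack matched blocks, hard-threshold their 3D spectrum, return blocks and weight."""
    group = np.stack([image[y:y + block, x:x + block] for y, x in coords])
    spectrum = np.fft.fftn(group, norm="ortho")                 # stand-in for the 3D transform
    keep = np.abs(spectrum) > tau                               # hard threshold
    filtered = np.real(np.fft.ifftn(spectrum * keep, norm="ortho"))
    weight = 1.0 / max(int(keep.sum()), 1)                      # sparser groups get larger weights
    return filtered, weight
```

In a full pipeline, the filtered blocks would be returned to their original positions and averaged with these weights wherever blocks overlap.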

2.3.1.c BM3D-SAPCA

BM3D-SAPCA [16] is an improved version of BM3D [14]. The core algorithm of BM3D-SAPCA is largely similar to BM3D; however, additional techniques in the "block-matching" step allow the algorithm to improve both the quantitative and the visual quality of the results. The flow diagram of this method is shown in Figure 2.19.

Figure 2.19. A process flow diagram of BM3D-SAPCA [16].

In the first step, "block-matching", an adaptive-shape neighborhood is obtained for each raster-scanned pixel of the degraded image, e.g. using LPA-ICI [15, 20] in eight directions around the corresponding center pixel. Each adaptive-shape neighborhood is used to find highly correlated neighborhoods, which are extracted from similar blocks. The similar blocks are found by the BM3D block-matching method [14] based on the corresponding adaptive-shape neighborhood. To decide which group of adaptive-shape neighborhoods is transformed by PCA and filtered, a fixed threshold $T$ is compared with the ratio

$$K = \frac{N_{gr}}{N_{el}},$$

where $N_{gr}$ and $N_{el}$ are the number of matched blocks and the number of pixels of the given adaptive-shape neighborhood. In this step, PCA is used to obtain highly correlated adaptive-shape neighborhoods, which are formed into a 3D group for the subsequent denoising and aggregation steps. The denoising and aggregation processes are essentially the same as the collaborative filtering and aggregation in BM3D. The algorithm can be iterated several times to refine its performance [16].

2.3.1.d WNNM

WNNM [25] is a low-rank approximation technique that is combined with NLS to achieve excellent denoising performance. Similar to BM3D [14], the method first searches for non-local similar patches of fixed size in the degraded image and groups them into a matrix denoted by $Y_j$, where $Y$ is built from the fixed-size patches and $j$ is the index position. Denoting by $X$ and $N$ the ground-truth image and the residual noise, the observation model is

$$Y_j = X_j + N_j.$$

$X_j$ can be considered a low-rank matrix, as it is stacked from patches that are highly correlated with each other. Therefore, $X_j$ can be estimated from $Y_j$ by low-rank matrix approximation. With the $F$-norm data fidelity term, $\hat{X}_j$ is estimated as

$$\hat{X}_j = \arg\min_{X_j} \frac{1}{\sigma_n^2}\,\|Y_j - X_j\|_F^2 + \|X_j\|_w,$$

where $\|X_j\|_w$ is the weighted nuclear norm, $\|X_j\|_w = \sum_i w_i\,\sigma_i(X_j)$, $w = [w_1, \ldots, w_n]^\top$ with $w_i \ge 0$ is the weight vector, and $\sigma_i(X_j)$ is the $i$-th largest singular value of $X_j$. The solution for each group is $\hat{X}_j = U\,S_{\lambda/2}(\Sigma)\,V^\top$, where $Y_j = U \Sigma V^\top$ is the singular value decomposition of $Y_j$ and $S_{\lambda/2}$ is the thresholding operator associated with the weight vector $w$, calculated as

$$S_{\lambda/2}(\Sigma)_{ii} = \max\left(\Sigma_{ii} - \frac{\lambda}{2},\, 0\right).$$

The final output image $X$ is aggregated from all estimated groups $\hat{X}_j$.
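The thresholding of singular values can be sketched as below. The sketch assumes one threshold per singular value, supplied as a hypothetical `weights` argument; with a constant value it reduces to thresholding every singular value by the same amount. It illustrates the operator only, and the rule WNNM uses to choose the weights is not reproduced here.

```python
import numpy as np

def weighted_singular_value_threshold(Y, weights):
    """Soft-threshold the singular values of Y and rebuild the low-rank estimate."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)   # Y = U diag(s) Vt, s sorted descending
    s_thr = np.maximum(s - weights, 0.0)               # max(sigma_i - w_i, 0)
    return (U * s_thr) @ Vt                            # U diag(s_thr) Vt
```

Small singular values, which mostly carry noise, are set to zero, so the estimate has a lower rank than the noisy group matrix.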

2.3.2 Learning based methods

This section covers learning-based denoising approaches that have influenced later methods or provide promising denoising results in comparison with classical methods. The first method presented [34] is an early attempt to design a simple neural network for image denoising. The next selected method is DnCNN [72], which is built on a very deep neural network. We will show that it is possible to extend it to obtain excellent denoising results. The other methods, GCBD [8], N2N [41] and MWCNN [43], respectively use a Generative Adversarial Network (GAN) learning strategy, unpaired training data and the wavelet transform to make advancements in image denoising.

2.3.2.a Earlier CNN models for denoising

The early CNN model for denoising [34] was trained to recover degraded natural images under known and blind noise conditions. The network is composed of four hidden layers with 24 feature maps each. Each feature map is randomly connected to 8 feature maps of the next layer through 5x5 convolution filters. The architecture is shown in Figure 2.20. The network is trained to minimize the difference between the predicted output and the ground-truth images in order to find the optimal weight values. Although this network is much shallower than current deep CNN models [59, 60, 72], it achieves comparable or even better performance than the wavelet-based and Markov Random Field (MRF) based methods proposed in the same decade. This initial approach to image denoising has inspired more advanced learning-based methods for the same purpose.

Figure 2.20. Early proposed CNN architecture [34] for denoising, with four hidden layers of 24 feature maps between input and output; I_x,y denotes the feature map in hidden-layer column x and row y.

2.3.2.b DnCNN

DnCNN [72] is a deep residual CNN architecture for removing AWGN from images. The degraded image is modeled as $y = x + n$, where $x$, $n$ and $y$ are the ground-truth image, the residual noise and the noisy image, respectively. The model is trained to predict the residual image $\hat{n}$, and the denoised output is obtained by subtracting the predicted residual from the noisy image ($\hat{x} = y - \hat{n}$). The loss function $L$ [72] of the model is adapted to residual learning as

$$L(\theta) = \frac{1}{2N} \sum_{i=1}^{N} \|R(y_i; \theta) - (y_i - x_i)\|_F^2.$$

Here $R$ is the residual learning model and $\theta$ denotes its trainable weights; $N$ is the number of training samples and $i$ is the sample index.
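The residual loss above can be sketched with a generic callable in place of the network. The name `predict_residual` is a hypothetical stand-in for $R(\cdot;\theta)$, and the batch layout (samples along the first axis) is an assumption.

```python
import numpy as np

def residual_loss(predict_residual, noisy, clean):
    """L = 1/(2N) * sum_i ||R(y_i) - (y_i - x_i)||_F^2 for a batch of N images."""
    n = noisy.shape[0]
    target = noisy - clean                  # true residual (noise) of each sample
    pred = predict_residual(noisy)          # residual predicted by the model
    return float(np.sum((pred - target) ** 2)) / (2 * n)
```

A model that predicts the noise exactly reaches a loss of zero, and the denoised image is then recovered as the noisy image minus the prediction.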

A very deep CNN architecture increases the capacity of the model to learn from training images [72]. However, very deep neural networks are hard to train. DnCNN addresses the typical difficulties of training such models by combining residual learning [28] and batch normalization [33]. These two techniques improve training speed and denoising performance, and they also benefit from each other [72].

In Figure 2.21, three types of layers are composed to form DnCNN. (i) The first layer (Conv + ReLU) is a convolution followed by ReLU; the noisy image is processed by 64 filters of kernel size 3 x 3. (ii) The following hidden layers (Conv + BN + ReLU) also use 64 filters of kernel size 3 x 3, with BN applied between the convolution and the ReLU. (iii) The last layer (Conv) reconstructs the predicted residual image.

Figure 2.21. DnCNN architecture: noisy image, Conv+ReLU, repeated Conv+BN+ReLU layers, Conv, residual error.

DnCNN can be trained for, and performs well on, AWGN with either a known or a blind noise level. Additionally, it can be flexibly adapted to develop more advanced denoising CNN models [72].
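As a worked example of the layer description above, the receptive field and the number of convolution weights of such a stack of 3 x 3 layers can be computed directly. The depth is left as a parameter, and the value 17 used below is an assumed example; biases and BN parameters are not counted.

```python
def dncnn_summary(depth, channels=64, kernel=3, in_channels=1):
    """Receptive field and convolution weight count of a plain stack of conv layers."""
    receptive_field = (kernel - 1) * depth + 1                      # each 3x3 layer adds 2 pixels
    first = in_channels * channels * kernel * kernel                # Conv + ReLU
    middle = (depth - 2) * channels * channels * kernel * kernel    # Conv + BN + ReLU layers
    last = channels * in_channels * kernel * kernel                 # final Conv
    return receptive_field, first + middle + last

field, weights = dncnn_summary(depth=17)   # assumed depth: 35 x 35 field, 554112 weights
```

The receptive field grows only linearly with depth, which is one reason why making such a network deeper is an expensive way to enlarge the image context it sees.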

2.3.2.c GAN-CNN Based Blind Denoiser

GCBD [8] is considered the first learning-based method that models noise with a GAN [23]. The main goal of this model is to train a neural network that works with unknown noise, since most deep neural networks are trained for specific noise levels. The proposed model processes noisy images in 2 stages: noise modeling and denoising.

In the first stage, a GAN module is trained to model the unknown noise from the noisy images, and training pairs are created from the generated noise and clean images. In the second stage, the training data generated with the GAN is fed into a deep CNN for denoising. The deep neural model in this approach is similar to DnCNN [72]. An overview of the model is shown in Figure 2.22.

Since it is complicated to train a GAN [23] for the denoising task itself, the use of the GAN in this solution is limited to modeling the noise and creating the paired training data.

Figure 2.22. An overview of the GCBD architecture [8]; X, Y are the paired training dataset. (Diagram blocks: unpaired data with noisy images and clean images, noise block extraction, noise blocks, generative adversarial network, paired data {X, Y}, convolutional neural network.)

2.3.2.d Noise2Noise - Learning based denoising method without clean data.

Traditionally, neural networks are trained by mapping given inputs to targeted outputs. N2N [41], however, presents a solution that allows the neural model to learn without clean reference outputs. In particular, the proposed model is able to learn to reduce noise from training data that contains no clean targets.

The traditional learning approach is to design a neural network $f_\theta$ that is trained until the loss function $L$ reaches an optimal value. Given the set of inputs $x$ and the corresponding output samples $y$ that depend on them, the training process is equivalent to

$$\arg\min_\theta\; \mathbb{E}_{(x,y)}\{L(f_\theta(x), y)\}.$$

By conditioning on the input, the learning objective of the same network can be rewritten as

$$\arg\min_\theta\; \mathbb{E}_{x}\{\mathbb{E}_{y|x}\{L(f_\theta(x), y)\}\}.$$

Figure 2.23. The detailed architecture of MWCNN [43].

In this form, training decomposes into a separate pixel estimation problem for each input sample. By removing the dependence on clean paired training data in this way, the proposed method recovers degraded images without using ground-truth images.
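The idea of training without ground truth can be illustrated for the L2 loss, whose minimizer over a set of targets is their mean: if the targets are corrupted by zero-mean noise, the minimizer stays approximately the same. The sketch below checks this for a single pixel value; the function names, the Gaussian noise and the chosen numbers are assumptions for illustration, not the N2N training procedure.

```python
import numpy as np

def l2_optimal_prediction(targets):
    """The constant z minimizing mean((z - y)^2) over the targets y is their mean."""
    return float(np.mean(targets))

def compare_clean_and_noisy_targets(clean_value=0.7, sigma=0.2, n=20000, seed=0):
    """Optimal predictions when training on clean targets and on noisy targets."""
    rng = np.random.default_rng(seed)
    clean_targets = np.full(n, clean_value)
    noisy_targets = clean_value + rng.normal(0.0, sigma, size=n)   # zero-mean noise
    return l2_optimal_prediction(clean_targets), l2_optimal_prediction(noisy_targets)
```

With enough samples, both predictions agree up to the sampling error of the noise, which is the property that allows noisy targets to replace clean ones under this loss.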

2.3.2.e Wavelet-CNN for Image Restoration

The aim of MWCNN [43] is to balance learning capacity and computational cost by processing images at multiple levels of the wavelet transform [43]. It adapts the U-net architecture [51] and residual learning to predict the residual noise matrix. The U-net architecture was originally designed for image segmentation tasks, but it also achieves very good results for denoising. MWCNN performs the denoising task in 2 stages: decomposition and reconstruction.

In the first stage, the data is processed at different levels. At each level of the wavelet transform, the input image is decomposed into a specific number of sub-band images, obtained by convolution with filters followed by downsampling. At the next level, the sub-band images are decomposed again by the same process. Higher levels of the multi-level wavelet transform are obtained recursively, and a CNN block is applied at each level.

In the reconstruction stage, the decomposition process is reversed by the corresponding convolution and upsampling operations. The output of the model is the residual noise matrix. The final prediction is estimated from the degraded image and the residual noise. The details of the Wavelet-CNN model are illustrated in Figure 2.23.
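One level of the decomposition and its exact inverse can be sketched with the Haar wavelet, written as operations on 2 x 2 pixel blocks. The choice of the Haar wavelet, the sub-band naming and the function names are assumptions for illustration; the CNN blocks applied between the levels are omitted, and the image sides are assumed to be even.

```python
import numpy as np

def haar_dwt2(x):
    """One-level 2D Haar transform: returns the four half-resolution sub-bands."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]      # top-left and top-right pixel of each 2x2 block
    c, d = x[1::2, 0::2], x[1::2, 1::2]      # bottom-left and bottom-right pixel
    ll = (a + b + c + d) / 2.0               # low-pass approximation
    lh = (-a - b + c + d) / 2.0              # vertical detail
    hl = (-a + b - c + d) / 2.0              # horizontal detail
    hh = (a - b - c + d) / 2.0               # diagonal detail
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuilds the full-resolution image without loss."""
    rows, cols = ll.shape
    x = np.empty((2 * rows, 2 * cols), dtype=float)
    x[0::2, 0::2] = (ll - hl - lh + hh) / 2.0
    x[0::2, 1::2] = (ll + hl - lh - hh) / 2.0
    x[1::2, 0::2] = (ll - hl + lh - hh) / 2.0
    x[1::2, 1::2] = (ll + hl + lh + hh) / 2.0
    return x
```

Unlike pooling, this downsampling is invertible, so no image information is lost when the spatial resolution is halved; applying `haar_dwt2` again to the sub-bands gives the next level.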

Visually, the method preserves detailed patterns and sharp structures of the noisy image better, owing to the subsampling and the enlarged receptive field. Quantitatively, MWCNN shows comparable performance to other state-of-the-art denoising methods.

(36)

2.3.3 Non-local and learning based methods

Combining non-local methods and neural networks is a promising approach for improving denoising performance. This part presents several denoising methods that integrate non-local and learning-based methods, including BMCNN, NN3D and BM3D-Net.

2.3.3.a BMCNN

BMCNN [2] is a denoising method that aims to take advantage of both NLS and CNN. A notable element of the approach is that the neural network is fed with groups of similar patches gathered by the block-matching algorithm from both the noisy image and a denoised (pilot) image. This mixing strategy helps prevent the reconstructed images from losing details. The flow diagram of this algorithm is shown in Figure 2.24.

Figure 2.24. The flow of BMCNN. (Diagram blocks: noisy image, preprocessing, pilot image, block matching, noisy patch blocks, pilot patch blocks, concatenate, input blocks, denoising network, denoised patches, aggregation, result image.)

The BMCNN model is composed of three stages, built with residual learning and batch normalization, that are responsible for feature extraction, feature refinement and reconstruction. The neural model is trained by residual learning to predict the residual noise of each patch, and the predicted image is then aggregated from the resulting denoised patches. The general architecture of the model is shown in Figure 2.25.

Figure 2.25. The general architecture of BMCNN.

(37)

The BMCNN method combines the advantages of non-local and learning-based techniques to improve the visual quality of both irregular and repetitive structures.

2.3.3.b NN3D

NN3D [11] is a powerful denoising method inspired by both NLS and learning-based methods. A pre-trained neural network can give excellent denoising results; however, reconstructing degraded images with a pre-trained network filter may cause hallucination effects. The proposed method therefore applies a non-local filter after the neural network to suppress hallucinations and improve the quality of the recovered images. When a non-local filter is applied to smooth out hallucinations, the processed images might in turn be oversmoothed. The method addresses both drawbacks by iteratively applying the pre-trained neural network followed by a non-local filter. Let $k$ be the iteration step and $\lambda_k$ the control parameter, which is gradually decreased as $k$ increases; let $z$ be the original noisy input, $\hat{z}_k$ the current input of the network and $\hat{y}_{k-1}$ the previous estimate. At iteration $k$, the network input is formed as $\hat{z}_k = \lambda_k z + (1-\lambda_k)\,\hat{y}_{k-1}$. The flow diagram of the proposed method is shown and described in Figure 2.26.

Figure 2.26. The flow of NN3D [11]; the denoising task is performed over $k$ iterations. The current input of the CNN-based filter (CNNF) is $\hat{z}_k$, the combination of the noisy image $z$ and the previous estimate $\hat{y}_{k-1}$, given by $\hat{z}_k = \lambda_k z + (1-\lambda_k)\,\hat{y}_{k-1}$. The non-local filter (NLF) processes the CNNF output $\tilde{y}_k$ and produces the estimate $\hat{y}_k$.

This method works well for different noise levels and neural networks. Since the pre-trained neural network and the non-local filter recover degraded images side by side, NN3D can improve on the performance of the pre-trained neural network while preserving image details [11].
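The iteration can be sketched with two callables standing in for the filters. Here `cnn_filter` and `nonlocal_filter` are hypothetical placeholders for the pre-trained CNNF and the NLF, and starting from the noisy image as the initial estimate is an assumption of this sketch.

```python
import numpy as np

def nn3d_iterations(z, cnn_filter, nonlocal_filter, lambdas):
    """Alternate a CNN-based filter and a non-local filter for len(lambdas) iterations."""
    y_hat = z                                        # assumed initial estimate: the noisy image
    for lam in lambdas:                              # lambda_k decreases over the iterations
        z_hat = lam * z + (1.0 - lam) * y_hat        # mix noisy input and previous estimate
        y_tilde = cnn_filter(z_hat)                  # CNNF output
        y_hat = nonlocal_filter(y_tilde)             # NLF output: current estimate
    return y_hat
```

Feeding a fraction of the original noisy image back into each iteration keeps details available that the previous filtering pass may have smoothed away.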

(38)

2.3.3.c BM3D-Net

BM3D-Net [69] is a network inspired by BM3D that allows the BM3D algorithm to benefit from the learning capacity of a neural network. The proposed neural network is composed of 5 layers: extraction, convolution, nonlinear transform, convolution and aggregation. The architecture of this approach is shown in Figure 2.27.

Figure 2.27. The general architecture of BM3D-Net.

The first layer, extraction, is responsible for extracting self-similar blocks and stacking them into a fixed-size 3D array, which is passed to the next convolution layer. The two following layers, the convolution layer and the nonlinear transform, act as the collaborative filtering of BM3D. In these layers, the weights of the network are trained on the matched groups in a learnable transform domain, followed by nonlinear transform operations. The next convolution layer is trained to convert the data from the transform domain back into the spatial domain. In the last layer, the recovered image is aggregated from the predicted image patches [69].

2.3.4 Comparative analysis of Denoising methods

NLS and learning based methods take different approaches to denoising. NLS based methods recover degraded images by analyzing, processing and aggregating similar patches within the same image, instead of learning a mapping between training inputs and outputs as learning based methods do.

Both kinds of methods can recover images in general cases; however, NLS performs better on noisy images with regular and repeated patterns, while learning based methods are favorable on irregular ones [2]. Combining both strategies can exploit the strengths of each method and therefore improve denoising performance.

Each of these denoising strategies also has its limitations. For NLS, it is difficult to find the optimal parameters and to parallelize the core algorithms and their implementation, as the algorithms involve complex optimization problems [2]. For certain noise levels, the state-of-the-art methods of this approach, such as BM3D and WNNM, are outperformed by learning based methods [11, 72]. Learning based methods, in turn, lack flexibility in practice, since the neural networks need to be trained with specific types of data and specific noise types and levels. Therefore, it is challenging to solve arbitrary noise problems. Additionally, learning based methods might produce unwanted phenomena such as hallucination [11]. In fact, the NLS-based method BM3D achieves better denoising performance on real images than modern deep learning methods trained on Gaussian noise [50]. Another drawback of learning based methods is the expensive training process, regarding both computational resources and technical aspects. A lack of computing devices such as powerful GPUs can be a major obstacle to training, and training techniques such as batch normalization, residual learning and careful initialization of the network parameters are required to make the models stable and to avoid undesirable problems such as vanishing/exploding gradients [22].

While NLS denoising methods have been extensively exploited for decades, learning-based approaches have drawn more attention only recently, and there is high potential for them to achieve further improvements in performance.

(40)

3 PROPOSED CNN BASED DENOISING METHOD: FLASHLIGHT CNN ARCHITECTURE FOR IMAGE DENOISING AND IMPLEMENTATION

This chapter introduces a learning-based method for image denoising called FLCNN, which employs a very deep CNN built from a combination of residual and inception layers. The proposed method is designed to improve the learning capacity by addressing typical deep neural network training problems that arise when the number of trainable parameters is significantly larger than that of the baseline networks composed only of residual layers. The chapter provides the general concepts of the proposed learning based method and its implementation procedures.

3.1 Proposed network background

Numerous successful learning-based methods for image denoising are based on DnCNN [72]. However, it is inefficient for those networks to improve their performance by increasing the number of learnable parameters, because the diminishing feature reuse problem limits the contribution of the last several layers. The neural network proposed in this thesis, FLCNN, is inspired by the DnCNN and Inception-ResNet [59] architectures. It uses skip connections to maintain the learning capacity of very deep neural networks, and inception layers to increase the total number of parameters in a way that translates to improved overall performance. Our approach surpasses the performance of the DnCNN architecture and alleviates the limitations of DnCNN based networks by overcoming diminishing feature reuse and increasing the receptive field of the network.

3.2 Proposed network architecture

FLCNN aims to denoise gray-scale images corrupted by AWGN. It is composed of two cascaded phases: warm up and boost. Warm up is made up of regular convolutional layers, while the boost phase is created by stacking customized inception layers.

Warm up is illustrated in Fig 3.1. The warm up stage is composed of sequential layers and each layer generates 64 feature maps. The first few layers use 3x3 kernels, and
