

Lappeenranta University of Technology School of Engineering Science

Degree Program in Computational Engineering and Technical Physics Intelligent Computing Major

Azat Garifullin

DEEP LEARNING FOR RETINAL IMAGE SEGMENTATION

Examiners: Professor Lasse Lensu, Professor Igor Gurov

Supervisors: Professor Lasse Lensu, Professor Igor Gurov


ABSTRACT

Lappeenranta University of Technology School of Engineering Science

Degree Program in Computational Engineering and Technical Physics Intelligent Computing Major

Azat Garifullin

Deep learning for retinal image segmentation

Master's Thesis 2017

57 pages, 19 figures, 3 tables

Examiners: Professor Lasse Lensu, Professor Igor Gurov

Keywords: Bayesian deep learning, image segmentation, spectral fundus imaging, blood vessel segmentation

Many eye diseases can be diagnosed through the characterization of the retinal blood vessels. The characterization can be done using proper imaging techniques and data analysis methods. Spectral fundus imaging is an approach providing hyperspectral images, where each channel corresponds to a certain wavelength. Compared to colour images, the spectral information may be more useful for automatic data analysis, since it contains more accurate information about the reflectance spectra of the sample. This work studies the application of a deep convolutional network to retinal blood vessel segmentation, compares the performance of the system on ordinary colour and hyperspectral images, and presents experiments on uncertainty estimation.


ACKNOWLEDGEMENTS

This thesis has been written as a part of my double degree studies in collaboration between Lappeenranta University of Technology and ITMO University. Thus, I would like to thank the Machine Vision and Pattern Recognition laboratory and the Computer Photonics and Digital Video Processing department for interesting courses, comfortable work conditions and financial support during my master's studies. In particular, I would like to express my gratitude to my supervisors Lasse Lensu and Igor Gurov for supporting and guiding me during the research projects. I would also like to thank the MVPR laboratory for the hot sauna and cold beer.

I am grateful to my friends and colleagues in Lappeenranta, Saint-Petersburg and Bashkortostan for supporting me and creating a nice atmosphere in any place of the world.

To my big family in Bashkortostan.


TABLE OF CONTENTS

1 INTRODUCTION 7
1.1 Background 7
1.2 Objectives and delimitations 8
1.3 Structure of the thesis 9

2 FUNDUS IMAGING 10
2.1 Eye structure 10
2.2 Blood vessels characterization 11
2.3 Spectral fundus imaging 12
2.4 DiaRetDB2 dataset 13

3 ARTIFICIAL NEURAL NETWORKS BASED SEMANTIC SEGMENTATION ALGORITHMS 15
3.1 Semantic segmentation 15
3.2 Deep neural networks 16
3.2.1 Neural networks 16
3.2.2 Convolutional neural networks 19
3.2.3 Greedy pretraining 19
3.2.4 Batch normalization 21
3.2.5 Dropout 21
3.2.6 Bayesian deep learning 23
3.3 Retinal image segmentation 26
3.4 Hyperspectral image segmentation 27

4 CONVOLUTIONAL ENCODER-DECODER ARCHITECTURE FOR RETINAL BLOOD VESSELS SEGMENTATION 29
4.1 SegNet architecture 29
4.2 Dimensionality reduction layers 31
4.3 Regularizations 32
4.4 Data preprocessing 33
4.5 Data augmentation 34
4.6 Optimisation settings 34

5 EXPERIMENTAL RESULTS 36
5.1 System evaluation 36
5.2 Comparison of SegNet and DR-SegNet architectures 37
5.3 Comparison of RGB and hyperspectral images 41
5.4 Uncertainty estimation 42

6 DISCUSSION 48
6.1 Methods and Results 48
6.2 Future Work 48

7 CONCLUSION 50

REFERENCES 51


LIST OF SYMBOLS AND ABBREVIATIONS

ACC Accuracy

AE Autoencoder

ANN Artificial Neural Network

AUC Area Under the Curve

BLDE Balanced Local Discriminant Embedding

CCD Charge-Coupled Device

CD-k Contrastive Divergence k

CE Cross Entropy

CLAHE Contrast Limited Adaptive Histogram Equalization

CNN Convolutional Neural Network

DBN Deep Belief Network

DNN Deep Neural Network

DiaRetDB2 Diabetic Retinopathy Data Base version 2

DR Dimensionality Reduction

DRIVE Digital Retinal Images for Vessel Extraction

FOV Field-Of-View

KLD Kullback-Leibler Divergence

kPCA kernel Principal Component Analysis

MC Monte Carlo

OCT Optical Coherence Tomography

PCA Principal Component Analysis

RBM Restricted Boltzmann Machine

RGB Red-Green-Blue

ReLU Rectified Linear Unit

SAE Stacked Autoencoders

SE Sensitivity

SGD Stochastic Gradient Descent

SP Specificity

STARE STructured Analysis of the Retina


1 INTRODUCTION

1.1 Background

Evidence-based medicine is a common practice in many subfields of medical science where the diagnostic process is based on evidence obtained by objective examination of the patient through biomedical measurements. The examinations are typically performed using noninvasive control methods [1]. Noninvasive control methods are an important and rapidly developing branch of science and technology which can be used for medical examinations and early diagnostics of eye diseases [2], [3]. In addition to the examination methods, there is also a great demand for automatic data analysis methods that can help experts in early diagnostics [4].

The most common approach for eye examination is colour fundus photography [5].

The imaging setup is a low power microscope with an attached camera. The resulting images are RGB (red-green-blue) images of the posterior pole, retinal vasculature, optic disc and macula. Colour fundus photography is widely used for studying diabetic retinopathy, age-related macular degeneration and cardiovascular diseases [5].

Another more advanced technique for eye examination is optical coherence tomography (OCT) [6]. OCT is a technique based on low-coherence interferometry, and it allows investigating samples of multiple tissues by gathering information about the reflectance of each tissue [7]. The resulting OCT signal has a harmonic or quasi-harmonic nature, and the information about the reflectance is hidden in the amplitude of the signal. Consequently, the resulting tomogram has to be reconstructed from the input signal by envelope estimation, which can be a difficult task in itself [8]. Another important disadvantage of OCT systems is the high cost of the imaging setups and accessories [9]. Typical OCT systems may cost tens of thousands of dollars [10]. Thus, there is a demand for cheaper imaging setups for noninvasive eye examination.

Another approach for eye investigation is spectral fundus imaging [4]. Whereas ordinary fundus cameras give RGB or grayscale images, spectral fundus imaging systems produce hyperspectral images, in which each channel corresponds to a particular spectral band, i.e., each pixel of the image contains information about the reflectance spectrum of the sample. It is well known that different chemical substances have different reflectance or absorbance spectra, so additional features are available for analysis in this case [11]. This fact, together with inexpensiveness and ease of use, has made hyperspectral imaging a popular technique for nondestructive testing [11].

Apart from the imaging techniques, proper technologies have to be chosen for the data analysis in order to automate the diagnostic process. In the case of eye examination, one of the important tasks is retinal image segmentation. In the last two decades, a lot of algorithms for retinal image segmentation have been proposed [12], [13], [14]. Most of them concentrate on blood vessel segmentation, since proper characterization of the blood vessels plays a significant role in the diagnostics of various eye diseases. The majority of the proposed algorithms are based on classical machine learning schemes where handcrafted feature extraction methods are used and trainable classifiers utilize the handcrafted features to produce the segmented image. The main limitation lies in the feature engineering, which is a difficult, expensive and time-consuming process.

Recent developments in artificial neural networks (ANNs) and deep learning provide an effective way for feature learning [15]. In the last decade, there have been a lot of publications about deep learning applications in the fields of machine vision [16] and medical image analysis [17]. Recently, numerous papers describing deep learning based approaches for colour retinal image segmentation [18], [19] have been published, and there has been a significant number of developments in the field of hyperspectral image segmentation [20], [21], [22], [23], [24]. To the best of the author's knowledge, this thesis is the first work which studies the application of deep learning techniques to retinal blood vessel segmentation using spectral fundus images.

1.2 Objectives and delimitations

As a whole, a system for eye examination may consist of a spectral fundus camera, which provides hyperspectral retinal images of a patient's eye, and software which builds a label map of the fundus where each label corresponds to a certain object in the fundus. Thus, the goal of this work is to implement a deep learning system for retinal blood vessel segmentation which may help experts to make the right diagnosis, and to compare the performance of the segmentation system on hyperspectral and colour images.

This thesis is focused on blood vessel segmentation for spectral and colour fundus images only. Segmentation of other objects in fundus images, estimation of blood vessel characteristics and their use for eye disease diagnostics are outside the scope of this work.

1.3 Structure of the thesis

The thesis is organized as follows. Chapter 2 gives the background of the eye structure, the utilization of blood vessel characteristics for eye disease diagnostics, and the spectral retinal image acquisition technique. Chapter 3 describes the semantic image segmentation problem statement, a brief theoretical background in deep learning, and a brief analysis of existing algorithms based on deep learning and neural networks for retinal image segmentation and hyperspectral image analysis. Chapter 4 contains a description of the implemented architecture. Chapter 5 provides the description of the system evaluation and the experimental results. In Chapter 6, a consideration of the implemented system, the obtained results and possible future research is presented. Conclusions are given in Chapter 7.


2 FUNDUS IMAGING

2.1 Eye structure

The eye is an organ of sight which typically has a spherical form and is located in an orbital cavity. The human eye has a complicated structure which is presented in Fig. 1a. Usually three layers of the eyeball are distinguished: the outer fibrous layer, the middle vascular layer, and the inner nervous tissue layer [1].


Fig. 1. (a) Eye structure [1]; (b) Fundus structure [25].

The outer fibrous layer consists of sclera, cornea, and limbus. The sclera is an opaque, fibrous outermost layer which preserves the shape of the eyeball and attaches it to the extraocular muscles. The cornea is a transparent, ellipsoid refractive surface. The limbus is a part which connects the cornea and the sclera [1].

The middle vascular layer contains the iris, the ciliary body and the choroid. The iris is a diaphragm that separates the anterior chamber from the posterior chamber. The iris contains and controls an aperture which is called the pupil. The iris is connected with the middle of the base of the ciliary body, which consists of the ciliary muscles that control the form of the lens and the ciliary epithelium which yields the aqueous humour responsible for the delivery of oxygen and nutrients to the lens and cornea. Another important function of the aqueous humour is the removal of metabolic waste from the lens and cornea. The choroid is a dark, brown layer which consists of connective tissue and provides oxygen and nutrition to the inner nervous tissue layer [1].

The inner nervous tissue layer consists of the retina, the optic disc, and the optic nerve.

The retina is a light-sensitive tissue which is composed of ten nervous layers, nerve fibres and the pigmented epithelial layer. In the posterior part of the retina lies the macula lutea, a pigmented area that consists of photoreceptors with high acuity (cones). The optic disc is a circular disc which consists of the nerve fibre layer. Since there are no light-sensitive cells in the disc, it is also known as the blind spot. The optic nerve extends from the optic disc and transfers the visual information from the retina to the brain [1].

Fundus photography is one of the common techniques for eye examination [1]. In particular, this type of examination can be used for diabetic retinopathy diagnostics [3]. Fig. 1b presents an image of the fundus, in which the macula, optic disc, and blood vessels can be clearly seen.

2.2 Blood vessels characterization

Since many eye diseases may influence the blood vessels in different ways, blood vessel characteristics may give evidence of numerous diseases [1]:

1. Eales' disease is an inflammatory peripheral retinal vasculopathy which has the following signs: tortuous and thick veins and neovascularisation.

2. Retinal telangiectasia is a macular disease caused by retinal exudation. In case of this disease, vessels might be widened and tortuous.

3. Hypertensive retinopathy is a vascular disease caused by high blood pressure (hypertension). Several grades of this disease can be distinguished depending on the types of artery and vein crossings and the width of retinal arterioles.

4. Diabetic retinopathy is one of the most serious complications caused by diabetes mellitus, and it is one of the most frequent causes of blindness. Different types of diabetic retinopathy can be distinguished depending on the vascular changes. In Eva Kohner's classification, these types are distinguished depending on the presence of venous dilations, loops, and varicose formations or neovascularisations.

5. Pigmentary retinal dystrophy is a degenerative retinal disease which affects the rods and cones. Blood vessels in this case become attenuated and thread-like.

Vascular lesions may also give evidence of the presence of different diseases, e.g., central retinal artery occlusion is a lesion related to blocked blood flow (occlusion), and it may be connected with arteriosclerosis, atherosclerosis and Buerger's disease [1]. Another common occlusion is retinal vein occlusion, which often comes with arteriosclerosis, atherosclerosis and diabetes [1].

Thus, the shape, size, arteriovenous ratio and arteriovenous crossing types can be used to obtain evidence of numerous eye diseases. In order to obtain these characteristics, it is beneficial to have segmented images where each pixel has a label indicating whether it belongs to a vessel or not. The problem of retinal blood vessel segmentation is discussed in Chapters 3 and 4. Below, information about the imaging setup utilized to obtain the DiaRetDB2 dataset is given.

2.3 Spectral fundus imaging

Numerous spectral fundus imaging setups have been proposed. The typical setup is an adapted fundus camera with a broadband light source and a spectral device for selecting a spectral band. Fält et al. [26] modified a Canon CR5-45NM fundus camera into a spectral fundus camera by replacing the standard light source with a fibre optic illuminator with a halogen lamp covering the spectrum from 380 to 780 nm; 30 interference filters with a 10 nm step were used for wavelength selection. As a detector, a grayscale charge-coupled device (CCD) camera with an array size of 2048×2048 pixels and 2×2 binning was used. The imaging setup is presented in Fig. 2a.

In the process of fundus imaging, the light from the halogen lamp is guided by the fibre to the fundus camera where its spectrum is reduced to the particular band by the interference filters. After that, the light within the selected wavelength range reaches the eye through the


Fig. 2. (a) The spectral fundus imaging setup; (b) An example of a hyperspectral retinal image. The image was normalized for visualization purposes. [4]

system of lenses and mirrors inside the fundus camera. The light reflected from the fundus then arrives at the camera and is projected onto the CCD sensor. Filtering the light incident on the fundus allows avoiding possible photochemical injuries. To increase the amount of incident light, the patient's pupil is usually dilated with tropicamide.

2.4 DiaRetDB2 dataset

The resulting dataset (Fig. 2b) is a set of 1024×1024 images with 30 channels, where each channel corresponds to a certain wavelength. For each image in the dataset, a field-of-view (FOV) mask is provided. The FOV masks are binary images where white areas correspond to regions of the fundus of the eye. The dataset utilized in this work contains images with diabetic retinopathy lesions, which are presented in Fig. 3.

The key problem in automatic retinal image analysis is the image segmentation task, which should be solved and evaluated using accurate ground truth vessel markings provided by experts. Initially there were no ground truth data available for the segmentation task, and the author had to mark all the blood vessels himself using the provided data, while not being an expert in ophthalmology. Examples of the FOV masks, vessel markings and corresponding colour images are given in Fig. 4.


Fig. 3. Symptoms of diabetic retinopathy: (a) Microaneurysm; (b) Haemorrhages; (c) Neovascularisation; (d) Soft exudate; (e) Hard exudates [3].


Fig. 4. (a) An RGB image from the DiaRetDB2 dataset; (b) An example of the FOV mask where white pixels correspond to the region of interest; (c) An example of the vessel markings where white pixels correspond to vessels.


3 ARTIFICIAL NEURAL NETWORKS BASED SEMANTIC SEGMENTATION ALGORITHMS

3.1 Semantic segmentation

Segmentation is the problem of partitioning an image into similar parts, where the similarity can be defined in terms of colours, textures or other features. Segmentation where the same label is assigned to semantically identical objects is called semantic segmentation.

In the case of retinal images, segmentation may refer to the construction of a map where each pixel is marked with a label corresponding to a certain object, e.g., vessels [18], optic disc [27] or exudates [28]. This work concentrates on blood vessel segmentation, i.e., given images similar to Fig. 4a, the system should produce label maps similar to Fig. 4c. In the case of hyperspectral image segmentation algorithms, the terms classification or pixelwise classification are often used [22]. Typically, semantic segmentation algorithms are based on supervised learning techniques, i.e., a dataset with ground truth data is available and a loss function is minimized to achieve a satisfactory performance.

The classical approaches for semantic segmentation use handcrafted features with supervised trainable classifiers. Feature engineering is a difficult and time-consuming process, and it is not always obvious how to extract useful features from data. Recent developments in machine learning provide different approaches for feature learning [29], [15]. These techniques train not only classifiers but also feature extraction methods, and the majority of the state-of-the-art algorithms use deep learning techniques for this purpose, since it is often claimed that deep learning is a universal approach [30]. Another reason is the ease of implementing deep models, since a lot of modern deep learning frameworks are currently available [31].


3.2 Deep neural networks

3.2.1 Neural networks

An artificial neural network is a parametrized function $f(\mathbf{x}, \boldsymbol{\theta})$ which can be represented as a composition of functions called layers:

$$f(\mathbf{x}, \boldsymbol{\theta}) = \left(f^{(1)} \circ f^{(2)} \circ \dots \circ f^{(L)}\right)(\mathbf{x}, \boldsymbol{\theta}), \qquad (1)$$

where $\mathbf{x}$ is the input of the network, $\boldsymbol{\theta}$ represents the parameters of the model, $\circ$ denotes function composition, $f^{(L)}$ is the layer with index $L$, and $L$ is the total number of layers of the network. The layers may also be parametrized, and the parameters of the model are then represented by the set of layer parameters $\boldsymbol{\theta} = \{\boldsymbol{\theta}^{(1)}, \dots, \boldsymbol{\theta}^{(L_p)}\}$, where $L_p$ is the number of trainable layers. Typically ANNs have different types of layers, and the particular type of layer is chosen depending on the problem. A particular set of layers connected in a particular order is called an architecture.

The classical architecture is a feedforward network which consists of stacked fully connected layers. The output of each layer can be defined as follows [15]:

$$f^{(l)}(\mathbf{x}, \boldsymbol{\theta}^{(l)}) = a^{(l)}\left(\mathbf{W}^{(l)}\mathbf{x} + \mathbf{b}^{(l)}\right), \qquad (2)$$

where $a$ is an activation function, $\mathbf{W}^{(l)}$ is a weight matrix, $\mathbf{b}^{(l)}$ is a bias vector, and $\boldsymbol{\theta}^{(l)} = \{\mathbf{W}^{(l)}, \mathbf{b}^{(l)}\}$ represents the parameters of the layer.

The activation functions usually introduce nonlinearity to the system. A variety of activation functions exist. One of the most used activation functions is the rectified linear unit (ReLU) [32]:

$$a(\mathbf{z})_i = \max(0, z_i), \qquad (3)$$

which is a piecewise linear function: it is equal to zero for negative input values $z_i$ and equal to the input value otherwise. A common choice for layers with categorical outputs is the softmax activation [32]:

$$a(\mathbf{z})_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}, \qquad (4)$$

which gives a categorical distribution of outcomes for the given input. In the case of binary classification, the sigmoid activation function is often used [32]:

$$a(\mathbf{z})_i = \frac{1}{1 + \exp(-z_i)}. \qquad (5)$$
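The three activations above map directly to code. The following is a minimal NumPy sketch of Eqs. 3-5 (an illustration, not code from the thesis); the subtraction of the maximum inside the softmax is a standard numerical safeguard that does not change the result.

```python
import numpy as np

def relu(z):
    # Eq. 3: elementwise max(0, z_i)
    return np.maximum(0.0, z)

def softmax(z):
    # Eq. 4: exp(z_i) / sum_j exp(z_j); shifting by max(z) avoids overflow
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def sigmoid(z):
    # Eq. 5: 1 / (1 + exp(-z_i))
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.5, 3.0])
print(relu(z), softmax(z), sigmoid(z))
```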

The learning of the network is usually performed using gradient-based optimisation methods through the minimization of the negative log-likelihood $\mathcal{L}_D$ [32]. For a given training set represented by a set of inputs $\mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_{N_\text{training}}\}$ and a set of desired outputs $\mathbf{Y} = \{\mathbf{y}_1, \dots, \mathbf{y}_{N_\text{training}}\}$, described by some empirical distribution $\hat{p}_\text{data}$, the optimization problem can be formulated as follows [32]:

$$\hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}} \mathcal{L}_D(\boldsymbol{\theta}), \qquad (6)$$

$$\mathcal{L}_D(\boldsymbol{\theta}) = -\mathbb{E}_{\mathbf{x},\mathbf{y} \sim \hat{p}_\text{data}} \log p_\text{model}(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\theta}), \qquad (7)$$

where $p_\text{model}$ is the distribution described by the model. Depending on the problem, different distributions can be used. In the case of semantic segmentation, cost functions for binary and categorical outputs parametrized by Bernoulli and categorical distributions are typically used. The resulting cost functions are then equivalent to the binary cross entropy and the categorical cross entropy, respectively [32]:

$$\mathcal{L}_D(\boldsymbol{\theta}) = -\mathbb{E}_{y \sim \hat{p}_\text{data},\, \hat{y} \sim p_\text{model}} \left[ y \log \hat{y} + (1 - y)\log(1 - \hat{y}) \right], \qquad (8)$$

$$\mathcal{L}_D(\boldsymbol{\theta}) = -\mathbb{E}_{\mathbf{y} \sim \hat{p}_\text{data},\, \hat{\mathbf{y}} \sim p_\text{model}} \left[ \sum_c y_c \log \hat{y}_c + (1 - y_c)\log(1 - \hat{y}_c) \right], \qquad (9)$$

where $\hat{y}$ and $\hat{\mathbf{y}}$ are the outputs of the binary and categorical classifiers, respectively, and $c$ is an index over the categories.

The minimization task in Eq. 6 can be performed using the gradient descent algorithm [32]:

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \epsilon \nabla_{\boldsymbol{\theta}_t} \mathcal{L}_D(\boldsymbol{\theta}_t) \qquad (10)$$

$$= \boldsymbol{\theta}_t + \epsilon\, \mathbb{E}_{\mathbf{x},\mathbf{y} \sim \hat{p}_\text{data}} \nabla_{\boldsymbol{\theta}_t} \log p_\text{model}(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\theta}_t), \qquad (11)$$

where $\epsilon$ is a learning rate which controls the speed of the gradient descent and $t$ is the index of the current gradient descent iteration. The minimization (Eq. 7) and the gradient calculation (Eq. 11) require calculation of the expected log-likelihood over the whole dataset, which can be computationally very expensive. Instead of calculating the expectation over the whole dataset, the expectation over some small group of examples (a minibatch) can be calculated. This leads to a stochastic gradient estimate, and the minimization algorithm using the gradient estimate obtained from the minibatch expectation is called minibatch gradient descent. It is also possible to use a single sample for the gradient estimation, and in this case the minimization algorithm is called stochastic gradient descent [32]. Thus, the update formula for the stochastic optimisation can be written as [32]

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \epsilon\, \mathbf{g}_t, \qquad (12)$$

$$\mathbf{g}_t = -\sum_{i=0}^{N_M} \nabla_{\boldsymbol{\theta}_t} \log p_\text{model}(\mathbf{y}_i \mid \mathbf{x}_i, \boldsymbol{\theta}_t), \qquad (13)$$

where $\mathbf{g}$ is the stochastic gradient estimate and $N_M$ is the number of samples used to approximate the gradient.
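As an illustration of Eqs. 12-13, the sketch below runs minibatch stochastic gradient optimisation on a toy logistic regression problem. The synthetic data and the helper grad_log_lik are hypothetical stand-ins for the model-specific gradient; they are not part of the thesis implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                  # toy inputs
w_true = np.array([1.0, -2.0, 0.5])
y = (X @ w_true + rng.normal(size=1000) > 0).astype(float)

def grad_log_lik(w, xb, yb):
    # Gradient of the Bernoulli log-likelihood of logistic regression.
    p = 1.0 / (1.0 + np.exp(-xb @ w))
    return xb.T @ (yb - p)

w = np.zeros(3)
epsilon, batch_size = 0.05, 5                   # learning rate and N_M
for t in range(200):
    idx = rng.choice(len(X), batch_size, replace=False)
    g = grad_log_lik(w, X[idx], y[idx])         # stochastic estimate (Eq. 13)
    w = w + epsilon * g                         # ascent on the log-likelihood (Eq. 11)
print(w)
```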

In 1957, Kolmogorov [33] proved that any continuous multivariate function can be represented by a superposition of continuous functions of one variable and addition, and later Cybenko [34] proved a similar theorem for the sigmoid function. Thus, it is nowadays often claimed that neural networks are universal approximators, i.e., it is possible to approximate any continuous function by a neural network [30] with a certain number of layers. In many cases it is necessary to approximate very complicated functions, and for this purpose neural networks with a large number of layers are used. These neural networks are called deep neural networks (DNNs).

Training the output layers and calculating the gradient at the output layer are trivial, but it is also necessary to optimize the parameters of the intermediate (hidden) layers. In order to calculate the gradient at the hidden layers, the algorithm known as backpropagation is usually used, which is simply an application of the chain rule for derivatives to computational graphs.

Historically, the first deep neural networks suffered from the vanishing gradient problem, which consists in exponentially decreasing gradients with increasing distance from the output layer [15]. This problem is caused by the application of certain activation functions, such as the sigmoid function or the hyperbolic tangent function, which map the input to a very small range. Consequently, the gradient of the activation changes only slightly even if the changes in the input are significant. Historically, greedy pretraining techniques (described in 3.2.3) were utilized to avoid the problem. Nowadays the common way to overcome the problem is to use the ReLU activation (Eq. 3) and batch normalization (described in 3.2.4) [32].

3.2.2 Convolutional neural networks

Another useful type of feedforward neural network is the convolutional neural network, a type of architecture inspired by the human visual cortex. Whereas the fully connected layers perform a composition of an activation function and a weighted sum of inputs, the convolutional layers perform a composition of an activation function and the convolution of inputs with trainable kernels. Thus, the convolutional layer is a trainable bank of filters with some activation, and the output of the layer is a set of feature maps. The feature maps can be calculated with the following formula [32]:

$$f_k^{(l)}(\mathbf{x}, \boldsymbol{\theta}^{(l)}) = a^{(l)}\left(\mathbf{W}_k^{(l)} * \mathbf{x} + \mathbf{b}_k^{(l)}\right), \qquad (14)$$

where $k$ is the index of the feature map and $*$ is the convolution operator.

The training of convolutional neural networks (CNNs) is almost the same as the training of fully connected networks, except for the gradient calculation in backpropagation. As a result of the training, it is possible to obtain hierarchical feature extractors, e.g., the first layer may extract low level features such as edges, and the further layers may extract corners, textures and complicated object-specific features.

Quite often the convolutional layer is followed by a max-pooling layer, which reduces the size of the feature map by dividing it into blocks and replacing each block with its maximum value.
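In Keras, the framework used later in this work, one such convolutional block (Eq. 14 followed by max-pooling) can be written as below; the filter count and kernel size are arbitrary illustration values, not the values used in the implemented architecture.

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D

# A trainable 3x3 filter bank with ReLU activation (Eq. 14),
# followed by 2x2 max-pooling that halves the spatial dimensions.
model = Sequential([
    Conv2D(16, (3, 3), activation='relu', padding='same',
           input_shape=(256, 256, 30)),
    MaxPooling2D(pool_size=(2, 2)),
])
model.summary()  # feature maps: 256x256x16 -> 128x128x16
```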

3.2.3 Greedy pretraining

As mentioned above, neural networks can be trained in a supervised manner. Nevertheless, it is a common situation that a part of the dataset is unlabeled and, consequently, cannot be used for training. One of the common solutions to this problem is greedy unsupervised layerwise pretraining [32].

The principle behind greedy pretraining is unsupervised training of the network so that it can reconstruct the input from some hidden representation. It is called greedy because the training is performed layer-wise, and only the cost of the current layer is minimized [32]. The two most commonly used architectures for greedy pretraining are restricted Boltzmann machines (RBMs) and autoencoders (AEs).

The restricted Boltzmann machine can be described as a Markov random field defined on a bipartite graph such that one of the disjoint sets represents the input (visible) variables, and the other one represents the hidden variables. Since the RBM is a Markov random field, a joint distribution over the hidden and visible variables can be defined through the Gibbs distribution with an energy function [15]:

$$U(\mathbf{x}, \mathbf{h}) = \sum_j \phi_j(x_j) + \sum_i \xi_i(h_i) + \sum_{ij} \eta_{ij}(h_i, x_j), \qquad (15)$$

where the particular form of the binary potentials $\eta$ and the unary potentials $\phi, \xi$ depends on the type of the input and hidden variables. The training is usually performed through the minimization of the free energy function [15]:

$$U_F(\mathbf{x}) = -\log \sum_{\mathbf{h}} \exp\left(-\sum_{ij} \eta_{ij}(h_i, x_j)\right). \qquad (16)$$

The free energy is usually minimized with a stochastic optimisation algorithm based on Gibbs sampling, which is called contrastive divergence (CD-k).

RBMs can be stacked into a deep belief network (DBN). The training is performed by unsupervised training of an RBM on the input data; the next RBM is then trained on the outputs of the previous RBM, and this process is repeated recursively until the desired number of layers is reached [35].

Autoencoders (AEs) are neural networks which consist of two parts: an encoder, which constructs a hidden representation of the input, and a decoder, which reconstructs the input from the given hidden representation. They are similar to RBMs in that the main goal is to find hidden representations which can precisely describe the input data. Unlike the RBM, the training of AEs is performed by the maximization of the mutual information, which is a measure of the mutual dependence between two random variables:

$$I(\mathbf{x}, \hat{\mathbf{x}}) = \int_{\mathbf{x}} \int_{\hat{\mathbf{x}}} p(\mathbf{x}, \hat{\mathbf{x}}) \log \frac{p(\mathbf{x}, \hat{\mathbf{x}})}{p(\mathbf{x})\, p(\hat{\mathbf{x}})} \, d\mathbf{x}\, d\hat{\mathbf{x}}, \qquad (17)$$

where $\mathbf{x}$ is the input sample and $\hat{\mathbf{x}}$ is the output of the AE. This criterion is also known as the infomax criterion. Depending on the distribution of the input variables, it yields the minimization of different cost functions [32]. AEs can be stacked into stacked autoencoders (SAEs), which can be trained in a similar manner as the DBN.

Both RBMs and AEs are techniques for unsupervised feature learning. Historically, they were the first solutions to the vanishing gradient problem. Both techniques have convolutional versions, and they are widely used for unsupervised pretraining [32].

3.2.4 Batch normalization

The success of the training process significantly depends on the ability of the deep neural network to adapt to the input data, and one of the serious obstacles here is the internal covariate shift [36]. The internal covariate shift is a problem related to the changing distribution of activations due to changes in the distribution of inputs, which might slow down the training of the network. In order to reduce the covariate shift, Ioffe and Szegedy [36] proposed to use batch normalization layers, which simply perform the standardization of each input minibatch while collecting statistics of the input data during the training process. The collected statistics are then used at test time. As shown in [36], batch normalization accelerates the training and also acts as a regularizer, keeping the weights of the network small.
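A minimal NumPy sketch of what a batch normalization layer computes during training, assuming an exponential moving average for the collected statistics (the momentum value is an illustrative choice, and the trainable scale and shift parameters of the full layer are omitted for brevity):

```python
import numpy as np

def batch_norm_train(x, running_mean, running_var, momentum=0.9, eps=1e-5):
    # Standardize the current minibatch per feature ...
    mu, var = x.mean(axis=0), x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    # ... and accumulate statistics for use at test time.
    running_mean = momentum * running_mean + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    return x_hat, running_mean, running_var

x = np.random.randn(32, 8) * 3.0 + 5.0          # a minibatch of 32 samples
x_hat, rm, rv = batch_norm_train(x, np.zeros(8), np.ones(8))
print(x_hat.mean(axis=0).round(3), x_hat.std(axis=0).round(3))
```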

3.2.5 Dropout

Deep learning models have a lot of parameters, which implies a substantial chance of encountering the overfitting problem: good adaptation to the training data combined with bad performance on data not included in the training set. Numerous methods for increasing the generalization ability of deep learning models exist, e.g., weight decay is a common norm-based regularization technique which adds a regularization term proportional to the norm of the parameters to the minimized cost function. It is an effective technique, and it has a strong connection with Bayesian methods, i.e., adding the regularization term implies assumptions about the prior distributions of the model parameters [32].

Nevertheless, sometimes weight decay alone is not enough to prevent overfitting. Another technique called dropout is often used together with weight decay, and it has been shown that this can be more effective than applying just weight decay [37]. The mechanism of dropout is very simple: the idea is to add multiplicative noise to the inputs of each layer of the model during the training phase only, so that the output of the layer becomes

$$\mathbf{y}^{(l)} = f^{(l)}\left(\mathbf{M} \odot \mathbf{x}, \boldsymbol{\theta}^{(l)}\right), \quad m \sim p_M(m), \qquad (18)$$

where $\mathbf{M}$ is a noise tensor, $m$ denotes the elements of the noise tensor, which are distributed according to $p_M$, and $\odot$ denotes the Hadamard product. There are two common choices for the noise distribution, Bernoulli and Gaussian [37]:

$$m \sim \text{Bernoulli}(r), \qquad (19)$$

$$m \sim \mathcal{N}(1, 1), \qquad (20)$$

where $\text{Bernoulli}(r)$ denotes a Bernoulli distribution with probability $r$, which in this case is commonly called the dropout rate, and $\mathcal{N}(1, 1)$ is a Gaussian distribution with mean $\mu_D = 1$ and variance $\sigma_D^2 = 1$. It has been shown that both techniques can be applied equally effectively in many cases [37]. It is also common to use the following scaling in dropout [37]:

$$m \sim \frac{1}{1 - r}\,\text{Bernoulli}(r), \qquad (21)$$

$$m \sim \mathcal{N}\left(1, \frac{1}{1 - r}\right), \qquad (22)$$

which is referred to as weight averaging, and it implies that the expected output of the network remains the same [37].
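The noise models of Eqs. 19-22 can be sampled directly; the sketch below follows Eq. 21 literally, so units are kept with probability r and the kept activations are rescaled by 1/(1−r). This is an illustration of the mechanism only, not the thesis implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, r=0.5, gaussian=False):
    # Multiplicative noise applied to a layer input during training.
    if gaussian:
        m = rng.normal(1.0, np.sqrt(1.0 / (1.0 - r)), size=x.shape)  # Eq. 22
    else:
        m = rng.binomial(1, r, size=x.shape) / (1.0 - r)             # Eq. 21
    return m * x

x = np.ones((4, 6))
print(dropout(x))   # roughly half the entries zeroed, the rest scaled by 2
```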

The reasons for the effectiveness of the dropout regularization technique have been a hot discussion topic [38] in the deep learning community. Recently it has been shown that dropout can be treated as a technique of stochastic variational inference in Bayesian deep learning models [39].

3.2.6 Bayesian deep learning

Above, deep learning systems performing maximum likelihood inference and trained with the maximum likelihood method were discussed. The described models perform well on big datasets and have simple mathematical explanations. However, this treatment does not allow effective estimation of the uncertainty of the results, which may be a crucial part of making decisions based on the outputs of deep learning systems. Moreover, although classical deep learning models scale well enough for big data, they often have poor performance on small datasets. It has been shown that a Bayesian treatment of deep learning models may help to overcome these obstacles [38].

A key point of Bayesian methods is treating the parameters of the model as a random variable and considering their posterior distribution [38]:

$$p(\boldsymbol{\theta} \mid \mathbf{X}, \mathbf{Y}) = \frac{p(\mathbf{Y} \mid \mathbf{X}, \boldsymbol{\theta})\, p(\boldsymbol{\theta})}{\int_{\boldsymbol{\theta}} p(\mathbf{Y} \mid \mathbf{X}, \boldsymbol{\theta})\, p(\boldsymbol{\theta})\, d\boldsymbol{\theta}}. \qquad (23)$$

Given the described posterior and a new sample $\mathbf{x}$, it is possible to define the posterior predictive distribution over the variable of interest $\mathbf{y}$ [38]:

$$p(\mathbf{y} \mid \mathbf{x}, \mathbf{X}, \mathbf{Y}) = \int_{\boldsymbol{\theta}} p(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \mathbf{X}, \mathbf{Y})\, d\boldsymbol{\theta}. \qquad (24)$$

Thus, Bayesian inference averages (marginalizes) over the model parameters in order to obtain the posterior distribution of the variable of interest, whereas the maximum likelihood treatment provides only point estimates of the parameters and of the variable of interest. To obtain a point estimate and to estimate the uncertainty, the mean and the variance of the posterior predictive can be used.

Typically the integrals in Eq. 23 and 24 are intractable due to the complexity and nonlinearity of the model. The common solution to the inference and training problems in complex models is to use variational approximations. The main idea is to replace the complicated posterior $p(\boldsymbol{\theta} \mid \mathbf{X}, \mathbf{Y})$ (Eq. 23) by a simpler approximation $\hat{q}_{\boldsymbol{\phi}}(\boldsymbol{\theta})$ and to minimize the Kullback-Leibler divergence (KLD), which can be treated as an asymmetric measure of similarity between two distributions [38]:

$$\mathrm{KL}\left(q_{\boldsymbol{\phi}}(\boldsymbol{\theta}) \,\|\, p(\boldsymbol{\theta} \mid \mathbf{X}, \mathbf{Y})\right) = \int_{\boldsymbol{\theta}} q_{\boldsymbol{\phi}}(\boldsymbol{\theta}) \log \frac{q_{\boldsymbol{\phi}}(\boldsymbol{\theta})}{p(\boldsymbol{\theta} \mid \mathbf{X}, \mathbf{Y})} \, d\boldsymbol{\theta}, \qquad (25)$$

$$\hat{q}_{\boldsymbol{\phi}}(\boldsymbol{\theta}) = \arg\min_{q} \mathrm{KL}\left(q_{\boldsymbol{\phi}}(\boldsymbol{\theta}) \,\|\, p(\boldsymbol{\theta} \mid \mathbf{X}, \mathbf{Y})\right), \qquad (26)$$

where $\boldsymbol{\phi}$ denotes the parameters of the variational posterior approximation, and its particular form depends on the chosen form of the approximating posterior. For the given variational approximation of the posterior distribution of the parameters, the approximated posterior predictive $\hat{q}_{\boldsymbol{\phi}}$ can be defined as follows [38]:

$$\hat{q}_{\boldsymbol{\phi}}(\mathbf{y} \mid \mathbf{x}, \mathbf{X}, \mathbf{Y}) = \int_{\boldsymbol{\theta}} p(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\theta})\, \hat{q}_{\boldsymbol{\phi}}(\boldsymbol{\theta})\, d\boldsymbol{\theta}. \qquad (27)$$

Optimizing Eq. 26 directly is troublesome, and usually the evidence lower bound (ELBO) is maximized instead. The relation between the KLD and the ELBO can be expressed as follows:

$$\begin{aligned}
\mathrm{KL}\left(q_{\boldsymbol{\phi}}(\boldsymbol{\theta}) \,\|\, p(\boldsymbol{\theta} \mid \mathbf{X}, \mathbf{Y})\right) &= \int_{\boldsymbol{\theta}} q_{\boldsymbol{\phi}}(\boldsymbol{\theta}) \log \frac{q_{\boldsymbol{\phi}}(\boldsymbol{\theta})}{p(\boldsymbol{\theta} \mid \mathbf{X}, \mathbf{Y})}\, d\boldsymbol{\theta} \\
&= \int_{\boldsymbol{\theta}} q_{\boldsymbol{\phi}}(\boldsymbol{\theta}) \log \frac{q_{\boldsymbol{\phi}}(\boldsymbol{\theta})\, p(\mathbf{Y} \mid \mathbf{X})}{p(\boldsymbol{\theta}, \mathbf{X}, \mathbf{Y})}\, d\boldsymbol{\theta} \\
&= \int_{\boldsymbol{\theta}} q_{\boldsymbol{\phi}}(\boldsymbol{\theta}) \log \frac{q_{\boldsymbol{\phi}}(\boldsymbol{\theta})}{p(\boldsymbol{\theta}, \mathbf{X}, \mathbf{Y})}\, d\boldsymbol{\theta} + \log p(\mathbf{Y} \mid \mathbf{X}) \\
&= \int_{\boldsymbol{\theta}} q_{\boldsymbol{\phi}}(\boldsymbol{\theta}) \log \frac{q_{\boldsymbol{\phi}}(\boldsymbol{\theta})}{p(\boldsymbol{\theta})}\, d\boldsymbol{\theta} - \int_{\boldsymbol{\theta}} q_{\boldsymbol{\phi}}(\boldsymbol{\theta}) \log p(\mathbf{Y} \mid \mathbf{X}, \boldsymbol{\theta})\, d\boldsymbol{\theta} + \log p(\mathbf{Y} \mid \mathbf{X}) \\
&= -\mathcal{L}_{VB}(\boldsymbol{\phi}) + \log p(\mathbf{Y} \mid \mathbf{X}). \qquad (28)
\end{aligned}$$

Since $\log p(\mathbf{Y} \mid \mathbf{X})$ is a constant and the KLD is non-negative, $\mathcal{L}_{VB}$ is a lower bound of the marginal log-likelihood [38], and the maximization of the ELBO is equivalent to the minimization of the KLD:

$$\log p(\mathbf{Y} \mid \mathbf{X}) = \mathrm{KL}\left(q_{\boldsymbol{\phi}}(\boldsymbol{\theta}) \,\|\, p(\boldsymbol{\theta} \mid \mathbf{X}, \mathbf{Y})\right) + \mathcal{L}_{VB}(\boldsymbol{\phi}) \geq \mathcal{L}_{VB}(\boldsymbol{\phi}). \qquad (29)$$

Thus, the following optimization problem is solved [38]:

$$\hat{\boldsymbol{\phi}} = \arg\max_{\boldsymbol{\phi}} \mathcal{L}_{VB}(\boldsymbol{\phi}) \qquad (30)$$


by the maximization of the ELBO [38]:

$$\mathcal{L}_{VB}(\boldsymbol{\phi}) = \int_{\boldsymbol{\theta}} q_{\boldsymbol{\phi}}(\boldsymbol{\theta}) \log p(\mathbf{Y} \mid \mathbf{X}, \boldsymbol{\theta})\, d\boldsymbol{\theta} - \mathrm{KL}\left(q_{\boldsymbol{\phi}}(\boldsymbol{\theta}) \,\|\, p(\boldsymbol{\theta})\right) = \sum_i \int_{\boldsymbol{\theta}} q_{\boldsymbol{\phi}}(\boldsymbol{\theta}) \log p(\mathbf{y}_i \mid \mathbf{x}_i, \boldsymbol{\theta})\, d\boldsymbol{\theta} - \mathrm{KL}\left(q_{\boldsymbol{\phi}}(\boldsymbol{\theta}) \,\|\, p(\boldsymbol{\theta})\right), \qquad (31)$$

which is similar to the maximum likelihood cost function (Eq. 7) except for the marginalization over the parameters of the model and the KLD term. The maximization of the first term (the expected log-likelihood) in Eq. 31 improves the quality of the model fit, and the second term can be treated as a regularization term which keeps the approximating posterior close to the prior distribution.

Similar to the situation with the maximum likelihood function, the maximization of the ELBO involves calculation of intractable expectations. The natural way of solving this kind of problem in Bayesian methods is to utilize Monte Carlo integration in order to approximate the intractable expectations. The resulting algorithm can then be treated as a stochastic variational optimization algorithm. Various sampling techniques exist, and one of the simplest is to use the dropout regularization as a sampling method and apply classical stochastic optimisation techniques to fit the parameters. In [38] it has been shown that the resulting training algorithm can be treated as a stochastic variational Bayesian method.

The particular form of the KLD term in Eq. 31 depends on the choice of the prior distribution. In [38], it has been shown that the prior over the model parameters can be defined as an uninformative uniform prior or as a prior corresponding to a certain type of weight decay regularization.

In the classical treatment of dropout, it is assumed that dropout is used only during the training phase, and using it in the testing phase only decreases the performance of the model. From the Bayesian point of view, by applying dropout at test time one can sample the outputs of the model directly from the approximate variational posterior predictive (Eq. 27), obtain the point estimate of the output as the mean of these samples, and estimate the uncertainty as the variance of the samples.
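A sketch of this test-time procedure for a Keras model that contains dropout layers: forcing the learning phase to 1 is one way to keep dropout active during prediction with the TensorFlow backend, so that repeated forward passes sample from the approximate posterior predictive (Eq. 27). The names model and x are assumed to be defined elsewhere.

```python
import numpy as np
from keras import backend as K

def mc_dropout_predict(model, x, n_samples=50):
    # Each call with learning_phase=1 is a stochastic forward pass.
    f = K.function([model.input, K.learning_phase()], [model.output])
    samples = np.stack([f([x, 1])[0] for _ in range(n_samples)])
    mean = samples.mean(axis=0)   # point estimate of the output
    var = samples.var(axis=0)     # uncertainty estimate
    return mean, var
```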

Using dropout as a sampling method, one can also define the parameters $\boldsymbol{\phi}$ of the variational posterior distribution using the following parametrization [38]:

$$f_m^{(l)}\left(\mathbf{x}, \boldsymbol{\theta}^{(l)}, \mathbf{M}\right) = f^{(l)}\left(\mathbf{M} \odot \mathbf{x}, \boldsymbol{\theta}^{(l)}\right) = f^{(l)}\left(\mathbf{x}, n\left(\mathbf{M}, \boldsymbol{\theta}^{(l)}\right)\right), \qquad (32)$$

where $n$ is a function of the multiplicative noise and the parameters, and its particular form depends on the type of the layer. The variational posterior with its parameters can then be defined as [38]:

$$\boldsymbol{\phi} = \left\{\boldsymbol{\phi}^{(l)}\right\}_{l=1}^{L}, \quad \boldsymbol{\phi}^{(l)} = n\left(\mathbf{M}, \boldsymbol{\theta}^{(l)}\right), \qquad (33)$$

$$q_{\boldsymbol{\phi}}(\boldsymbol{\theta}) = \prod_l q_l\left(\boldsymbol{\phi}^{(l)}\right). \qquad (34)$$

3.3 Retinal image segmentation

The majority of the neural network based algorithms for retinal image segmentation are based on classical schemes with handcrafted features and low level initial segmentation methods. In the works [12], [14], [28], [40], [41], [42], an initial segmentation was performed, and neural networks were used for the classification of the objects of interest, such as blood vessels, the optic disc and exudates; feedforward fully connected architectures were used. Another group of algorithms can be distinguished where banks of Gabor filters and spectral features are used [43], [44], [45]. These methods are similar to convolutional neural networks except that the feature extractors are not trainable and are linear, whereas convolutional neural networks provide trainable kernels with nonlinear activations.

Wang et al. [18] proposed a convolutional neural network architecture for blood vessel segmentation. The CNN was trained for hierarchical feature extraction, and random forests were used for classification. All features from all pooling layers and the last fully connected layer were utilized by the random forests. The authors obtained an accuracy of 0.9733 on the DRIVE database [46] and 0.9813 on the STARE database [47].

Grinsven et al. proposed a 9-layer CNN architecture [19] for haemorrhage detection on the fundus images from the Kaggle dataset [48]. Standard stochastic gradient training with selective sampling was performed. The mechanism of selective sampling is based on the principle that representative samples have a higher probability of being chosen for training; here, the classification error was used as a measure of representativeness. The authors obtained a sensitivity of 0.931 and a specificity of 0.915.

Li et al. [49] described a cross-modality learning approach for vessel segmentation. The proposed architecture is a four layer fully connected architecture pretrained with autoencoders, and it is based on the principle that the input image and the segmented image have the same hidden representation. Li et al. achieved a sensitivity of 0.75, a specificity of 0.98 and an AUC of 0.97 on the DRIVE and STARE databases.

3.4 Hyperspectral image segmentation

The progress of deep learning applications in hyperspectral image analysis is more intensive than in retinal image analysis, and a variety of architectures exists. Below, some state-of-the-art architectures are discussed.

Liu et al. [20] proposed a DBN architecture with four hidden layers for remote sensing image classification with active learning. The feature learning was performed through unsupervised greedy pretraining, and supervised fine-tuning using SGD with backpropagation was then used to train the network for classification. The neural network takes the spectra as inputs, and the segmented image is composed of the labels predicted by the network. Active learning is similar to selective sampling, and it is also used to choose the most informative samples from the dataset and to construct a dictionary of the most useful samples. The proposed method effectively utilizes the spectral information, but the spatial information is ignored.

Zhao et al. [21] proposed a system which utilizes both spectral and spatial information. The spectral features are extracted using balanced local discriminant embedding (BLDE), a dimensionality reduction method that tries to maximize the local between-class margin through the optimization of Fisher's ratio objective function. The spatial information is captured by a CNN which is trained on patches of the given images, obtained by performing principal component analysis (PCA) on the spectra. The CNN was trained to classify patches, but it is used for feature extraction. The features from the last layer of the CNN are combined with the features obtained by BLDE, and the resulting feature vectors are classified using a logistic regression classifier.

Romero et al. [22] used a similar approach based on PCA, kernel PCA (kPCA) and convolutional neural networks with sparse unsupervised pretraining on the image patches. The described approach is the classical deep learning scheme for hyperspectral image analysis. The main disadvantage of the methods proposed in [21] and [22] is that the dimensionality reduction (DR) methods and the CNNs are trained separately, and the errors from the CNN are not backpropagated to the DR methods to improve the overall performance.

Ma et al. [23] proposed a contextual deep learning system based on SAEs. The system utilizes both the spectral and spatial information. The spatial information is captured by merging the information from adjacent pixels and passing it to the SAE, which further utilizes the trained spectral feature extractors. The output layer is a softmax regression layer which is used for classification and trained in a supervised manner with SGD. The overall network is fine-tuned in the supervised learning step.


4 CONVOLUTIONAL ENCODER-DECODER ARCHITECTURE FOR RETINAL BLOOD VESSELS SEGMENTATION

4.1 SegNet architecture

SegNet is a state of the art convolutional encoder-decoder architecture for semantic segmentation proposed by Badrinarayanan et al. [50], which was initially used for scene understanding tasks. The encoder-decoder architecture consists of two parts: the encoder and the decoder (Fig. 5). The main idea is that the input and the segmented image have the same hidden representation. Thus, during the training it is necessary to train the encoder so that it finds a way to encode the input such that the decoder can infer the label map from the code produced by the encoder. Since the encoder and the decoder are parts of the same network, both of them can be trained using a single CE loss function.

Fig. 5. SegNet architecture. The black arrows show the inference directions, and the red arrows represent backpropagation directions.

Both the encoder and the decoder are convolutional neural networks, each containing a certain number of blocks. Each block contains a convolutional layer, a batch normalization layer and a ReLU activation. Each block of the encoder contains a max-pooling layer after the ReLU activation, whereas each decoding block contains an upsampling layer as the output layer. Usually the encoder sequentially increases the dimensionality of the propagated signal in the filter space and reduces the spatial dimensionality; the goal of the decoder is then to reconstruct the probability map so that the spatial dimensionality of the output map is equal to the spatial dimensionality of the input image, and each pixel contains the probability of the presence of the certain object in this pixel. Thus, the dimensionalities of the decoding blocks are equal to the dimensionalities of the encoding blocks in reversed order (Fig. 6). The output layer of the decoder is a convolutional layer with the sigmoid activation function, which reduces the dimensionality of the output signal to a single image, the probability map.

The described architecture can be easily adapted for retinal blood vessel segmentation for both colour and hyperspectral images, i.e., the network takes the RGB or hyperspectral image as the input and produces a probability map which indicates the probability with which each pixel contains a vessel. Since some areas of the images do not contain information about the fundus, the output probability map is multiplied by the corresponding FOV mask. After that, a simple thresholding scheme can be used to obtain the segmented image, i.e., all the pixels whose likelihood is greater than a manually set threshold $T_c$ are marked as vessels; otherwise they are marked as background. Here the threshold value $T_c = 0.125$ has been used, since it has been found empirically that this threshold provides satisfactory results in most cases.


Fig. 6. (a) The encoder, where the Conv-EC block consists of the convolutional layer with the specified shape, a batch normalization layer, ReLU activation and max-pooling (2×2); (b) The decoder, where the Conv-DC block consists of the convolutional layer with the specified shape, a batch normalization layer, ReLU activation and upsampling (2×2).
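The postprocessing described above amounts to a mask multiplication followed by a fixed threshold; a minimal NumPy sketch (the probability map and FOV mask are assumed to be given):

```python
import numpy as np

def segment(prob_map, fov_mask, t_c=0.125):
    # Suppress predictions outside the field of view, then mark pixels
    # with probability greater than T_c as vessels.
    masked = prob_map * fov_mask
    return (masked > t_c).astype(np.uint8)

# prob_map: network output in [0, 1]; fov_mask: binary FOV image
# label_map = segment(prob_map, fov_mask)
```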


4.2 Dimensionality reduction layers

In many standard approaches to hyperspectral data analysis, the first step is the dimensionality reduction of the spectral information [11]. This approach has also been successfully utilized with deep learning methods [21], [22]. Usually, dimensionality reduction methods are applied to reduce the number of trainable parameters in deep neural networks. However, the application of classical dimensionality reduction methods is quite unnatural in deep learning, since the whole system is then trained using two cost functions: one for the network and another for the dimensionality reduction method, and the errors from the network are usually not backpropagated to the dimensionality reduction method (Fig. 7). Furthermore, it is quite unclear whether the dimensionality reduction method significantly helps the network to segment images and to minimize the segmentation loss function.

Fig. 7. Schematic representation of systems where classical DR methods are used with deep neural networks.

It is well known that classical DR methods such as PCA or Independent Component Analysis can be formulated as fully connected layers without activations, since both of these methods just perform a linear transformation of the input data. A natural incorporation of these methods is to include this transformation into the architecture with some activations and train the resulting system end-to-end, i.e., to backpropagate the errors from the network to the DR layers. Using the formalism of CNNs, this idea can be formulated as adding 1×1 convolutional layers with some activation. Thus, the DR methods can be naturally included into deep neural networks, and the parameters of the network and of the DR layers can be trained to minimize the same loss function. This approach was successfully applied in [51] for hyperspectral image classification.


Following this idea, the SegNet architecture can be adapted for hyperspectral images by including DR layers in the network, as shown in Fig. 8. Here the DR blocks are implemented in a manner similar to the encoder and decoder blocks, i.e., each block contains a 1×1 convolutional layer, batch normalization and ReLU activation, but no max-pooling layer. In this work, the resulting architecture is referred to as DR-SegNet.

Fig. 8. DR-SegNet architecture.
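A sketch of how such DR blocks could look in Keras: 1×1 convolutions acting as trainable per-pixel spectral transforms, each followed by batch normalization and ReLU, with no pooling. The channel counts follow the DR-16-8-3 naming; the exact layer configuration of the thesis implementation may differ.

```python
from keras.models import Sequential
from keras.layers import Conv2D, BatchNormalization, Activation

def add_dr_block(model, n_channels, **kwargs):
    # 1x1 convolution = a trainable linear spectral transform per pixel,
    # made nonlinear by the activation; DR blocks have no max-pooling.
    model.add(Conv2D(n_channels, (1, 1), padding='same', **kwargs))
    model.add(BatchNormalization())
    model.add(Activation('relu'))

model = Sequential()
add_dr_block(model, 16, input_shape=(256, 256, 30))  # 30 -> 16
add_dr_block(model, 8)                               # 16 -> 8
add_dr_block(model, 3)                               # 8  -> 3
# ... the SegNet encoder-decoder follows here
```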

4.3 Regularizations

In order to avoid overfitting, $\ell_2$ regularization and dropout were used. Applying $\ell_2$ regularization to each layer implies the following prior distribution over the parameters of the model:

$$p\left(\boldsymbol{\theta}^{(l)}\right) = \mathcal{N}\left(0, \sigma_w^2 \mathbf{I}\right), \qquad (35)$$

where $\sigma_w^2$ is a tunable parameter controlling the regularization strength, which is set to $\sigma_w^2 = 10^{-4}$.

Each encoder and decoder block has a dropout layer. Since it has been found that using dropout in the DR layers does not help to improve the generalization but only decreases the performance of the model, dropout layers have not been used in the DR layers. Experiments with the classical dropout with weight averaging (Eq. 22) and Monte Carlo dropout (MC dropout) have been carried out. By default, classical dropout with weight averaging is used. The experiments with MC dropout are described in Section 5.4, since MC dropout was used only for uncertainty estimation. The dropout rate used in this work is $r = 0.5$.

4.4 Data preprocessing

As a simple preprocessing step, image normalization was used, i.e., the following formula was applied to all the images:

$$\mathbf{x} = 255\, \frac{\mathbf{x} - \min(\mathbf{x})}{\max(\mathbf{x}) - \min(\mathbf{x})}. \qquad (36)$$

After the normalization step, the contrast enhancement algorithm known as contrast limited adaptive histogram equalization (CLAHE) is applied to each channel of the image. CLAHE applies histogram equalization to distinct regions of the image and limits the contrast in order to prevent significant noise amplification [52]. The described architectures show satisfactory performance without any preprocessing, but applying the described preprocessing steps helps to significantly speed up the convergence of the networks.


Fig. 9. (a) Original image; (b) Result of applying the CLAHE algorithm.
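The two preprocessing steps can be sketched with NumPy and OpenCV, the libraries used in Chapter 5. The CLAHE clip limit and tile size below are common defaults, not values reported in the thesis.

```python
import numpy as np
import cv2

def preprocess(image):
    # Eq. 36: min-max normalization to the 0..255 range.
    x = 255.0 * (image - image.min()) / (image.max() - image.min())
    x = x.astype(np.uint8)
    # CLAHE applied independently to each spectral channel.
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    return np.stack([clahe.apply(x[..., c]) for c in range(x.shape[-1])],
                    axis=-1)

# image: a (1024, 1024, 30) hyperspectral array
# enhanced = preprocess(image)
```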

Due to memory limitations, during the training and testing processes all images were divided into patches of size 256×256×30, where the first two numbers define the spatial dimensionality and the last number determines the spectral dimensionality.
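Splitting a 1024×1024×30 image into non-overlapping 256×256×30 patches is a simple slicing operation; a sketch:

```python
import numpy as np

def to_patches(image, size=256):
    # Split a (1024, 1024, 30) image into sixteen (256, 256, 30) patches.
    h, w, _ = image.shape
    patches = [image[i:i + size, j:j + size, :]
               for i in range(0, h, size)
               for j in range(0, w, size)]
    return np.stack(patches)

# patches = to_patches(image)   # shape (16, 256, 256, 30)
```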


4.5 Data augmentation

The DiaRetDB2 dataset contains 55 images, which makes the training of deep neural networks troublesome, since deep learning systems usually require thousands of samples to achieve the desired performance. In order to overcome this problem and to improve the generalization, data augmentation was utilized, i.e., each image was rotated by 90, 180 and 270 degrees and the rotated versions were added to the training set. The whole dataset was divided into a training set with $N_\text{training} = 25$ images and a testing set with $N_\text{testing} = 30$ images. Taking into account the patchwise processing and data augmentation, the total number of training samples is $N_\text{training}^{(b)} = 1600$.
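The rotation-based augmentation can be sketched as follows; np.rot90 performs the 90-degree multiples losslessly, and applying the same rotation to the vessel mask keeps the labels aligned. With 25 training images, 4 orientations and 16 patches per image this yields the 1600 training samples stated above.

```python
import numpy as np

def augment(images, masks):
    # Add the 90, 180 and 270 degree rotations of every training image
    # (and its vessel mask) to the training set.
    aug_x, aug_y = [], []
    for x, y in zip(images, masks):
        for k in range(4):                 # k = 0 keeps the original
            aug_x.append(np.rot90(x, k))
            aug_y.append(np.rot90(y, k))
    return np.stack(aug_x), np.stack(aug_y)
```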

4.6 Optimisation settings

The SegNet architecture for retinal blood vessel segmentation is trained using the binary cross-entropy loss (Eq. 8). This loss function does not take into account the class imbalance in the dataset. For the DiaRetDB2 dataset, it has been found that there are approximately 8 times more background pixels than vessel pixels. In order to improve the performance of the blood vessel classification, the loss function has been scaled for each class: for the background the weight is $w_b = 1$, and for the vessels the weight is $w_v = 8$.
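One way to express this class-weighted binary cross-entropy as a custom Keras loss; this is a sketch of the weighting idea, and the thesis implementation may have realized it differently (e.g., through sample weights).

```python
from keras import backend as K

def weighted_bce(w_background=1.0, w_vessel=8.0):
    # Scales the binary cross-entropy (Eq. 8) per pixel: errors on the
    # rarer vessel class are penalized 8x more than background errors.
    def loss(y_true, y_pred):
        y_pred = K.clip(y_pred, K.epsilon(), 1.0 - K.epsilon())
        ce = -(w_vessel * y_true * K.log(y_pred)
               + w_background * (1.0 - y_true) * K.log(1.0 - y_pred))
        return K.mean(ce)
    return loss
```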

For training the deep neural network, it is necessary to set the learning rate (Eq. 12), which is a difficult tuning problem. Numerous approaches have been proposed to avoid this problem [53]. In this work, the algorithm known as Adadelta was used [54]. Adadelta accumulates statistics of the gradient estimates and the parameter updates, and adapts the latter, applying the following formulas to each dimension of the stochastic gradient tensor $\mathbf{g}_{\boldsymbol{\theta}}^{(t)}$:

$$\mu[g^2]_t = \rho\, \mu[g^2]_{t-1} + (1 - \rho)\, g_t^2, \qquad (37)$$

$$\Delta\theta_t = -\frac{\sigma[\Delta\theta]_{t-1}}{\sigma[g]_t}\, g_t, \qquad (38)$$

$$\mu[\Delta\theta^2]_t = \rho\, \mu[\Delta\theta^2]_{t-1} + (1 - \rho)\, \Delta\theta_t^2, \qquad (39)$$

$$\sigma[g]_t = \sqrt{\mu[g^2]_t + \epsilon}, \qquad (40)$$

where $\mu[g^2]$ and $\mu[\Delta\theta^2]$ are the accumulated statistics, which are initially equal to zero, $\Delta\theta$ is a parameter update, $\rho$ is a learning rate decay, and $\epsilon$ is a constant added to avoid division by zero in Eq. 38. The tunable parameters are usually chosen as $\epsilon = 10^{-8}$ and $\rho = 0.95$. In this work, all the architectures have been trained using minibatch stochastic gradient descent with the Adadelta algorithm and batch size $N_M = 5$, and the batches are randomly sampled so that every image and its transformations are used only once during each epoch.


5 EXPERIMENTAL RESULTS

All experimental results have been obtained by a program implemented in the programming language Python 3.5 using the deep learning framework Keras 2.0 [55] with the TensorFlow 1.0 backend [56], which utilizes the CUDA deep learning library cuDNN 8.0 [57]. For the basic image processing operations and the statistical system evaluation, the OpenCV 3.2.0 [58] and scikit-learn 0.18.1 [59] libraries have been used. The results have been visualized using the matplotlib 2.0.0 library [60]. The computation server of the Machine Vision and Pattern Recognition Laboratory has been utilized for all the computations. The server is equipped with two NVIDIA GeForce GTX TITAN Black graphics processing units, an Intel Xeon CPU E5-2680 central processing unit and 128 gigabytes of random access memory.

5.1 System evaluation

In the case of blood vessel segmentation, the true positives (TP) refer to the number of pixels which are correctly classified as vessels, and the false negatives (FN) denote the number of pixels which are classified as background while they correspond to vessels. Similarly, the true negatives (TN) refer to the number of pixels which are correctly classified as background, and the false positives (FP) denote the number of pixels which are classified as vessels while they correspond to the background.

In order to compare the performance of the architectures, it is common to compare receiver operating characteristic (ROC) curves, which illustrate the true positive and false positive rates with a varying threshold. The following metrics and characteristics are also used (a computation sketch is given after the list):

1. Sensitivity measures the ability of the network to correctly detect the pixels which belong to the vessels:

$$\mathrm{SE} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}. \qquad (41)$$

2. Specificity measures the ability of the network to correctly detect the pixels which do not belong to the vessels, i.e., the background pixels:

$$\mathrm{SP} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}}. \qquad (42)$$

3. Accuracy gives the ratio of correctly classified pixels to the total number of pixels:

$$\mathrm{ACC} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{FN} + \mathrm{TN} + \mathrm{FP}}. \qquad (43)$$

4. Area under the curve (AUC) is the area under the ROC curve; it is a metric of classifier performance which considers all possible thresholds, whereas sensitivity, specificity and accuracy are calculated for some fixed threshold.

5. The number of parameters ($N_\text{parameters}$) is a crucial characteristic which has to be taken into account, since it affects the required computational resources.

6. The training time ($\tau_\text{training}$) is the total amount of time taken by the whole training process.

7. The inference time ($\tau_\text{inference}$) is used to measure the speed performance of the models.

5.2 Comparison of SegNet and DR-SegNet architectures

As stated in Section 4.2, dimensionality reduction layers can be utilized to reduce the number of parameters. Here the influence of the DR layers on the system's performance is investigated, and four different architectures have been tested: spectral SegNet, DR-3-SegNet, DR-16-3-SegNet and DR-16-8-3-SegNet. The first architecture is the basic SegNet architecture which takes the hyperspectral image as the input and does not perform any DR on the data. The three latter architectures are DR-SegNet architectures with different dimensionality reduction layers, e.g., DR-16-8-3-SegNet includes three DR layers which sequentially reduce the spectral dimensionality to 16, 8, and then 3.

All the architectures have been trained for 50 epochs. The evolution of the loss function is illustrated in Fig. 10a, and it is clear that all the architectures reach convergence by the 40th epoch. The ROC curves are shown in Fig. 10b, and a comparison of the evaluation metrics for the architectures is given in Table 1. All the variants of DR-SegNet show almost equal ROC curves and AUC, with the typical value AUC ≈ 0.97. From the results one can see that the spectral SegNet has the largest number of parameters and an acceptable AUC which is comparable with the DR-SegNet architectures. However, comparing the processing time, training time, accuracy and AUC, it is clear that the addition of the DR layers does not lead to significant improvements in either the segmentation results or the processing speed, and might even increase the training time.


Fig. 10. (a) Loss function changes during the training; (b) ROC curves of the considered architectures.

Table 1. Comparison of the performance of the four tested architectures

Architecture   SE     SP     ACC    AUC    N_parameters   τ_training, s   τ_inference, s
DR-3           0.922  0.922  0.922  0.976  1333994        3736            0.71
DR-16-3        0.931  0.905  0.908  0.974  1334512        4149            0.76
DR-16-8-3      0.923  0.913  0.914  0.973  1334656        4522            0.78
Spectral       0.922  0.922  0.922  0.976  1349441        3618            0.77

An example segmentation result for the spectral SegNet is given in Fig. 11. Examples of the segmentation results and of the outputs of the DR layers for the different DR-SegNet architectures are presented in Fig. 12. The resulting label maps are smooth, and it is not necessary to perform any postprocessing steps for noise removal. Nevertheless, some vessels in the resulting label maps are disconnected, since some areas of the vessels are blurred and have low contrast with the background, and no prior information about the connectivity of the vessels was included in the model.


Fig. 11. (a) Original image; (b) Contrast enhanced image; (c) Ground truth label map; (d) Probability map; (e) Segmentation result; (f) Difference between the segmentation result and the ground truth, where white pixels show misclassified pixels. The results are obtained using the spectral SegNet.

The difference images illustrating the misclassified areas are presented in Fig. 11f and Fig. 13. From the figures one can clearly see that the majority of the misclassified pixels are concentrated at the edges of the vessels; this can be explained by inaccurate pixelwise vessel markings. Indeed, the process of marking the dataset is difficult, and it is hardly possible to choose exactly the right area to mark, since the edges are often blurred, and one can easily underestimate or overestimate the width of the vessels. During the investigation of the segmentation results, unmarked vessels were found which had been detected by all the considered neural networks.

Examples of the images with reduced spectral dimensionality are presented in Fig. 12. It is difficult to estimate the quality of the outputs of the DR layers, but it is clear that the images with reduced spectral dimensionality still preserve acceptable contrast
