
Convolutional neural networks in osteoporotic fracture risk prediction using spine DXA images

Tomi Nissinen

Master's Thesis

School of Computing Computer Science

March 2019


UNIVERSITY OF EASTERN FINLAND, Faculty of Science and Forestry, Kuopio School of Computing

Computer Science

Student, Tomi Nissinen: Convolutional neural networks in osteoporotic fracture risk prediction using spine DXA images

Master's Thesis, 64 p.

Supervisors of the Master's Thesis: Prof. Xiao-Zhi Gao and PhD Sami Väänänen

March 2019

Abstract: Osteoporosis is a disease that weakens the skeleton, predisposing patients to bone fractures and serious secondary complications. According to estimates, more than 200 million people worldwide suffer from osteoporosis. With advances in diagnostics, however, osteoporosis can be detected before fractures occur. If the fracture risk is identified in time, preventive treatment and lifestyle changes can improve the patient's quality of life and reduce treatment costs. The diagnosis of osteoporosis is based on a bone density measurement performed with dual-energy X-ray absorptiometry (DXA). An image of the target skeletal site is formed, from which the bone density value is determined. Bone density alone, however, predicts fracture risk only moderately, so alternative prediction models are needed. In recent years, deep learning, and convolutional neural networks in particular, have produced impressive results in image recognition. They have also proven effective in medical and orthopaedic image analysis problems. The aim of this study is to determine whether convolutional neural networks can predict osteoporotic fracture risk from spine DXA images.

The most promising research directions for developing such prediction models are explored experimentally with different architectures. The study also examines how the training speed and transparency problems generally associated with deep learning appear in the analysis of DXA images, and how they could be mitigated.

Keywords: convolutional neural network, deep learning, image recognition, osteoporosis, dual-energy X-ray absorptiometry

ACM classes (ACM Computing Classification System, 2012 version):

Computing methodologies ~ Supervised learning by classification

Computing methodologies ~ Neural networks

Computing methodologies ~ Artificial intelligence


UNIVERSITY OF EASTERN FINLAND, Faculty of Science and Forestry, Kuopio School of Computing

Computer Science

Student, Tomi Nissinen: Convolutional neural networks in osteoporotic fracture risk prediction using spine DXA images

Master’s Thesis, 64 p.

Supervisors of the Master's Thesis: Prof. Xiao-Zhi Gao and PhD Sami Väänänen

March 2019

Abstract: Osteoporosis is a disease that weakens bone strength, causing fractures and life-threatening complications. It has been estimated to affect more than 200 million people worldwide. Improving diagnostic technology and assessment facilities have made it possible to detect the disease before it causes fractures. Early prediction of fracture risk enables preventive actions and lifestyle changes that can improve the patient's quality of life and save costs to society. Diagnosis of osteoporosis is based on measuring the bone density using Dual-energy X-ray absorptiometry (DXA). This technique produces an image representing bone density at the scanned site. However, bone density itself is only a moderate predictor of fracture risk, which creates demand for alternative prediction models. Deep learning, and especially convolutional neural networks, has been the leading image analysis approach in recent years. It has also produced good results in medical image analysis, including some orthopaedic problems. This study seeks to discover if convolutional neural networks can predict osteoporotic fractures from spine DXA images. By experimenting with different network architectures, the study aims to gain an understanding of the most promising design directions for such prediction models. In this context, some practical deep learning challenges such as low training speed and lack of transparency are also addressed.

Keywords: convolutional neural network, deep learning, image recognition, osteoporosis, dual-energy x-ray absorptiometry

CR Categories (ACM Computing Classification System, 2012 version):

Computing methodologies ~ Supervised learning by classification

Computing methodologies ~ Neural networks

Computing methodologies ~ Artificial intelligence


List of Abbreviations

AUC     Area Under the Curve
BMC     Bone Mineral Content
BMD     Bone Mineral Density
CNN     Convolutional Neural Network
CPU     Central Processing Unit
CT      Computed Tomography
DXA     Dual-energy X-ray absorptiometry
GPU     Graphical Processing Unit
ILSVRC  The ImageNet Large Scale Visual Recognition Challenge
OSTPRE  Kuopio Osteoporosis Risk Factor and Prevention Study
QCT     Quantitative Computed Tomography
QUS     Quantitative Ultrasound
RAM     Random Access Memory
ReLU    Rectified Linear Unit
ROC     Receiver Operating Characteristic
TBS     Trabecular Bone Score
WHO     World Health Organization


Contents

1 Introduction
2 Osteoporosis and diagnostics
  2.1 Osteoporosis
    2.1.1 Diagnosis and treatment
    2.1.2 Bone strength
    2.1.3 Fracture risk
  2.2 Diagnostics
    2.2.1 Imaging methods
    2.2.2 Dual-energy X-ray absorptiometry
    2.2.3 Trabecular Bone Score
    2.2.4 Evaluation of diagnostic ability
3 Artificial neural networks
  3.1 Feedforward neural networks
  3.2 Activation functions
  3.3 Gradient-based optimization
  3.4 Backpropagation
  3.5 Deep neural networks
4 Convolutional neural networks for image analysis
  4.1 Convolutional neural network
  4.2 Network architecture
    4.2.1 Layers and nodes
    4.2.2 Down sampling
    4.2.3 Regularization
  4.3 Training and hyperparameters
    4.3.1 Optimization algorithms
    4.3.2 Epochs and batches
  4.4 Transfer learning
  4.5 Visualization
5 Experimentation
  5.1 Data
    5.1.1 Output variables
    5.1.2 Datasets
    5.1.3 Augmentation
    5.1.4 Preprocessing
  5.2 Experiment setup
    5.2.1 Hardware
    5.2.2 Software
    5.2.3 Reporting and visualizations
  5.3 Model architectures
    5.3.1 Custom models
  5.4 Results and analysis
    5.4.1 Baseline results
    5.4.2 Custom models
    5.4.3 Pretrained models
    5.4.4 Hyperparameters
    5.4.5 Prediction of BMD and TBS values
    5.4.6 Visualization of predictions
  5.5 Discussion on results
6 Conclusions
References


1 Introduction

Osteoporosis is a skeletal disorder characterized by compromised bone strength and a consequent increase in fracture risk (WHO, 1994). According to a WHO report (2003), osteoporosis affects more than 75 million people in Europe, Japan and the USA. Worldwide estimates exceed 200 million people affected (Cooper & al., 1992). Osteoporosis does not only cause fractures but also causes people to become bedridden with secondary complications that can be life threatening for the elderly.

Machine learning techniques have been applied to orthopaedic problems in several studies. Conventional machine learning techniques such as decision trees, linear regression and support vector machines, among others, have been utilized for models based on predictor variables (Cabitza & al., 2018). Instead of using manually extracted variables, an alternative approach is to use raw images as input data. In recent years, deep learning, and especially convolutional neural networks (CNN), has shown promising results in a wide variety of image analysis tasks. CNNs can automatically learn relevant image representations to produce a specific output (Bengio & al., 2013; Simonyan & Zisserman, 2014).

CNNs have been used in some orthopaedic studies, producing good results for example in segmentation of bone structure (Cernazanu-Glavan & Holban, 2013) and diagnosis of osteoarthritis (Tiulpin & al., 2018). This study seeks to discover if and how convolutional neural networks can predict osteoporotic fractures from spine Dual-energy X-ray absorptiometry (DXA) images. It can bring new knowledge about the feasibility of deep learning techniques for DXA image analysis. The study also attempts to find solutions to practical deep learning challenges such as low training speed and lack of transparency.

The problem context, including relevant aspects of bone physiology, osteoporosis and imaging technology, is introduced in chapter 2. In chapter 3, the fundamental theory behind artificial neural networks is explained. The principle of CNN, as well as its architectural scope and training parameters, is presented in chapter 4. The setup, procedure and results of the experiments are described in chapter 5, and conclusions are drawn in chapter 6.

2 Osteoporosis and diagnostics

This chapter introduces the medical aspects of osteoporosis and the diagnostic methodology used for its detection. Section 2.1 describes the basic physiology of bone and the biological processes related to the disease. The risk factors shown to predict osteoporotic fractures are also outlined. In section 2.2 the current methodology for diagnosis of osteoporosis is presented. The available imaging technology is introduced, focusing on the one behind the dataset of this study. Analysis methods, and ways to evaluate their predictive ability, are also discussed.

2.1 Osteoporosis

The human skeleton is continuously regenerated in a process called remodeling. In adults, approximately 10% of the bone mass is renewed each year (Manolagas, 2000). Remodeling happens through specialized cells called osteoclasts and osteoblasts. Old bone tissue is removed by osteoclasts, while new bone is formed by osteoblasts. In osteoporosis, this process is out of balance, causing more bone tissue to be removed than created (Buckwalter, 1995; Manolagas, 2000).

The World Health Organization (WHO) defines osteoporosis as a systemic skeletal disease characterized by low bone density and deterioration of bone microarchitecture. These changes decrease bone strength, making the bone more susceptible to fracture. In osteoporosis, the morbidity of the disease arises specifically from the associated fractures (WHO, 2003).

There are two forms of bone tissue: the cortical bone that makes up the hard surface of the bone, and the more lightweight interior tissue called trabecular bone (also known as cancellous bone). Trabecular bone is less dense, having approximately twenty times more surface area per unit of volume than cortical bone. Since bone remodeling starts from the bone surface, the loss of bone mineral occurring in osteoporosis has a more imminent effect on the architecture of trabecular bone (Buckwalter, 1995). For these reasons, fractures caused by osteoporosis are commonly located in bones with a high amount of trabecular bone, such as the vertebrae of the spine, the bones of the forearm, and the hip.

Figure 2.1. Cortical and trabecular bone in vertebra (Malowney, 2013).

Osteoporosis occurs most commonly in aging women. According to WHO (2003), about 15% of white women in their 50s and 70% of those over 80 are affected. Osteoporosis is three times more common in women than in men. Hormones have an important function in preserving bone mass, and the decline in estrogen levels during menopause leads to accelerated bone loss. Women also live longer than men. Increasing life expectancy in developed countries means that women now live more than one-third of their lives after the menopause, and thus the prevalence of osteoporosis has increased.

2.1.1 Diagnosis and treatment

Until recently, osteoporosis was an under-recognized disease and considered an inevitable consequence of ageing. But now, improving diagnostic technology and better assessment facilities have made it possible to detect the disease before it causes fractures (WHO, 2003). Early prediction of fracture risk enables preventive actions and lifestyle changes. Better management of the disease can both improve the patient's quality of life and save costs to society (Cosman & al., 2014).

Measuring bone mineral density (BMD) by DXA is considered the current gold standard for diagnosing osteoporosis. According to the WHO criteria, persons with a BMD of 2.5 standard deviations or more below the young adult reference mean meet the definition of osteoporosis. Persons with bone density below this threshold who also sustain a fracture are considered to have established or severe osteoporosis.

There is no single treatment to completely reverse osteoporosis. Yet, there are different ways to manage it and decrease the risk of fractures. The main aim of intervention is to prevent bone loss in individuals who already manifest the condition or who are at risk (WHO, 2003). The management strategy is determined partly by the fracture risk.

Younger people with lower risk are given lifestyle recommendations such as regular exercise, calcium intake and smoking cessation (Cosman & al., 2014). Pharmacological interventions are used with those at higher risk, usually post-menopausal women. Available therapies include estrogens, estrogen derivatives and selective estrogen receptor modulators, bisphosphonates, vitamin D, calcitonin and denosumab (WHO, 2003; Josse & al., 2013). They act mainly by reducing bone resorption and turnover. Medicine prescribing practices vary heavily between countries, and the available comparative studies are not comprehensive enough to state the most effective treatment (WHO, 2003).

2.1.2 Bone strength

The skeleton's main purpose is to bear loads. After exposure to small or moderate stress, the elasticity of bones enables them to return to their original shape. Bone strength refers to the stress at which the bone ultimately fails and suffers permanent deformations (Väänänen, 2014). Therefore, bone strength can also be considered as the bone's ability to resist fractures.

Bone mineral density is not the sole predictor of bone strength. Less than half of the variation in whole-bone strength is attributable to variations in bone mineral density.

In fact, most of the patients who sustain fragility fractures do not fall under the WHO osteoporosis threshold of BMD below -2.5 standard deviations (Unnanuntana, 2010).

Regarding osteoporosis, the strength of trabecular bone is of particular interest. Trabecular bone is a highly porous, multi-scale structure whose mechanical properties are closely tied to its architecture: the trabecular bone is formed from groupings of trabeculated bone tissue. Its complex structure is connected by thin rods and plates, contributing to maximal strength with minimum mass (Buckwalter, 1995). The way in which bone loss affects this architecture has a significant effect on bone strength. Trabecular thinning predominates in men, while loss of connectivity dominates in women. Studies have shown that loss of connectivity has a more dramatic effect on vertebra strength than thinning (Seeman, 2008).

Figure 2.2. Micrographs of normal and osteoporotic bone (Cosman & al., 2008).

Many have studied the relationship between architectural factors and the macro-scale mechanical properties. However, modeling the bone structure has proven to be difficult. A literature review by Wu & al. (2018) shows that there is a large variation in the results reported by different studies. Bone strength is a complex property to measure without actually exposing the bone to mechanical stress tests.

Because of the architectural complexity of bone, the possible visual signs of bone strength available in X-ray images are not well known. This can be seen as motivation for utilizing deep learning techniques that do not rely on manual feature extraction. Instead, starting from random initialization of weights, they could find textural patterns or shapes that correlate with the occurrence of fractures. In theory, this could lead to the discovery of novel diagnostic evidence of osteoporotic changes in the bone.

2.1.3 Fracture risk

Osteoporotic fractures are low-trauma or fragility fractures attributable to osteoporosis. They occur most commonly in the hip, spine, forearm and humerus. Other fractures can also occur as a result of osteoporosis, but these sites are more clearly associated with osteoporotic changes in the trabecular bone (Johnell & Kanis, 2006).

Although BMD is the standard measure of osteoporosis, its fracture prediction ability is only moderate. BMD considers only the density of the bone and not the quality. Several clinical risk factors, independent of BMD, have been identified for fragility fractures. These include age, history of fracture, demographic and physical characteristics, use of other medication, family history of fracture, body weight and many lifestyle factors (Unnanuntana, 2010).

Because of the limited prediction ability of bone density, efforts have been made to formulate a system to better predict fracture risk. In 2008, WHO introduced the Fracture Risk Assessment Tool (FRAX) developed by Kanis & al. (2008). It is used for estimating the probability of osteoporotic fractures over a 10-year period. FRAX calculates the risk value based on clinical risk factors, such as age, previous fractures, medication history, body mass index, family history and lifestyle. It can be used with or without the support of BMD. However, a study by Ensrud & al. (2009) found that the FRAX model did not produce better prediction results than simpler models, such as BMD combined with age. Similar results were reported in a study by González-Macías & al. (2012).

2.2 Diagnostics

The treatment of osteoporosis is mostly preventive. It aims to slow down further loss of bone or increase bone density. Therefore, early diagnosis or identification of high risk is crucial for preventing complications. Precise imaging methods provide the basis for different diagnosis and prediction models. The methods are mainly used for two purposes: identifying the presence of osteoporosis, and calculating the bone density (Guglielmi & Scalzo, 2010).

Screening for osteoporosis to identify people at high risk has been under debate. New prediction models and automated diagnostics could contribute to general screening programs in the future.

2.2.1 Imaging methods

Imaging methods used for diagnosing osteoporosis include conventional radiography, Dual-energy X-ray absorptiometry, quantitative computed tomography (QCT) and ultrasound methods. Radiography requires about 30% bone loss to be apparent on X-ray images, and the analysis is done manually by a medical expert. It also exposes the patient to a radiation dose, and is rarely conducted without a specific reason such as a suspected fracture (Guglielmi & Scalzo, 2010).

DXA can measure the bone mineral density without a visual analysis by an expert. It uses low-energy X-ray beams, reducing the radiation dose to about one twelfth of that of conventional radiography. This results in lower resolution, but it is sufficient for calculating BMD values for different joints (Messina & al., 2016).

QCT is an extension of ordinary computed tomography (CT) to not only produce an image but also to measure tissue density. It provides separate estimates of the bone mineral density for trabecular and cortical bone (NOS, 2018). It has also shown good results in predicting vertebral fractures. However, QCT has its disadvantages such as high radiation dose, poor precision and high imaging costs (Guglielmi & Scalzo, 2010).

Quantitative ultrasound (QUS) is a generic term for ultrasound measurements of bone. In these techniques, a sound pulse is transmitted through the bone site, usually the heel, and received at the other end (NOS, 2018). The method requires no radiation and has advantages such as low device cost and quick measurements. Its ability in fracture risk prediction has been found to be close to that of BMD measured by DXA (Hans & al., 1996). Through the development of new technology, the efficacy of quantitative ultrasound methods in managing osteoporosis is expected to improve in the future (Guglielmi & Scalzo, 2010; Karjalainen & al., 2015).


2.2.2 Dual-energy X-ray absorptiometry

Dual-energy X-ray absorptiometry was introduced in 1987 and is the most widely used method of diagnosing osteoporosis in clinical practice (Guglielmi & Scalzo, 2010).

DXA can be applied to all skeletal sites where osteoporotic fractures occur. These include the lumbar spine, proximal femur, forearm and calcaneus. In a DXA system, a radiation source is aimed at a radiation detector placed directly opposite the site to be measured. The patient is placed on a table and the source/detector assembly is then scanned across the measurement region (El Maghraoui, 2012).

DXA uses X-ray beams of two different energies to form two images. From these images, the contributions of bone and soft tissue can be solved mathematically to calculate bone density. The attenuation coefficients of the two soft tissue components, lean and fat tissue, and their ratio in a particular patient are first determined by measuring at a location adjacent to bone (NOS, 2018). DXA provides the mass of bone mineral content (BMC) in grams and the projected area of the measured site in square centimeters. BMD is taken as the BMC divided by the area and is given in units of g/cm2 (Guglielmi & Scalzo, 2010). Individual BMD values form a digital image where each pixel represents a measurement point through the patient.

Figure 2.3. BMD analysis of lumbar spine DXA image (Wanderman & al., 2017).


2.2.3 Trabecular Bone Score

BMD considers only the density of the bone and not the bone quality. As its ability to predict fractures is only moderate, alternative methods for bone quality assessment have been sought. A method called trabecular bone score (TBS) was developed by Pothuaud & al. (2008) to reflect bone microarchitecture. It is calculated by applying textural analysis to spine DXA images to reflect the decay in trabecular bone structure characterizing osteoporosis.

The way in which TBS reflects microarchitecture is still questionable. According to a review by Bousson & al. (2015), the correlations between TBS and bone microarchitectural parameters vary widely across bone sites, microarchitectural parameters, and study designs. Interestingly, Maquer & al. (2016) have shown that TBS improves fracture prediction, but does not correlate directly with vertebra strength. Therefore, it is not clear what predictive information the method extracts from the images. Analyzing results and visualizations of new prediction models in comparison with TBS values could shed some light on the underlying factors.

2.2.4 Evaluation of diagnostic ability

To compare the diagnostic performance of different predictors, it is important to have a common measurement of discriminative ability. Occurrence of fractures can be considered as a binary classification output, but the positive and negative samples usually have a very imbalanced prevalence in the dataset. Therefore, classification accuracy is heavily dependent on the discrimination threshold, which defines the probability boundary above which a sample is classified as positive rather than negative. To avoid this dependency, the receiver operating characteristic (ROC) curve is often used to indicate the significance of a predictive factor in medical diagnosis (Narkhede, 2018).

The ROC curve is a graph that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied. It is created by plotting the true positive rate against the false positive rate at all possible threshold settings (Fawcett, 2005). To get a single numerical value representing the diagnostic performance illustrated by an ROC curve, the area under the curve (AUC) is calculated. An AUC of 1.0 represents a perfect predictor, while an area of 0.5 represents a completely random classifier. In medical diagnosis, the higher the area under the ROC curve, the better the model is at distinguishing between patients with and without the disease (Narkhede, 2018). As an example, the AUC for BMD in predicting fragility fractures is typically between 0.6 and 0.75 (Leslie & Lix, 2013). An example of an ROC curve is shown in Figure 2.4.

Figure 2.4. ROC curve plots the true positive rate (TPR) against the false positive rate (FPR).

AUC measures the area underneath the ROC curve. Adapted from Narkhede (2018).
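As an illustration of how these metrics are computed in practice, the sketch below evaluates a set of hypothetical fracture predictions with scikit-learn; the labels and scores are made-up values, not results from this study.

```python
# Illustrative only: ROC curve and AUC for a binary fracture classifier.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical ground-truth labels (1 = fracture) and predicted probabilities.
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 1])
y_score = np.array([0.12, 0.30, 0.55, 0.21, 0.80, 0.45, 0.05, 0.62, 0.38, 0.74])

# True positive rate and false positive rate at every possible threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Single summary value: the area under the ROC curve.
print(f"AUC = {roc_auc_score(y_true, y_score):.3f}")
```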


3 Artificial neural networks

An artificial neural network is a software system loosely inspired by the human brain. It consists of an interconnected network of artificial neurons that learns from experience by modifying its connective weights. The behavior of an artificial neuron depends on the weight values of its input connections and the activation function that transforms the input values to a single output value (van Gerven & Bohte, 2018). After training a neural network, representations become encoded in a distributed manner across all its neurons (Hinton & al., 1986). It has been proven that neural networks are universal function approximators (Cybenko, 1989).

This chapter provides the theoretical background for artificial neural networks. Sections 3.1 and 3.2 explain the principle of the feedforward neural network (FNN) and the role of activation functions. The basics of training and optimizing a neural network are discussed in sections 3.3 and 3.4. In section 3.5, deep learning is introduced and the consequences of making the networks deeper are considered.

3.1 Feedforward neural networks

In a feedforward neural network, the information moves only in the forward direction, from input nodes to output nodes. The connections between the nodes do not form cycles. The simplest kind of feedforward neural network is the single-layer perceptron (SLP), originally introduced by Frank Rosenblatt (1958). It consists of a single layer of input nodes and a single layer of output nodes.

In the most basic case, the inner product of the input vector x and the weight vector w is transformed into a response y by the activation function f:

y = f(wᵀx)

The schematic of the single-layer perceptron is illustrated in Figure 3.1.
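A minimal NumPy sketch of this response function is shown below; the step activation and the weight values are illustrative choices, not part of the original text.

```python
# Single-layer perceptron response y = f(w^T x) with a step activation.
import numpy as np

def step(z):
    # Outputs 1 when the weighted input exceeds zero, otherwise 0.
    return np.where(z > 0, 1, 0)

w = np.array([0.4, -0.7, 0.2])   # connection weights (arbitrary example values)
x = np.array([1.0, 0.5, 2.0])    # input vector

y = step(np.dot(w, x))           # inner product followed by the activation
print(y)
```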


Figure 3.1. Single-layer perceptron (Raschka, 2015).

By connecting together multiple neurons we can build a feedforward neural network (Figure 3.2) representing a non-linear function y, consisting of nonlinear transformations fᵢ and network parameters Ɵ (van Gerven & Bohte, 2018):

y = f(x; Ɵ)

Figure 3.2. Feedforward neural network.

Each layer of the network is typically vector-valued, consisting of many parallel neurons. The dimensionality of these layers determines the width of the model. For more complex function approximations, the depth of the model can also be increased by adding one or more hidden layers between the input layer and the output layer (Figure 3.3). This kind of architecture is called a multi-layer perceptron (MLP). The input data flows through a chain of functions before producing an output value (Goodfellow & al., 2015).


Figure 3.3. Multi-layer perceptron with one hidden layer.

3.2 Activation functions

The activation function of a neuron defines its output given a certain input. It maps the resulting output values into the desired range, such as 0 to 1 or -1 to 1. The simplest activation function is a step function that outputs 1 beyond a certain threshold and 0 otherwise, as shown in Figure 3.4a (Avinash, 2017).

Another choice is a linear activation function, a straight-line function where activation is proportional to input (Figure 3.4b). This approach makes the overall model linear in nature regardless of the number of layers, meaning that the final output will be nothing but a linear function of the input layer (Avinash, 2017).

The need to approximate non-linear functions requires us to use non-linear activation functions. One of the most widely used activation functions is the sigmoid function. It has a differentiable S-shaped curve that outputs a real value between 0 and 1, as shown in Figure 3.4c. It is commonly used for models predicting probability values as an output (Avinash, 2017).

In deep neural networks, the most recommended activation function is the rectified linear unit (ReLU) (Nair & Hinton, 2010; Glorot & al., 2011). It performs a nonlinear transformation through piecewise linearity as shown in Figure 3.4d. By remaining very close to linearity, ReLU is easy to optimize with gradient-based methods and preserves good generalizability (Goodfellow & al., 2015).
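The four activation functions of Figure 3.4 can be written directly in NumPy; the short sketch below is only meant to make their shapes concrete.

```python
# The activation functions discussed above, defined with NumPy.
import numpy as np

def step(z):
    return np.where(z > 0.0, 1.0, 0.0)       # (a) binary step

def linear(z, a=1.0):
    return a * z                             # (b) output proportional to input

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))          # (c) S-shaped curve in (0, 1)

def relu(z):
    return np.maximum(0.0, z)                # (d) piecewise linear rectifier

z = np.linspace(-3.0, 3.0, 7)
print(relu(z))   # [0. 0. 0. 0. 1. 2. 3.]
```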


Figure 3.4. Activation functions.

3.3 Gradient-based optimization

Optimization refers to the task of either minimizing or maximizing some objective function by altering its input value. When optimizing a neural network, we are trying to minimize the prediction error compared to the desired output value. The function that produces this error is called the cost function (also known as loss function, or error function) (Goodfellow & al., 2015).

The derivative f'(x) depicts the slope of the function f(x) at the point x. It is useful for minimizing a function because it tells how to change x in order to improve the output y. We can thus reduce f(x) by moving x in small steps with the opposite sign of the derivative, as illustrated in Figure 3.5. This technique is called gradient descent (Cauchy, 1847).


Figure 3.5. Gradient descent (Raschka, 2015).

To deal with neurons having multiple inputs, we must make use of the concept of partial derivatives. The partial derivative measures how the objective function f(x) changes with respect to one specific variable at point x. The gradient of function f is the vector containing the partial derivatives of all the variables (Goodfellow & al., 2015).
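The following toy sketch applies this idea to a simple two-variable function; the function, learning rate and iteration count are arbitrary illustration choices.

```python
# Gradient descent on f(x, y) = x**2 + 3*y**2, stepping against the gradient.
import numpy as np

def grad(p):
    x, y = p
    return np.array([2.0 * x, 6.0 * y])   # partial derivatives df/dx and df/dy

p = np.array([2.0, -1.5])                  # arbitrary starting point
learning_rate = 0.1

for _ in range(100):
    p = p - learning_rate * grad(p)        # small step with the opposite sign of the gradient

print(p)                                   # approaches the minimum at (0, 0)
```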

3.4 Backpropagation

Backpropagation is the most widely used learning algorithm for artificial neural networks (Nielsen, 2015). Its main purpose is to associate each neuron with a cost value that reflects its contribution to the overall cost value of the network output. Backpropagation can be performed after an input vector has been fed to the network and has propagated through all the layers to the output layer. The output layer error given by the cost function is then propagated back through the network. The cost values are calculated by using the partial derivative of the error with respect to the different weights of a neuron. Then the gradients for each layer are iteratively computed by using the chain rule (Rumelhart & al., 1986; Le Cun, 1988). The chain rule is a formula for computing the derivative of the composition of multiple functions. The information passing through multiple layers of nodes can be thought of as such a composition.


Figure 3.6. Backpropagation in feedforward neural network.

When each neuron has been assigned with a cost value, its connective weights can be updated using the gradient descent (Rumelhart & al., 1986). For all the connection weights in the network, a certain percentage of the weight's gradient is subtracted from the weight. The used percentage is called the learning rate, and it affects the learning speed and stability of the optimization process (Goodfellow & al., 2015).

Using the backpropagation algorithm and updating the network weights after each training sample is computationally expensive. To improve efficiency, the training dataset can be split into small batches that are used to calculate the model error and update the weights (Bengio, 2012). This optimization method, called mini-batch gradient descent, is commonly used in deep learning.

3.5 Deep neural networks

Conventional machine-learning techniques usually have limited ability to process raw data without manual feature extraction. Therefore, creating a powerful pattern-recognition system has traditionally required domain expertise and careful engineering (LeCun & al., 2015). For example, in image analysis tasks this means transforming the image data into feature vectors which can then be used to train the system.

Representation learning is a set of techniques that allows a system to automatically discover the representations needed for its task from raw input data. Deep neural networks learn such representations through multiple levels of information processing, corresponding to different levels of abstraction. The bottom layers near the input learn low-level patterns, whereas the top layers near the output learn features of a higher abstraction level (van Gerven & Bohte, 2018).

Deep neural networks can model complex non-linear relationships, required for example in machine vision, natural language processing and other difficult AI tasks. The use of multiple hidden layers enables hierarchical composition of features from lower layers. Therefore, a deep network potentially requires fewer nodes than a similarly performing shallow network (Bengio, 2009).

Increasing the depth of the network also introduces new challenges for the training algorithm. Firstly, the training of deeper networks requires more computational power. Luckily, machine learning algorithms involve a lot of matrix and vector calculation, which can be easily parallelized. By using graphics processing unit (GPU) computation and cloud computing, the training speed of deep networks can be significantly improved (Deng & Yu, 2014).

Secondly, the training of deep neural networks is usually based on backpropagation, where local gradients for the prediction error are calculated in each layer and node of the network. The relative contribution to the error tends to get smaller as the computation advances back through the layers towards the input. This problem is called the vanishing gradient. Its severity increases significantly as the depth of the network increases, often causing the optimization to slow down or even stop (Nielsen, 2015; Goodfellow & al., 2015).

Especially highly non-linear activation functions such as the sigmoid function suffer from the vanishing gradient. Rectifiers such as ReLU can help avoid the problem, since they only saturate in one direction (Glorot & al., 2011). Because of the piecewise linearity of the ReLU activation function, taking the derivative of its output always produces a gradient of either zero or one. Therefore, continuous multiplication does not decrease the gradient the way it does with highly non-linear activations.

Third, and the most commonly occurring, is the problem of overfitting. During training, the network can fit the training data too closely and fail to generalize to previously unseen input data. Increasing the depth of the network enables it to model more complex non-linear relationships in the data, but it also makes it more prone to overfitting (Eldan & Shamir, 2016). Different regularization techniques can be used to reduce generalization error, although their effectiveness highly depends on the learning task at hand (Goodfellow & al., 2015).


4 Convolutional neural networks for image analysis

As neural networks take numerical data as input, image analysis might seem like a simple case of feeding the network with raw image data. After all, images are nothing but grids of numbers representing pixel intensities. Actually, this approach can work well for simple recognition tasks with small, normalized images (Nielsen, 2015). The problem is poor scalability. When the input size and the number of layers increase, the amount of required computation explodes. Also, the full connectivity of input pixels and hidden layers quickly leads to overfitting. Another deficiency of the fully connected architecture is that it ignores the topology of the input data. This does not serve the purpose of analyzing 2D data such as images (LeCun & Bengio, 1995). Convolutional neural networks are a special class of deep neural networks aiming to solve the aforementioned problems.

This chapter presents the idea of using convolutional neural networks for image analysis. Section 4.1 explains the principle of convolutional neural networks. In section 4.2 the major building blocks of CNN architecture design are explained. In section 4.3 the training procedure and related hyperparameters are discussed. Transfer learning and the use of pretrained models are introduced in section 4.4. Section 4.5 considers the challenge of transparency in prediction systems and introduces a visualization technique for tackling that challenge.

4.1 Convolutional neural network

A convolutional neural network (CNN) is a special kind of neural network for processing data that has a grid-like topology (LeCun, 1989). It is based on the deep feedforward neural network, consisting of multiple layers of neurons that have learnable weights and biases. In addition, CNN introduces three architectural ideas to ensure shift and distortion invariance: local receptive fields, shared weights and spatial down sampling (LeCun & Bengio, 1995).

Because elementary feature detectors are useful across the entire image, these units can share weights to simplify the model, thus reducing the number of free parameters. This limits the capacity of the network but, more importantly, improves its generalization ability (LeCun, 1989). The set of weight-sharing units is called a convolutional filter, or kernel. The output of a filter constitutes a feature map. A convolutional layer is usually composed of several feature maps with different filters.

The convolutional filters are small matrices, usually only a few pixels in width and height. The filter slides through the input and multiplies the pixel values with the filter's weights, producing the feature map as an output. In neural networks this process is described as a convolution by convention. Mathematically it is actually a cross-correlation rather than a convolution (Goodfellow, 2015). However, the principles of these two operations are very close to each other.
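A small NumPy sketch of this sliding-filter operation (cross-correlation with stride 1 and no padding) is given below; the image and kernel values are arbitrary illustrations.

```python
# Sliding a small filter over an image and taking a weighted sum at each position.
import numpy as np

def cross_correlate2d(image, kernel):
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            feature_map[i, j] = np.sum(patch * kernel)   # element-wise product and sum
    return feature_map

image = np.arange(25, dtype=float).reshape(5, 5)    # toy 5x5 "image"
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])        # 2x2 filter (illustrative weights)
print(cross_correlate2d(image, kernel).shape)       # (4, 4) feature map
```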

Once a feature has been detected, it is sufficient to preserve its approximate position relative to other features. Therefore, each convolutional layer is usually followed by an additional down sampling layer that reduces the resolution of the feature map. Discarding the exact location of a feature reduces the sensitivity of the output to shifts and distortions (LeCun & Bengio, 1995). The combination of convolution and down sampling was inspired by the organization of cells in the animal visual cortex. Individual cortical neurons respond to stimuli only in a restricted region of the visual field (Hubel & Wiesel, 1959).

4.2 Network architecture

The architecture of a CNN is a sequence of layers that transform the input data into an output value. Commonly the overall architecture can be divided into a feature extraction part and a classification part. The feature extraction is performed by convolutional layers, down sampling layers and rectifier layers, while the classification uses fully connected layers. The basic design work involves choosing the combination of different layer types and adjusting their parameters. An example of a typical convolutional neural network architecture is illustrated in Figure 4.1.


Figure 4.1. Architecture of a CNN. Adapted from Patel & Pingel (2017).
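A sketch of this generic layout in tf.keras is shown below. The layer counts, filter numbers and input size are illustrative assumptions; they are not the architectures evaluated later in this thesis.

```python
# Generic CNN: convolution + ReLU + pooling blocks, then fully connected layers.
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(160, 300, 1)),                  # grayscale input (assumed size)
    layers.Conv2D(16, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(pool_size=2),                   # down sampling
    layers.Conv2D(32, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),                # fully connected classification part
    layers.Dense(1, activation="sigmoid"),              # probability of the positive class
])
model.summary()
```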

The best performing architecture for a pattern recognition task is always case dependent. However, the lack of clear design guidelines may prompt practitioners to rely on existing standard architectures, regardless of their suitability to the application at hand. These architectures are publicly available and proven in visual recognition benchmark challenges such as the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) (Russakovsky & al., 2015). Research into CNN architectures proceeds so fast that a new best architecture for a given benchmark is announced every few months (Goodfellow & al., 2015).

4.2.1 Layers and nodes

It is notable that the winners of the ILSVRC have successively used deeper networks. Having more hidden layers gives greater representational power, making it possible to approximate more complex functions. Also, fast development in the field continuously introduces new techniques and architectures, expanding the toolbox for design.

However, science has always embraced simplicity, and it is reasonable to argue that the network should be kept as simple as possible (Smith & Topin, 2017). A study by Springenberg & al. (2014) has even questioned the common practice of using pooling and sophisticated activation functions in CNNs. Instead, they suggest using just convolutional layers with properly selected stride and filter size parameters. With these kinds of simple architectures, they have shown competitive results on the ImageNet challenge.

Another design choice to make is the number of nodes in each layer. For the input and output layers, the number of nodes is dictated by the data and the task. For the hidden layers, the appropriate width depends on the dimensionality of the data and the complexity of the function the network is supposed to approximate.

Similar to the depth of the network, the width can also cause overfitting. The complexity of a neural network is determined by the number of free parameters, which in turn is determined by the number of neurons (Hagan & al., 2014). So again, to make a design that generalizes well, we need to find the simplest network that fits the data.

4.2.2 Down sampling

An important element in architectural design is the understanding of trade-offs. One fundamental trade-off is the maximization of representational power versus the elimination of redundant and non-discriminating information. This is often achieved by down sampling activations and increasing the number of channels from the input to the final layer (Smith & Topin, 2017).

The most common way of down sampling in a CNN is by using pooling layers. In the pooling operation, the outputs of several nearby feature detectors are combined into bags of features, so that irrelevant details are removed but task-related information is preserved. Pooling is used to achieve transformation invariance, more compact representations and better robustness to noise (Boureau, 2010). The pooling operation is typically a sum, an average or a maximum.

The process of calculating the feature map in a convolutional layer can be thought of as a filter moving through a grid of data. The step of the filter’s movement is defined by a parameter called stride. A stride of one means that the filter moves by one pixel after each operation until the whole image is processed. Increasing the stride reduces the overlap of receptive fields and reduces spatial resolution of the layer’s output. So, the spatial dimensionality reduction performed by pooling can be alternatively achieved using a proper stride value (Springenberg & al., 2014).
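The equivalence of these two down sampling approaches can be illustrated with a short tf.keras sketch; the tensor shapes are arbitrary.

```python
# Halving spatial resolution with a pooling layer versus a stride-2 convolution.
import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((1, 64, 64, 8))           # dummy batch of feature maps

pooled = layers.MaxPooling2D(pool_size=2)(x)   # -> (1, 32, 32, 8)
strided = layers.Conv2D(8, kernel_size=3, strides=2, padding="same")(x)  # -> (1, 32, 32, 8)

print(pooled.shape, strided.shape)
```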

4.2.3 Regularization

Another design trade-off is training accuracy versus generalization to unseen cases.

The problem of poor generalization is also known as overfitting. When designing the architecture, regularization techniques are needed to keep overfitting under control. One of the most popular regularization techniques in deep learning is dropout (Smith & Topin, 2017). The key idea is to randomly drop units from the network during training. This prevents the units from co-adapting too much (Srivastava, 2014).

Another commonly used regularization technique is L2 regularization, also known as weight decay. The idea of L2 regularization is to add an extra regularization term to the cost function. This term is calculated as the sum of the squares of all the weights in the network, scaled by the size of the training set. The intuition is to make the network prefer to learn small weights. L2 regularization can thus be viewed as a way of compromising between finding small weights and minimizing the original cost function (Nielsen, 2015).

When training a model that overfits the task, it is typical that the training error decreases steadily over time, but the validation error begins to rise again at a certain point (Figure 4.2). This is the point where the model has reached its best generalization ability before it starts to overfit. By saving the model weights each time the validation error decreases, it is possible to restore the best parameters at the end of training. This simple and effective regularization strategy is known as early stopping (Goodfellow & al., 2015).

Figure 4.2. Typical relationship between training and validation error.
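The three regularization techniques described in this section could be combined in tf.keras roughly as follows; the dropout rate, weight decay factor and patience are placeholder values.

```python
# Dropout, L2 weight decay and early stopping in a small tf.keras model.
import tensorflow as tf
from tensorflow.keras import layers, models, regularizers, callbacks

model = models.Sequential([
    layers.Input(shape=(100,)),
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),   # L2 / weight decay
    layers.Dropout(0.5),                                      # randomly drop units during training
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Early stopping: restore the weights from the epoch with the lowest validation error.
early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True)
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=200, callbacks=[early_stop])
```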


4.3 Training and hyperparameters

After designing the basic architecture of the machine learning model, there are still many decisions to be made that affect its performance. The behavior of the learning algorithm can be adjusted through several settings called hyperparameters. Hyperparameters control various aspects of the learning algorithm and can have big effects on the model's performance (Claesen & Moor, 2015). They are the kind of settings that are not appropriate to learn by training. For example, if the settings controlling model capacity were learned on the training set, they would always adjust to the maximum, resulting in overfitting (Goodfellow & al., 2015). Instead, the hyperparameters are set before the training process.

Typical hyperparameters include training parameters such as the learning rate, batch size and number of epochs. Architecture-related properties such as the number of layers and layer size are also considered hyperparameters. Convolutional neural networks introduce additional parameters like the number of filters, filter size, padding and stride.

Different hyperparameter combinations should be tested in order to maximize the usefulness of the learning approach (Goodfellow & al., 2015). Hyperparameter search is commonly done manually by trial and error, following rules of thumb, or by systematic grid search. In addition, the idea of automating hyperparameter search is receiving increasing research attention. Automated approaches have already been shown to outperform manual search by experts on several problems (Claesen & Moor, 2015).

4.3.1 Optimization algorithms

The backpropagation learning method is an iterative process of two repeating phases: propagation and weight update. In the propagation phase the cost values for each node are calculated. These costs are then used to calculate the gradient with respect to the weights. The gradient is fed to the optimization method, which uses it to update the weights (Nielsen, 2015). The simplest optimization method uses the gradient with a fixed learning rate coefficient. However, it is difficult to select the learning rate precisely in a task with a large dataset and a long training period. Several variations of basic gradient descent have been developed to address this; the most common ones are outlined below.

Momentum

Momentum aims to improve convergence speed and reduce oscillation by accelerating gradient vectors in the right directions. The momentum term increases for dimensions whose gradients point in the same direction and decreases for dimensions whose gradients change directions (Ruder, 2016). It is a very popular technique, used not only by many state-of-the-art models but also within other optimization algorithms.

Adagrad

The Adagrad algorithm adapts the learning rate to the parameters, performing larger or smaller updates depending on their importance. It modifies the general learning rate at each step for every parameter based on the past gradients of that parameter (Duchi, 2011; Smoliakov, 2010). One of Adagrad's benefits is that it eliminates the need to manually tune the initial learning rate (Ruder, 2016).

Adadelta & RMSProp

Adagrad implies a decreasing learning rate even with constant gradients, due to the accumulation of gradients from the beginning of training. Adadelta and RMSProp, two very similar extensions of Adagrad, seek to reduce this effect. They change the gradient accumulation into an exponentially weighted moving average, discarding the history from the distant past (Smoliakov, 2010).

Adam

Adam combines RMSProp and momentum. It adds a bias correction mechanism and, instead of raw stochastic gradients, uses a smoothed version of the gradient (Smoliakov, 2010). The authors of Adam have shown empirically that it works well in practice and compares favorably to other adaptive learning methods (Kingma & Ba, 2017). In his review of modern optimization algorithms, Sebastian Ruder (2016) recommends using Adam as the default choice.
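All of the optimizers above are available off the shelf in tf.keras; the sketch below only shows how one of them would be instantiated and selected, using common default learning rates rather than values tuned for this task.

```python
# Instantiating the optimization algorithms discussed above.
from tensorflow.keras import optimizers

sgd_momentum = optimizers.SGD(learning_rate=0.01, momentum=0.9)
adagrad = optimizers.Adagrad(learning_rate=0.01)
rmsprop = optimizers.RMSprop(learning_rate=0.001)
adam = optimizers.Adam(learning_rate=0.001)   # a common default choice

# model.compile(optimizer=adam, loss="binary_crossentropy")
```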

4.3.2 Epochs and batches

As noted in section 3.4, the training dataset is usually split into smaller subsets (mini-batches). The mean error of the samples in a single batch is used to perform the backpropagation in one training step (Bottou, 2018). When all the batches that constitute the training set have passed through the network, one epoch is complete. The training of a neural network is an iterative process, and in order to converge, multiple epochs are usually required (Goodfellow & al., 2015).

Larger batches provide a more accurate estimate of the gradient, but in parallel processing the batch size is often limited by the amount of memory. A small batch size can also add regularization, possibly due to the added noise in the training process (Wilson & Martinez, 2003). It is also important that the mini-batches are selected randomly. Computing an unbiased estimate of the expected gradient from a set of samples requires the samples to be independent and varying (Goodfellow & al., 2015).
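In high-level frameworks the epoch and batch logic is handled by the training loop itself. The sketch below uses tf.keras with random dummy data in place of the real images; the batch size and epoch count are arbitrary.

```python
# Mini-batch training: batch_size samples per weight update, one epoch per full pass.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

x_train = np.random.rand(256, 100).astype("float32")       # dummy inputs
y_train = np.random.randint(0, 2, size=(256, 1))            # dummy binary labels

model = models.Sequential([
    layers.Input(shape=(100,)),
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

model.fit(
    x_train, y_train,
    batch_size=32,   # samples used for one gradient estimate and weight update
    epochs=5,        # full passes over the training data
    shuffle=True,    # mini-batches drawn in a new random order every epoch
)
```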

4.4 Transfer learning

In order to maximize performance for a specific visual recognition task, the traditional approach is to design a highly specialized deep network and optimize its hyperparameters. In many cases, with proper training data and expertise, we can get impressive results this way. However, training a deep CNN from scratch requires a large amount of labeled training data as well as extensive computational and memory resources (Tajbakhsh & al., 2016). Also, it may prove difficult to find the proper architecture and learning parameters to tackle the overfitting and convergence issues that come with a deep architecture (Erhan & al., 2009).

Convolutional neural networks can be used to learn generic image representations that can serve as off-the-shelf features. These features are generic in the sense that they can be used to solve tasks far from those they were originally trained for (Azizpour, 2014). The process of applying previously learned knowledge to different but related problems is referred to as transfer learning (Bengio & al., 2013).

In transfer learning, the pretrained model can be used either as a feature extractor or as a basis for fine tuning. In feature extraction we replace the pretrained model's output layers with new ones suited to the task and train only these new layers, keeping the pretrained weights fixed. In fine tuning we also train the feature extraction layers. Training all the layers of a very deep model can be time consuming, so it is often useful to train only some of the top layers. This way the low-level feature extraction is kept intact, but the higher abstraction layers adapt to our specific task (Tajbakhsh & al., 2016).

Pretrained models have been applied successfully to various image analysis tasks. Transfer learning implementations often utilize models trained with the ImageNet dataset (Russakovsky & al., 2015), such as VGG (Simonyan & Zisserman, 2014), ResNet (He & al., 2015) or Inception (Szegedy & al., 2015). The ImageNet dataset contains over 14 million images hand-annotated to indicate what objects are pictured. It contains over 20 thousand categories varying from animals and plants to cars and sports.

The substantial differences between these pretraining images and medical images may raise skepticism, but a study by Tajbakhsh & al. (2016) has shown that even in medical problems, the transfer learning approach can compete with CNNs trained from scratch. Using pretrained models also seems to be more robust to the size of the training set.
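A hedged sketch of both transfer learning modes with tf.keras is shown below, using an ImageNet-pretrained VGG16 as an example backbone; the input size and head layers are assumptions, not the configuration used in the experiments of this thesis.

```python
# Feature extraction with a frozen pretrained base; fine tuning would unfreeze top layers.
import tensorflow as tf
from tensorflow.keras import layers, models

base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False                        # feature extraction: keep pretrained weights fixed

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),    # new task-specific output layer
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Fine tuning: additionally train some of the top convolutional layers.
# for layer in base.layers[-4:]:
#     layer.trainable = True
```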

4.5 Visualization

A commonly recognized challenge with neural networks has been the lack of transparency. This is especially relevant in the context of medical diagnostics (Tiulpin & al., 2018). In order to gain trust, the prediction system has to be able to explain why it predicts what it predicts. In the development phase of such systems, transparency also helps the researchers to focus on the correct research directions. In the case of a well-performing artificial intelligence system, an interpretable machine may even teach humans how to make better decisions (Selvaraju & al., 2017).

To address the transparency problem, visualization techniques can be used to identify the image regions most important for the model's prediction. It is also possible to highlight the shapes and textures the network sees by visualizing the pixel-wise impact on the prediction score. Selvaraju & al. (2017) have proposed a technique called Gradient-weighted Class Activation Mapping (Grad-CAM), which uses the gradients of the target class flowing into the final convolutional layer of the network to produce a coarse localization map of the image regions most important for predicting that class.

Another visualization technique is called guided backpropagation. It produces an illustration of the detected features that contributed to the class prediction. This is done by taking into account only the neurons that had a positive effect on the output, while the others are set to zero. Combining different visualization techniques can produce effective visual explanations for the model's predictions. We can get an idea of the shape and size of the most important extracted features, as well as their relative importance.

Figure 4.3. Example visualization of handwritten digit classification. The original image on the left, localization heatmap in the middle and guided backpropagation map on the right.
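A simplified Grad-CAM sketch in tf.keras is given below: gradients of the predicted score with respect to the last convolutional feature maps are pooled into channel weights, and the weighted maps form a coarse heatmap. The model and layer name are placeholders, and details such as resizing the heatmap to the input image size are omitted.

```python
# Coarse Grad-CAM style localization map (after Selvaraju & al., 2017).
import numpy as np
import tensorflow as tf

def grad_cam(model, image, conv_layer_name):
    conv_layer = model.get_layer(conv_layer_name)
    grad_model = tf.keras.models.Model(model.inputs,
                                       [conv_layer.output, model.output])
    with tf.GradientTape() as tape:
        conv_maps, prediction = grad_model(image[np.newaxis, ...])
        score = prediction[:, 0]                      # score of the target output
    grads = tape.gradient(score, conv_maps)           # d(score) / d(feature maps)
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))   # average-pooled gradients per channel
    cam = tf.reduce_sum(weights * conv_maps[0], axis=-1)
    cam = tf.nn.relu(cam)                             # keep only positive influence
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()

# heatmap = grad_cam(model, dxa_image, "last_conv")   # hypothetical layer name
```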


5 Experimentation

This chapter describes the experiments conducted in search of answers to the research questions. In section 5.1, the selection, preprocessing and use of the dataset are explained. The hardware and software setup for the experiments is described in section 5.2, and the neural network architectures used in section 5.3. The results are presented and analyzed in section 5.4 and further discussed in section 5.5.

5.1 Data

The primary data source for the study was a database collected by the Kuopio Osteoporosis Risk Factor and Prevention Study (OSTPRE). OSTPRE is a population-based prospective cohort study originally aimed at investigating factors associated with bone mineral density, bone loss, falls and fractures in peri- and postmenopausal women. It includes enquiry data of 14,220 elderly females from the Kuopio region, collected since 1989 (OSTPRE, 2018).

OSTPRE patients have been measured at 5-year intervals. In the 15-year follow-up conducted between 2003 and 2006, DXA images were taken from a subset of around 3000 patients. At the time of imaging, the age of the patients varied from 61 to 73 years. Information about the fractures was collected in the 20-year follow-up questionnaire in 2009. The hip fractures have been verified from national registers, but other fracture types are based on the patients' own reports.

5.1.1 Output variables

The focus of the experiments was on the prediction of osteoporotic fractures. By definition, these fractures are low-trauma fractures attributable to osteoporosis, commonly occurring in the hip, spine, forearm and humerus (Unnanuntana, 2010). The dataset did not include information on the source event of the fractures. Because of this, the osteoporotic fractures were identified solely based on the site of the fracture. To make the classification task binary, the count of suffered fractures was discarded, and the output variable indicated whether the patient suffered at least one osteoporotic fracture within five years of the DXA image being taken. By this definition the dataset had 15.5% positive samples and 84.5% negative samples.

Additional experiments were conducted to predict BMD and TBS values in order to gain a better understanding of the CNN's ability to learn different output variables from the same input data. This was particularly interesting from the visualization perspective. TBS prediction could show how well textural analysis functions can be approximated by a CNN. BMD prediction, on the other hand, should be almost trivial, since the value is directly calculated from the pixel intensities of a certain region of interest. Therefore, it could serve as verification for the prediction functionality and the visualization techniques.
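Only the output layer and the loss function need to change between these tasks. The sketch below illustrates this with a deliberately tiny placeholder feature extractor; it is not the model used in the experiments.

```python
# Same backbone, different heads: sigmoid + cross-entropy for fracture classification,
# linear output + mean squared error for BMD or TBS regression.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(task):
    inputs = layers.Input(shape=(160, 300, 1))            # assumed image size
    x = layers.Conv2D(16, kernel_size=3, activation="relu")(inputs)
    x = layers.GlobalAveragePooling2D()(x)
    if task == "fracture":                                 # binary classification
        outputs = layers.Dense(1, activation="sigmoid")(x)
        loss = "binary_crossentropy"
    else:                                                  # BMD / TBS value prediction
        outputs = layers.Dense(1, activation="linear")(x)
        loss = "mse"
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss=loss)
    return model

fracture_model = build_model("fracture")
bmd_model = build_model("bmd")
```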

5.1.2 Datasets

In order to evaluate the model's prediction performance on unseen data, the dataset has to be divided into training and test sets. The test set contains samples that are separate from the data used for training the system. They should not be used in any way to make choices about the model, including its hyperparameters. To optimize the hyperparameters, the performance of the model needs to be validated. For this purpose, a subset of the training set is selected to serve as a validation set. It is used to estimate the generalization error during and after training, allowing the hyperparameters to be updated accordingly (Goodfellow & al., 2015).

In our experiments 80% of the dataset was used for training, 10% for validation and the remaining 10% for testing the model's overall performance. Samples for each set were selected randomly. The dataset was heavily imbalanced, having much fewer positive samples than negative samples. To get reliable results in such a case, approximately the same distribution needs to be retained across the training, validation and test sets. The random selection was repeated until all the sets had a positive sample rate within one percentage point of the overall ratio. A summary of the datasets is shown in Table 5.1.


Table 5.1. Datasets used in the experiments. Positive samples are the cases where osteoporotic fractures occurred within 5 years from taking the DXA image.

Dataset          Total samples   Positive samples   Positive sample rate
Training set     2385            372                15.60%
Validation set   300             45                 15.00%
Test set         300             46                 15.33%
Total            2985            463                15.51%

5.1.3 Augmentation

To gain better coverage of the problem space, the number of training samples can be artificially expanded. This method, called data augmentation, means modifying samples with random transformations such as rotation and flipping, varying properties like brightness and contrast, or adding random noise (Simard & al., 2003). However, when considering data augmentation, the real variance of the problem space has to be taken into account. Many data augmentation techniques have been designed for natural images, where the variance is practically unlimited. Medical images usually have a more standardized format, and it is not useful to augment the data with variance that cannot exist in real samples.

Another reason why some augmentations will not work is that pixel intensities are more important in many medical imaging modalities than in natural images. In DXA images, for example, the pixel intensity is a measure of bone density in that particular area of the body. Some augmentation techniques can distort this information and complicate the learning process.

The scanners are calibrated when replacing devices, so in theory, the images should be comparable to ones taken with other devices. Still, there is some variance in the images regarding size and positioning. Some images have blurriness, e.g. due to a high amount of fat in the patient's body. Also, the patient must hold very still while the image is taken, or movement artefacts may occur.

The selected methods for augmentation were flipping and rotation. Flipping is a very simple technique to apply: the corresponding pixel intensities from the left and right of the image are swapped. Since it does not require any interpolation, the pixel intensities remain intact. The variance it simulates is also very natural, as scoliosis and other asymmetries of the spine can occur in either direction.

Some rotation exists in the images, because the orientation of the patient can vary a little. Adding a few degrees of rotation at random can produce new samples with the same properties but slightly different positioning. However, rotation is not quite as trivial to compute as flipping. The rotation algorithm can produce pixel coordinates that are not integers, so interpolation heuristics have to be employed to generate the intensity of the pixels at each integer position. This means that some information is always lost in the transformation. Furthermore, the rotated image has to be cropped to avoid empty pixels, so the cost of the operation is a slightly reduced image size.
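
A minimal sketch of the selected augmentation pipeline is shown below, assuming SciPy's ndimage module for the rotation. The rotation range and crop margin are illustrative values, not parameters taken from the thesis.

```python
import numpy as np
from scipy import ndimage

def augment_image(image, rng, max_angle=3.0, crop_margin=5):
    """Create one augmented copy of a DXA image by random horizontal flipping
    and a small random rotation."""
    result = image.copy()
    if rng.random() < 0.5:
        result = np.fliplr(result)  # mirror left/right; intensities stay intact
    angle = rng.uniform(-max_angle, max_angle)
    # Rotation interpolates between pixels, which slightly alters intensities.
    result = ndimage.rotate(result, angle, reshape=False, mode="nearest")
    # Crop a small border so that no artificial edge pixels remain,
    # at the cost of a slightly reduced image size.
    return result[crop_margin:-crop_margin, crop_margin:-crop_margin]

# Example usage with a NumPy random generator:
# rng = np.random.default_rng(42)
# augmented = augment_image(dxa_image, rng)
```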

5.1.4 Preprocessing

Applying convolutional networks to image recognition removes the need for a separate hand-crafted feature extractor, but some preprocessing is often useful. Normalizing the images for size and cropping them to contain only the region of interest can improve training speed. Further normalization of orientation should not be required in CNN models, as the shared weights and downsampling bring invariance with respect to small geometric transformations and distortions (LeCun & Bengio, 1995).

The width of a DXA image is a standard 300 pixels, but the height depends on the length of the scanning process, varying from 134 to 186 pixels in the dataset. The lumbar spine is near the vertical center of the image, so some cropping can be safely applied. A target height of 150 pixels was used; shorter images were dropped as anomalies and the rest were cropped to that height from the top of the image.

The spine fits easily into half the horizontal space of the image, so 150 pixels was also a natural choice for the target width. In many of the images the spine was not horizontally centered, so a simple centering algorithm was required to avoid cropping away regions of interest.

The intensity-weighted horizontal center of the spine was calculated from the vertical center of the image: the cumulative sum of the intensities in the middle pixel row was calculated, and its half-point was taken as the spine center. The outcome of the algorithm was visually verified to be sufficient for approximate centering.
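
The centering algorithm can be sketched as follows; the function name and the clamping of the crop window at the image borders are illustrative details.

```python
import numpy as np

def center_crop_spine(image, target_width=150):
    """Crop the image to `target_width` columns around the spine center,
    estimated as the half-point of the cumulative intensity in the middle row."""
    middle_row = image[image.shape[0] // 2, :]
    cumulative = np.cumsum(middle_row)
    # Column where half of the total row intensity has been accumulated.
    center_col = int(np.searchsorted(cumulative, cumulative[-1] / 2.0))
    left = center_col - target_width // 2
    # Clamp the crop window so it stays inside the image.
    left = max(0, min(left, image.shape[1] - target_width))
    return image[:, left:left + target_width]
```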

The pixels that the DXA scanner produces are not square but rectangular, 0.6 mm in width and 1.05 mm in height. The selected image size of 150x150 pixels therefore corresponds to a real area of 90 mm x 157.5 mm. Rendering the images with square pixels thus produces an anatomically incorrect result, which can be fixed by scaling up the height according to the original pixel shape. Figure 5.1 illustrates the preprocessing and augmentation performed for the dataset.

Figure 5.1. Data preprocessing and augmentation. The original image is first cropped, and augmented copies are then generated by flipping and rotation.

5.2 Experiment setup

The main purpose of the experiments was to find out whether a convolutional neural network can predict osteoporotic fractures from spine DXA images. To answer this question, five different CNN models were deployed for the experiments. The effect of different hyperparameters was tested by starting from common or default values and running experiments with adjustments to a single parameter at a time. The optimization of the hyperparameters was performed manually by trial and error.

To get a baseline for the experiment results, the average BMD of vertebrae L2-L4 was used to predict osteoporotic fractures on the test set. This represents the current standard method of diagnosis. A similar prediction test was performed with TBS to assess its fracture prediction ability. To compare CNN performance to a simple neural network model, a single-layer perceptron model was also created.
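
The baseline evaluation reduces to computing the ROC AUC with BMD as the only predictor, as in the sketch below. Negating BMD is an assumption based on lower density indicating higher fracture risk.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bmd_baseline_auc(bmd_l2_l4, fracture_labels):
    """ROC AUC obtained when the average L2-L4 BMD alone is used as the predictor.
    Lower BMD is assumed to indicate higher fracture risk, so the negated
    value is used as the prediction score."""
    scores = -np.asarray(bmd_l2_l4)
    return roc_auc_score(fracture_labels, scores)
```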

The area under the ROC curve was used as the measure of performance. Additionally, the learning graphs and prediction visualizations were analyzed to evaluate the models' training behavior. Measuring the exact performance value of each model was not the purpose of this study; instead, the general level of predictive ability across the CNN models was of interest. Also, the differences between the models and their performances were viewed from the perspective of learning which types of CNN architecture are suitable for analyzing DXA images. This was also the objective of testing the effects of different hyperparameters.

5.2.1 Hardware

Deep learning requires a substantial amount of computing power. The heavy computation of routines such as convolution and backpropagation makes the training of deep convolutional neural networks practically impossible without graphics processing unit (GPU) parallelization. Machine learning involves a lot of simple matrix math that can be efficiently performed in parallel. The amount of display memory on the GPU device is crucial: depending on the complexity of the model, millions of bytes of data are often required to compute one pass of data through the network. The optimal model size and batch size are therefore constrained by the available memory.

The experiments were executed on a desktop computer with an Intel Core i5-6600 3.3 GHz central processing unit (CPU), 16 GB of random access memory (RAM) and an Nvidia GeForce GTX 970 GPU with 4 GB of display memory. The display adapter supports the Nvidia CUDA parallel computing platform and enables GPU-accelerated computation through the Nvidia CUDA Deep Neural Network library (cuDNN). The library is integrated into widely used deep learning frameworks such as TensorFlow, Keras, Caffe and PyTorch.

5.2.2 Software

The application for running the experiments was implemented using Python 3.6. The GPU-supported version of TensorFlow was selected as the machine learning framework. The choice was based on its popularity and wide support for different learning models, architectural features and hyperparameters. The custom models were built using low-level TensorFlow libraries to gain more flexibility and control over the model structure. The pretrained models were operated through the Keras neural network library running on top of the TensorFlow backend. Many deep learning models with pretrained weights are available as Keras Applications; these models can be used for prediction, feature extraction and further fine-tuning.
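
As an example of the Keras Applications workflow, a pretrained network can be loaded and extended with a new classifier head as sketched below. VGG16 and the head shown here are illustrative choices; this passage does not specify which pretrained models were used, and grayscale DXA images would have to be replicated to three channels.

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

def build_pretrained_model(input_shape=(150, 150, 3)):
    """Load a pretrained convolutional base and attach an illustrative
    classifier head for binary fracture prediction."""
    base = VGG16(weights="imagenet", include_top=False, input_shape=input_shape)
    base.trainable = False  # freeze the pretrained convolutional base
    model = models.Sequential([
        base,
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # binary fracture prediction
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.AUC()])
    return model
```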

The created experimentation application implemented the loading of the input images and related data, the preprocessing and augmentation of the images, the training of the model, prediction on the test data, prediction visualization and reporting of the prediction results. The implemented experimentation process is illustrated in Figure 5.2.

Figure 5.2. The experimentation process implemented in the application.

5.2.3 Reporting and visualizations

Training and validation variables were logged throughout the training process, and the loss and AUC values of both sets were plotted to the same graph using TensorBoard. These graphs are important for understanding how the model's prediction and generalization abilities develop during the training period.

The model's weights were serialized and written to disk every time the best AUC on the validation set improved. This way, the optimal state of the model could be restored at the end of the training process. The restored model was then used to predict on the test set, and the overall ROC curve and AUC value were calculated from the prediction output.
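
This checkpointing behavior could be expressed, for example, as a custom Keras callback such as the sketch below; the file path and the use of scikit-learn for the AUC computation are illustrative.

```python
import numpy as np
import tensorflow as tf
from sklearn.metrics import roc_auc_score

class BestAucCheckpoint(tf.keras.callbacks.Callback):
    """Save the model weights whenever the validation AUC improves,
    so that the best state can be restored after training."""

    def __init__(self, val_data, val_labels, path="best_weights.h5"):
        super().__init__()
        self.val_data, self.val_labels, self.path = val_data, val_labels, path
        self.best_auc = -np.inf

    def on_epoch_end(self, epoch, logs=None):
        preds = self.model.predict(self.val_data, verbose=0)
        auc = roc_auc_score(self.val_labels, preds)
        if auc > self.best_auc:
            self.best_auc = auc
            self.model.save_weights(self.path)
```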

The prediction visualizations were also generated each time the best validation AUC improved. Both a heatmap and a guided gradient visualization were generated for selected validation samples. This resulted in a series of images visualizing the model's adaptation process during training. Since the DXA image pixels are not square in reality, all the illustrated DXA images and related visualizations are rescaled to appear anatomically correct.
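
As a simplified stand-in for the guided-gradient visualization, a plain input-gradient saliency map can be computed as sketched below. This is not the exact method used in the thesis, only an illustration of the gradient-based principle.

```python
import numpy as np
import tensorflow as tf

def saliency_map(model, image):
    """Gradient of the predicted fracture probability with respect to the
    input pixels, normalized to [0, 1] for display."""
    x = tf.convert_to_tensor(image[np.newaxis, ...], dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        prediction = model(x)
    grads = tape.gradient(prediction, x)[0]
    sal = tf.abs(grads)
    sal = (sal - tf.reduce_min(sal)) / (tf.reduce_max(sal) - tf.reduce_min(sal) + 1e-8)
    return sal.numpy()
```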

5.3 Model architectures

The experiments were conducted with five different model architectures. Three custom models were created for the task, and two pretrained models were used with different levels of fine-tuning. The custom model architectures were designed to represent different depths. The idea behind this is to analyze the effect of increasing representational capacity on the results and visualizations. Additionally, a single-layer perceptron model was created to represent the traditional machine learning approach.

5.3.1 Custom models

The custom models are based on repeating sequences of convolutional, pooling and rectifier layers. These convolutional blocks provide the basic CNN process of convolution, downsampling and ReLU activation. The output from those layers represents the high-level features in the data. Adding a fully connected layer before the output layer enables the learning of non-linear combinations of those features. Finally, the output layer produces the prediction.

The shallow custom model was designed to act as a default example of a CNN for simple pattern recognition tasks. Its architecture was inspired by the many CNN architecture examples in documentation and training material available online; similar architectures have been used in problems such as handwritten digit classification. It has three convolutional layers, two pooling layers, three rectifier layers and a fully connected layer. The intuition was that the model would extract simple features and make a rough approximation.

Since our input images are relatively simple and small, a filter size of 5x5 pixels was used with a stride of 1. This is a very common filter size in CNNs for small images. The number of filters was set to 8 in the first layer and then doubled in each following layer, following the recommendation of Smith & Topin (2017). The pooling layers use max pooling with a 2x2 filter size and a stride of 2, which is a classic setup for pooling with no overlap. The 'same' padding scheme was used for both convolutional and pooling layers. It adjusts the amount of padding so that a convolutional layer's output size matches its input size, preventing the image from shrinking between the layers. The architecture of the shallow custom model is shown in Table 5.2.

Table 5.2. The architecture of the shallow custom model.

Layer                   Conv. layers   Filter size   Pooling   Activation   Filters/nodes
Input layer             -              -             -         -            150x150
Convolutional block 1   1              5x5           Max       ReLU         8
Convolutional block 2   1              5x5           Max       ReLU         16
Convolutional block 3   1              5x5           -         ReLU         32
Fully connected layer   -              -             -         -            128
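
For reference, the architecture in Table 5.2 can be expressed in Keras as in the following sketch. The thesis implemented its custom models with low-level TensorFlow, so this is an equivalent re-expression rather than the original code; the single-channel input and the sigmoid output node are assumptions.

```python
from tensorflow.keras import layers, models

def shallow_custom_model(input_shape=(150, 150, 1)):
    """Keras re-expression of the shallow custom model described in Table 5.2."""
    return models.Sequential([
        layers.Conv2D(8, 5, strides=1, padding="same", activation="relu",
                      input_shape=input_shape),                              # block 1
        layers.MaxPooling2D(pool_size=2, strides=2, padding="same"),
        layers.Conv2D(16, 5, strides=1, padding="same", activation="relu"),  # block 2
        layers.MaxPooling2D(pool_size=2, strides=2, padding="same"),
        layers.Conv2D(32, 5, strides=1, padding="same", activation="relu"),  # block 3
        layers.Flatten(),
        layers.Dense(128, activation="relu"),   # fully connected layer
        layers.Dense(1, activation="sigmoid"),  # binary fracture output (assumed)
    ])
```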
