
Gene expression prediction with machine learning; Predicting cancer types from ATAC-seq samples


Academic year: 2022

Jaa "Gene expression prediction with machine learning; Predicting cancer types from ATAC-seq samples"

Copied!
25
0
0

Kokoteksti

(1)

GENE EXPRESSION PREDICTION WITH MACHINE LEARNING

Predicting cancer types from ATAC-seq samples

Bachelor’s thesis
Faculty of Information Technology and Communication Sciences
Examiner: Prof. Matti Nykter
May 2020


ABSTRACT

Tuomas Mäenpää: Gene expression prediction with machine learning
Bachelor’s thesis
Tampere University
Information Technology
May 2020

Gene expression is the process of building new proteins based on the information stored in DNA. Despite containing the same DNA, cells in our bodies vary greatly in features and in functionality. The selection of which genes are turned on in a cell is called gene regulation.

Regulation happens throughout the production process of a cell. One of the regulating factors is the structure of the chromatin. Chromatin is formed by DNA and the proteins it’s wrapped around.

The tightness of the structure defines how readable the genes in sections of DNA are. This study examines the prediction of gene expressions from chromatin accessibility with machine learning.

In addition, the effect of limiting the accessibility data to only certain sections of DNA is considered.

The use of machine learning in predicting gene expression is assessed with three machine learning methods: logistic regression, multilayer perceptron and random forest. The implementations of the selected methods are from the Scikit-learn Python library.

The data used in the study consisted of 410 tumor samples across the 23 most common human cancers. Each sample had its chromatin accessibility measured from 562 709 locations.

The data was split into training and testing data sets. 80 % of each cancer type’s samples were used in the training and the remaining 20 % were used in the testing of the models. The same split was used for each classifier both in classification with all features and in classification with limited features. The goal of the classification was to predict the cancer type of the sample based on the chromatin structure.

The classification was first conducted with all features available in the data, followed by limiting the data to areas of intergenic DNA and genes’ promoters. These areas are known to regulate the gene expression.

The results of this study show that machine learning methods are able to classify tumor samples from chromatin structure with high accuracy. Each classifier reached a classification accuracy of 90 %. Limiting the data to intergenic DNA and promoters didn’t have a notable effect on the performance of the classifiers: 5 out of 6 classification score averages reached the 90 % threshold, with random forest classification on promoter data averaging 88.9 %.

Keywords: machine learning, ATAC-seq, multilayer perceptron, logistic regression, random forest, gene expression

The originality of this thesis has been checked using the Turnitin OriginalityCheck service.


Tuomas Mäenpää: Geenien ilmentymisen ennustaminen koneoppimismenetelmien avulla (Gene expression prediction with machine learning)
Bachelor’s thesis
Tampere University
Information Technology
May 2020

Gene expression refers to the production of new proteins based on the hereditary information read from DNA. Although the DNA found in all cells is the same, cells vary greatly in their properties. The selection of which genes are expressed is known as gene regulation, and it takes place throughout a cell’s production process. One of the regulating factors is chromatin accessibility, which affects the readability of the DNA wrapped around proteins. This thesis studies the prediction of gene expression with machine learning methods from data measuring chromatin accessibility. In addition, it examines how restricting the data to certain regions of DNA affects classification accuracy.

The suitability of machine learning methods for predicting gene expression was assessed with three machine learning models: logistic regression, a multilayer perceptron network and random forest. The implementations of the methods were taken from the Scikit-learn Python library.

The data used in the study contains 410 samples from 23 different cancer types. Chromatin accessibility was measured at 562 709 locations in each sample. The data was split into training and test sets so that 80 % of each class’s samples were used for training the models and the remaining 20 % for testing them. The same split was used for all classifiers, both with the genome-wide data and with the data restricted to certain regions of DNA. The goal of the classification problem was to classify cancer types based on the chromatin accessibility data.

Cancer types were classified both with the genome-wide data and with data restricted to the regions of intergenic DNA and promoters. These regions are generally known to regulate gene expression.

The results show that machine learning methods are able to classify tumor samples by cancer type based on chromatin accessibility data. Each method reached a classification accuracy of 90 %. The data restricted to intergenic DNA and promoter regions performed as well in classification as the genome-wide data: every classification method also reached nearly 90 % accuracy with the restricted data.

Keywords: machine learning, ATAC-seq, multilayer perceptron, logistic regression, random forest, gene expression

The originality of this thesis has been checked using the Turnitin OriginalityCheck service.


CONTENTS

1 Introduction
2 Background
   2.1 Gene regulation
      2.1.1 Chromatin accessibility and transcription
   2.2 Related work
3 Methodology
   3.1 Logistic regression
   3.2 Multilayer perceptron
   3.3 Random forests
4 Experiments
   4.1 Data
   4.2 Results
5 Conclusions
References


LIST OF FIGURES

2.1 Chromatin structure
3.1 Logistic regression
3.2 Artificial neuron structure
3.3 Multilayer perceptron
3.4 Decision tree
4.1 Class division in the data set
4.2 Confusion matrix from logistic regression classification with all features
4.3 Confusion matrix from multilayer perceptron classification with all features
4.4 Confusion matrix from random forest classification with all features
4.5 Confusion matrices from logistic regression classification with limited features
4.6 Confusion matrices from MLP classification with limited features
4.7 Confusion matrices from random forest classification with limited features


LIST OF TABLES

4.1 Metrics of the data used
4.2 The mean scores of the classifiers


1 INTRODUCTION

Cancer is one of the leading causes of death worldwide, causing an estimated 9.6 million deaths every year and making it the second most common cause of death overall [1]. One of the key factors in successfully treating cancer is detecting the disease early [2]. The rapid development of machine learning and the growing amount of biological data may enable deadly diseases such as cancer to be predicted and detected earlier, resulting in better chances of survival for patients.

This study examines the possibility of predicting gene expression from chromatin structure with machine learning. The data set used in the study comes from Corces et al.’s study "The chromatin accessibility landscape of primary human cancers", in which 410 tumour samples had their chromatin accessibility measured with the Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq) [3]. The goal is to classify which cancer type each sample is taken from based on its chromatin accessibility. The task is then developed further by limiting the chromatin data to certain biological areas to determine how that affects the classification accuracy. The machine learning tools used in the study are from the Scikit-learn library. The methods used in classification are logistic regression, a random forest classifier and a multilayer perceptron.

The thesis is structured as follows. Chapter 2 presents the biological background of the study and reviews other publications related to gene expression prediction. The machine learning methods used in this study are covered in the third chapter. The fourth chapter reviews the data set used in the study, the experiments conducted and their results. The final chapter summarizes the results and their implications.


2 BACKGROUND

Machine learning is one of the biggest trends in information technology today. It is a field of artificial intelligence in which algorithms learn from data and improve their performance in the tasks they are designed for. The growing amount of data, as well as the increase in available computational power, has accelerated the development of machine learning. [4] Applications for machine learning are constantly being developed across fields from entertainment to medicine. In medicine, applications can be used, for example, in risk assessment and in finding common features between illnesses that human physicians might not be able to detect [5].

Gene expression is the process in which the information stored in the genes found in DNA is used to create new functional products, mostly proteins. The process is found in all known forms of life, from eukaryotes like humans to prokaryotes such as bacteria. The way genes are expressed determines the shape and features of multicellular forms of life. [6] All cells in our bodies share the same DNA but can perform widely different tasks. For example, neurons and liver cells do entirely different jobs but are still products of the same DNA. The difference comes from the genes expressed in these cells. Neither humans nor other eukaryotes express all of their genes at once. The process of selecting which genes to express from DNA is called gene regulation. [7]

2.1 Gene regulation

Gene regulation is the process that determines which genes from DNA are expressed in the functional products produced. The regulation can happen at varying stages of gene expression, and often happens at multiple different stages. The earliest stage where genes are regulated is the structure of the chromatin; the chromatin is formed by DNA and the proteins it’s wrapped around [8]. Regulation also happens at transcription, the process where information from DNA is copied to an RNA molecule. After transcription, genes are regulated at later stages: the processing of the RNA, the translation of the RNA into the amino acid chains that form proteins, and post-translational control. [7] This study focuses on the regulation that happens at the beginning of the expression: the chromatin structure and transcription. Given the complexity of the regulation process, only these two stages are explained in more depth.


Figure 2.1. How chromatin accessibility regulates which genes can be expressed [9].

2.1.1 Chromatin accessibility and transcription

The chromatin is made up of DNA wrapped around histone proteins [8]. The structure of the chromatin regulates how readable genes are: the more open the chromatin structure is, the more available the genes in that area are for transcription. [3] This is pictured in figure 2.1. The upper image visualizes a section of DNA wrapped tightly around histone proteins, making the genes of that sequence inaccessible for transcription. The lower image depicts a section of DNA wrapped around histone proteins more loosely, leaving the genes in that sequence available for transcription.

Transcription is the part of the expression process in which sections of DNA are copied into an RNA molecule. In transcription, sets of proteins called transcription factors attach to the DNA at the sections that are copied to the RNA molecule being produced. [6] A gene being transcribed does not guarantee that the gene is expressed, but genes that are not transcribed cannot be expressed. [7]

In order to create a new RNA molecule, an enzyme named RNA polymerase needs to attach to the DNA of the gene. The sections where RNA polymerases attach are called promoters. In eukaryotic gene expression the RNA polymerase cannot attach to the gene by itself; it needs the help of general transcription factors to accomplish this. Transcription factors make it either easier or harder for the RNA polymerase to bind to the promoter of the gene, thus regulating which genes are expressed [6].


2.2 Related work

Machine learning has been used in other studies of gene expression and regulation. Thibodeau et al. studied the prediction of enhancers from ATAC-seq samples in their article "A neural network based model effectively predicts enhancers from clinical ATAC-seq samples". Enhancers regulate genes in the transcription process by making their target genes more likely to participate in expression. The study found that the neural network model developed was well suited for enhancer prediction with ATAC-seq data. It also concluded that the chromatin accessibility of many primary human cells is yet to be thoroughly mapped. [10] More thorough mapping of the human genome would enable more studies like this, possibly revealing new details about gene regulation by the chromatin structure.

In the article "Integrating regulatory DNA sequence and gene expression to predict genome-wide chromatin accessibility across cellular contexts", Nair et al. experimented with predicting chromatin accessibility with machine learning. Multi-modal residual neural networks were used with gene regulatory data to predict chromatin-wide accessibility profiles. Using residual convolutional neural networks in chromatin accessibility prediction yielded better results than previous architectures. The researchers were able to bring new insight into regulatory chromatin dynamics with the model, and studying it revealed that it was possible to learn motifs of general and lineage-specific transcription factors. [11]


3 METHODOLOGY

Three different machine learning methods were used in an attempt to determine whether predicting gene expression is possible with machine learning. The same methods are also used later on to examine how much limiting the data to selected areas of the gene affects the classification accuracy. The methods selected are logistic regression, the multilayer perceptron (MLP) and random forest. The implementations used are from the Scikit-learn Python library [12]. The methods are used as off-the-shelf methods, since the goal of the study focuses more on the application than on the development of the models themselves.

3.1 Logistic regression

Logistic regression is a linear classification model. It can be used for both binary and multiclass classification problems. Logistic regression fits a logistic function to the data so that a data point’s projection on the curve gives the probability of the data point belonging to the class in question [13]. Figure 3.1 depicts the logistic function fitted to scattered data points.

The optimal parameters for the logistic regression model are estimated using maximum likelihood [13]. Maximum likelihood estimation means finding the parameters of the logistic function that maximize the likelihood of the training data points belonging to their correct classes. Finding the maximum likelihood is an iterative process: the program starts with some default coefficients, calculates the solution and compares it to a set criterion. If the solution does not satisfy the requirements, the calculation is carried out again with different coefficients. This procedure is repeated until the requirements are satisfied or the maximum number of iterations is reached. [14]

For multiclass problems, logistic regression has a number of different approaches. The one used in this study is commonly known as "one-vs-rest", where a binary problem is fitted for each label [15]. The binary problem is formed by marking all but one of the classes as zeroes; the logistic function is then fitted to that data as described above. The process is repeated for each class. When classifying new data, the class with the highest classification probability is selected.
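The one-vs-rest scheme described above can be sketched with Scikit-learn, the library used in this thesis. The data below is a synthetic stand-in for the ATAC-seq matrix, and all sizes are illustrative, not the thesis configuration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in: 120 samples, 20 features, 3 classes
# (the real data has 410 samples and 562 709 features).
X, y = make_classification(n_samples=120, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)

# One-vs-rest: one binary logistic regression is fitted per class,
# and prediction picks the class with the highest probability.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, y)

probs = clf.predict_proba(X[:1])   # per-class probabilities for one sample
pred = clf.predict(X[:1])[0]       # class with the highest probability
```

One binary estimator is stored per class in `clf.estimators_`, making the one-vs-rest decomposition explicit.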


Figure 3.1. An example of the logistic function fitted to data points. Figure modified from Scikit-learn documentation. [16]

3.2 Multilayer perceptron

Multilayer perceptron (MLP) is the basic structure of an artificial neural network. Neural networks imitate the way brains work by making decisions in numerous single units, neurons. Neurons consist of three basic components:

1. input synapses
2. a summation function
3. an activation function [17].

Input synapses receive the input signal. Each synapse multiplies the input by a weight before the inputs are added together in the summation function. Neurons often have an individual bias that is added to the summation as well. The result of the summation function is then given to the nonlinear activation function. The activation function is the decision-making block of the neuron: it sets the neuron’s output to either 1 or 0. The structure of a single neuron is presented in figure 3.2.
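The neuron described above (weighted inputs, a bias, a summation and a sigmoid activation thresholded to 0 or 1) can be written out in a few lines. This is an illustrative sketch, not code from the thesis:

```python
import numpy as np

def neuron(x, w, b):
    """One artificial neuron: a weighted sum plus bias (the summation
    function) passed through a sigmoid activation, thresholded to 0 or 1."""
    z = np.dot(w, x) + b            # summation of weighted inputs and bias
    a = 1.0 / (1.0 + np.exp(-z))    # sigmoid activation
    return 1 if a >= 0.5 else 0     # binary output of the neuron

# Two input synapses with weights 0.8 and -0.4 and a bias of 0.1:
# z = 0.8*1.0 - 0.4*2.0 + 0.1 = 0.1, so the sigmoid output exceeds 0.5.
out = neuron(np.array([1.0, 2.0]), np.array([0.8, -0.4]), 0.1)
```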

MLPs are typically structured as follows:

1. an input layer
2. hidden layers
3. an output layer.

The structure of an example MLP with two hidden layers is presented in figure 3.3. By definition, MLPs have at least one hidden layer in addition to the input and output layers [19]. Generally, the input layer’s nodes don’t have an activation function, since their purpose is to feed the input to the network.

Figure 3.2. The structure of an example neuron with a sigmoid activation function. Adapted from source [18].

Figure 3.3. The structure of an example multilayer perceptron. Adapted from source [20].

The learning process of neural networks is based on tuning the weights and biases of the neurons. In supervised learning, the network is given the correct answer for each input signal, and the weights and biases of the neurons are iteratively tuned to minimize the errors of the model.

The output layer consists of one or more neurons and output signals. The output of the network depends on the purpose of the model. [17] In a binary classification problem the output can be just one signal set to 1 or 0. In multiclass cases the output layer can consist of the same number of neurons as there are classes. If so, the output neurons represent the classes and form the output by activating the neuron that represents the predicted class.

Figure 3.4. An example of the structure of decision trees.
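A minimal sketch of such a network with Scikit-learn's MLPClassifier, on synthetic stand-in data; the layer sizes are illustrative, not the configuration used in the thesis:

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data: 200 samples, 20 features, 4 classes.
X, y = make_classification(n_samples=200, n_features=20, n_informative=10,
                           n_classes=4, random_state=0)

# Two hidden layers of 64 and 32 neurons; Scikit-learn adds the input
# layer and one output neuron per class automatically.
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
mlp.fit(X, y)

pred = mlp.predict(X[:5])
```

After fitting, `mlp.n_layers_` counts the input, the two hidden layers and the output layer, and `mlp.n_outputs_` equals the number of classes, matching the multiclass output-layer description above.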

3.3 Random forests

Random forests belong to the category of ensemble methods. The goal of ensemble methods is to build multiple small decision units that each give a prediction and then to merge the results to form the final prediction. [21] The decision units in random forests are decision trees.

Decision trees are a decision-mapping tool that navigates to a decision by testing a series of conditions and following the answers to the next conditions until the end of a branch is reached. The ends of the branches are the tree’s leaves, and they contain the decision the tree makes. Figure 3.4 depicts the structure of decision trees.

Decision trees are prone to overfitting, which means they easily learn the training data perfectly but don’t generalize well, and as a result don’t perform well on data they haven’t seen before. They are also fairly inflexible and can’t handle new labeled data. [4] In random forests, these issues are minimized by averaging the outputs of multiple estimators. Averaging the outputs greatly reduces the effect of the individual estimators’ variances. [13]

The learning process of random forests is based on bootstrap aggregating, also known as bagging. In bagging, each individual tree is given a subset of the data sampled with replacement. Trees are built by choosing a random set of variables from the available variables. From the selected variables the best split point is chosen and the node is split into two child nodes; the best split is the one that most accurately divides the data. This process is repeated until the set minimum node size is reached for the tree. [4, 13] Once all the trees are built, the random forest can be used for classification.
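The bagging procedure described above is what Scikit-learn's RandomForestClassifier implements. A minimal sketch on synthetic stand-in data (the parameter values are illustrative, not those used in the thesis):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 150 samples, 20 features, 3 classes.
X, y = make_classification(n_samples=150, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)

# Each tree is fitted on a bootstrap sample of the data and considers a
# random subset of features (sqrt of the total) at every split; the
# forest predicts by majority vote over the trees.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                bootstrap=True, random_state=0)
forest.fit(X, y)

pred = forest.predict(X[:1])[0]
```

The individual fitted trees are available in `forest.estimators_`, which makes it possible to inspect how much single trees disagree before their votes are merged.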


4 EXPERIMENTS

The goal of this study was to examine how well machine learning methods are suited to predicting gene expression from ATAC-seq samples. To evaluate this, three popular machine learning methods were used: logistic regression, multilayer perceptron and random forest.

First, the ability to predict the cancer type of a cell from the entire genome-wide chromatin accessibility was studied. Then, the data was limited to only given distances from transcription start sites (TSS) and the experiments were repeated.

4.1 Data

The Cancer Genome Atlas (TCGA) is a cancer genomics program founded in 2006 by the National Cancer Institute and the National Human Genome Research Institute. Since then, the project has produced multiple petabytes’ worth of biological data for the public to study and use. [22] The data used in this study is from Corces et al.’s article "The chromatin accessibility landscape of primary human cancers", in which ATAC-seq data was generated from 410 cancer samples from TCGA [3].

ATAC-seq is a method for mapping genome-wide chromatin accessibility. It measures DNA accessibility by inserting sequencing adapters into accessible parts of the chromatin. Sequencing reads can then be used to locate areas of increased gene availability and transcription-factor binding sites, and to map nucleosome positions. [23]

In the Corces et al. study, 562 709 transposase-accessible sites were identified for each sample [3]. Most of the samples in the data set had a technical replicate, meaning the tumor sample had its measurements taken twice. So that the data set would not contain essentially the same samples twice, the technical replicates had to be merged with one another. The merge was done by averaging each measurement.
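The replicate merge can be illustrated with pandas. The column names and sample identifiers below are hypothetical, since the source does not describe the exact layout of the data files:

```python
import pandas as pd

# Hypothetical layout: one row per measurement run, a "sample" column
# shared by technical replicates, and one column per accessibility site.
runs = pd.DataFrame({
    "sample": ["BRCA_01", "BRCA_01", "COAD_02"],
    "site_1": [2.0, 4.0, 1.0],
    "site_2": [0.5, 1.5, 3.0],
})

# Merge technical replicates by averaging each measurement per sample.
merged = runs.groupby("sample", as_index=False).mean()
```

After the merge the two `BRCA_01` runs collapse into one row whose site values are the means of the replicates.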

In the data set the cancer types are labeled with four-letter abbreviations to give a uniform naming scheme. The abbreviations and their corresponding cancer types are presented in table 4.1, which also shows the number of samples from each class available in the data set.

Since different types of cancer have varying rates of incidence, the number of samples also varies in the data set. The most common types of cancer, like breast cancer and colon adenocarcinoma, have dozens of samples, while the rarer types, like the different brain cancers, have fewer than ten samples featured in the data. The disparity between the cancer types featured in the data is further visualized in figure 4.1. The uneven division of classes was taken into consideration in training the models. The logistic regression and random forest classifiers were set to automatically set the class weights to be inversely proportional to the number of samples the class had in the training set [15, 25]. The MLP application programming interface (API) did not offer such an option, but the model was able to perform well regardless [26].

Abbreviation  Samples  Full name
ACCx          9        Adenoid Cystic Carcinoma
BLCA          10       Urothelial Bladder Carcinoma
BRCA          75       Breast Cancer
CESC          4        Cervical Squamous Cell Carcinoma
CHOL          5        Cholangiocarcinoma
COAD          41       Colon Adenocarcinoma
ESCA          18       Esophageal Carcinoma
GBMx          9        Glioblastoma
HNSC          9        Head-Neck Squamous Cell Carcinoma
KIRC          16       Kidney Renal Clear Cell Carcinoma
KIRP          34       Kidney Renal Papillary Cell Carcinoma
LGGx          13       Low Grade Glioma
LIHC          17       Liver Hepatocellular Carcinoma
LUAD          22       Lung Adenocarcinoma
LUSC          16       Lung Squamous Cell Carcinoma
MESO          7        Mesothelioma
PCPG          9        Pheochromocytoma/paraganglioma
PRAD          26       Prostate Adenocarcinoma
SKCM          13       Skin Cutaneous Melanoma
STAD          21       Stomach Adenocarcinoma
TGCT          9        Testicular Germ Cell Tumors
THCA          14       Thyroid Carcinoma
UCEC          13       Uterine Corpus Endometrial Carcinoma

Table 4.1. The abbreviations used, the number of samples in each class and the classes’ full names [24].

In order to train and test the classifiers, the data was split into training and testing sets as follows: 80 % of the samples in each class were used in training and the remaining 20 % were used in testing. The same split was used for all classifiers.
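A sketch of this setup with Scikit-learn, on synthetic stand-in data: `train_test_split` with `stratify` keeps the 80/20 proportion within every class, and `class_weight="balanced"` reproduces the inverse-frequency class weighting described for the logistic regression and random forest classifiers:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with deliberately uneven class sizes.
X, y = make_classification(n_samples=200, n_features=20, n_informative=10,
                           n_classes=4, weights=[0.5, 0.3, 0.1, 0.1],
                           random_state=0)

# stratify=y keeps the 80/20 proportion within every class.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)

# class_weight="balanced" weights each class inversely to its
# frequency in the training set.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
```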

First, the data was used without biological restrictions. All classifiers performed well, which was expected given the distinguishable peaks in each class’s chromatin landscape [3].


Figure 4.1. Class division in the data set.

In the next experiments, the data was limited by two conditions:

1. values measured within −250 kbp to 250 kbp of a TSS
2. values measured within −1 kbp to 100 bp of a TSS.

The first condition limits the data to the area of intergenic DNA, which is active in transcription [7]. The second condition limits the data to the promoters of genes [27].

The limiting of the data to the areas of intergenic DNA and promoters was done with the annotation software HOMER, which annotates each sample with genomic details concerning the measurement, including the distance from the TSS [27].

The first condition didn’t limit the dimension of the samples by much: 533 185 out of 562 709 values were still used in classification. The second condition was more limiting, leaving 30 238 of the 562 709 values available for classification. Still, both limitations leave high-dimensional data for the classifiers to operate on.
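The two conditions can be expressed as boolean masks over a per-site TSS-distance annotation. The arrays below are randomly generated stand-ins, since the real annotation comes from HOMER:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical annotation: a signed distance to the nearest TSS for each
# of 10 000 sites, as an annotation tool such as HOMER would provide.
tss_distance = rng.integers(-500_000, 500_000, size=10_000)
X = rng.random((410, 10_000))   # accessibility matrix: samples x sites

# Condition 1: sites within -250 kbp to 250 kbp of a TSS (intergenic DNA).
intergenic = X[:, np.abs(tss_distance) <= 250_000]

# Condition 2: sites within -1 kbp to 100 bp of a TSS (promoters).
promoter_mask = (tss_distance >= -1_000) & (tss_distance <= 100)
promoters = X[:, promoter_mask]
```

Masking the columns keeps every sample but drops the sites outside the chosen window, mirroring how the promoter condition removes far more features than the intergenic one.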

4.2 Results

Table 4.2 presents the mean classification scores of the classifiers for classification with all features, with intergenic DNA and with promoters. The scores show that machine learning methods are suitable tools for predicting the gene expression of cancer cells based on the structure of the chromatin. The results also show that limiting the chromatin structure data to only the areas of intergenic DNA or genes’ promoters is enough to reach equally high classification accuracies.

Classifier             All features  Intergenic DNA  Promoters
Logistic regression    90.9 %        91.0 %          90.7 %
Multilayer perceptron  92.7 %        90.1 %          90.4 %
Random forest          90.0 %        90.4 %          88.9 %

Table 4.2. The mean scores of the classifiers.

The errors of the classifiers were consistent: the same cancer types were misclassified similarly across the models. The confusion matrices in figures 4.2, 4.3 and 4.4 show the true and predicted labels for the classifications. The cancer types that caused the most errors were ESCA, HNSC, LGGx and GBMx. These errors can be considered fairly logical. Cancer in the esophagus being mistaken for cancer in the head-neck area is most likely a result of the areas being located in very close proximity in the human body. The same reasoning can be applied to LGGx and GBMx: the two types of brain cancer are categorized by the aggressiveness of the cancer cells, but they belong to the same group of tumours, gliomas [28].
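A confusion matrix like those in the figures can be computed with Scikit-learn; the toy labels below mimic the ESCA/HNSC confusion discussed above:

```python
from sklearn.metrics import confusion_matrix

# Toy labels reproducing the ESCA -> HNSC error pattern.
y_true = ["ESCA", "ESCA", "ESCA", "HNSC", "HNSC", "LGGx"]
y_pred = ["ESCA", "HNSC", "HNSC", "HNSC", "HNSC", "LGGx"]

labels = ["ESCA", "HNSC", "LGGx"]
# Rows are true labels, columns are predicted labels;
# off-diagonal cells count the misclassifications.
cm = confusion_matrix(y_true, y_pred, labels=labels)
```

Here `cm[0, 1]` counts the ESCA samples predicted as HNSC, and the trace of the matrix is the number of correct predictions.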

While the classes with a small number of samples in the data set proved difficult for the classifiers to learn, some of the more common cancer types also showed errors. For a better representation of the ability to predict the less common cancer types, the data set would need more samples from those classes.


Figure 4.2. Confusion matrix from logistic regression classification with all features.

Figure 4.3. Confusion matrix from multilayer perceptron classification with all features.


Figure 4.4. Confusion matrix from random forest classification with all features.

The same errors were present in the biologically limited classification cases. In addition, LUSC and ESCA were misclassified, with a likely logical explanation: LUSC and ESCA are cancers that appear in the lungs and the esophagus, which lie in near proximity in the body. Other errors involved BLCA, HNSC, COAD and STAD; the connection between those misclassifications isn’t as easily found. The confusion matrices of the classifications with the limited features are presented in figures 4.5, 4.6 and 4.7.

The figures show a similar spread of predictions as the previous matrices. The most notable differences are found in the MLP’s confusion matrices, where the limitation of the features had the largest negative impact on the classification score.


Figure 4.5. Confusion matrices from logistic regression classification with features limited to intergenic DNA (left) and promoters (right).

Figure 4.6. Confusion matrices from MLP classification with features limited to intergenic DNA (left) and promoters (right).

Figure 4.7. Confusion matrices from random forest classification with features limited to intergenic DNA (left) and promoters (right).


5 CONCLUSIONS

The early detection of cancer leads to a better chance of successfully treating the disease and increases the probability of recovery. The ability to predict gene expression could lead to being able to detect the risk of cancer from the structure of cells’ DNA. This study attempted to shed light on the suitability of machine learning methods for gene expression prediction.

The regulation of gene expression is a large and complex process, and using chromatin structure as the only regulatory element would be an over-simplification. Still, ATAC-seq provides excellent data that is useful in studying the effects of chromatin structure on gene expression. Another benefit of ATAC-seq is the amount of data it generates: machine learning methods often need large quantities of data, and being able to produce data in such high volume is beneficial when trying to create applications based on learning algorithms.

In classifying tumor samples based on ATAC-seq data, all tested methods proved well suited for the task, with each classifier reaching at least 90 % classification accuracy. The highest accuracy overall was achieved with the multilayer perceptron. However, logistic regression and random forests are also worthwhile options for future studies given their ability to reveal which features were the most important for the classification. This could be used to further study which sites of the chromatin are the most important regulatory elements for each cancer type.

Limiting the data available to the models didn’t affect their performance significantly: the algorithms were still able to achieve the same levels of accuracy even when the chromatin data was limited to genes’ promoters or areas of intergenic DNA. This shows that those sites alone provide enough information for classification.

All in all, the use of machine learning in medicine is still in its very beginning. With the amount of data being produced and stored, it seems possible that in the future machine learning applications could help doctors make diagnoses and possibly begin preventative treatment before patients start showing symptoms.


REFERENCES

[1] Latest global cancer data: Cancer burden rises to 18.1 million new cases and 9.6 million cancer deaths in 2018. International Agency for Research on Cancer (Sept. 12, 2018). URL: https://www.iarc.fr/wp-content/uploads/2018/09/pr263_E.pdf (visited on 03/18/2020).

[2] National Cancer Institute. Cancer Screening. URL: https://www.cancer.gov/about-cancer/screening (visited on 03/19/2020).

[3] Corces, M. R., Granja, J. M., Shams, S., Louie, B. H., Seoane, J. A., Zhou, W., Silva, T. C., Groeneveld, C., Wong, C. K., Cho, S. W., Satpathy, A. T., Mumbach, M. R., Hoadley, K. A., Robertson, A. G., Sheffield, N. C., Felau, I., Castro, M. A. A., Berman, B. P., Staudt, L. M., Zenklusen, J. C., Laird, P. W., Curtis, C., Greenleaf, W. J. and Chang, H. Y. The chromatin accessibility landscape of primary human cancers. Science 362.6413 (2018). Ed. by R. Akbani, C. C. Benz, E. A. Boyle, B. M. Broom, A. D. Cherniack, B. Craft, J. A. Demchok, A. S. Doane, O. Elemento, M. L. Ferguson, M. J. Goldman, D. N. Hayes, J. He, T. Hinoue, M. Imielinski, S. J. M. Jones, A. Kemal, T. A. Knijnenburg, A. Korkut, D.-C. Lin, Y. Liu, M. K. A. Mensah, G. B. Mills, V. P. Reuter, A. Schultz, H. Shen, J. P. Smith, R. Tarnuzzer, S. Trefflich, Z. Wang, J. N. Weinstein, L. C. Westlake, J. Xu, L. Yang, C. Yau, Y. Zhao and J. Zhu. ISSN: 0036-8075. DOI: 10.1126/science.aav1898. URL: https://science.sciencemag.org/content/362/6413/eaav1898.

[4] Sosnovshchenko, A. Machine Learning with Swift. eng. Packt Publishing, 2018. ISBN: 1-78712-151-8.

[5] Deo, R. C. Machine Learning in Medicine. Circulation 132.20 (2015).

[6] Miglani, G. S. Gene expression. eng. Oxford, U.K.: Alpha Science International Ltd., 2014. Chap. 7, 14. ISBN: 1-78332-058-3.

[7] Miglani, G. S. Gene regulation. eng. Oxford, U.K.: Alpha Science International, 2013. Chap. 4. ISBN: 1-78332-006-0.

[8] Wolffe, A. Chromatin structure and function. eng. 3rd ed. San Diego: Academic Press, 1998. Chap. 2. ISBN: 1-280-58293-6.

[9] OpenStax. Eukaryotic Epigenetic Gene Regulation. Mar. 23, 2016. URL: https://cnx.org/contents/GFy_h8cu@10.8:ES2pStNH@5/Eukaryotic-Epigenetic-Gene-Regulation (visited on 04/20/2020).

[10] Thibodeau, A., Uyar, A., Khetan, S., Stitzel, M. L. and Ucar, D. A neural network based model effectively predicts enhancers from clinical ATAC-seq samples. Scientific Reports 8.1 (2018), 16048. ISSN: 2045-2322. DOI: 10.1038/s41598-018-34420-9. URL: https://doi.org/10.1038/s41598-018-34420-9 (visited on 03/26/2020).


[11] Nair, S., Kim, D. S., Perricone, J. and Kundaje, A. Integrating regulatory DNA sequence and gene expression to predict genome-wide chromatin accessibility across cellular contexts. eng. Bioinformatics (Oxford, England) 35.14 (July 2019), i108–i116. ISSN: 1367-4811. DOI: 10.1093/bioinformatics/btz352. URL: https://doi.org/10.1093/bioinformatics/btz352.

[12] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M. and Duchesnay, E. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.

[13] Hastie, T., Tibshirani, R. and Friedman, J. The elements of statistical learning: data mining, inference, and prediction. eng. 2nd ed. Springer series in statistics. New York: Springer, 119–128, 587–597, 282. ISBN: 978-0-387-84857-0.

[14] Osborne, J. W. Best Practices in Logistic Regression. Pages 1–18. 55 City Road, London: SAGE Publications, Ltd, Apr. 2015. DOI: 10.4135/9781483399041.

[15] Scikit-learn. Logistic regression documentation. URL: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html (visited on 04/29/2020).

[16] Scikit-learn. Logistic function. URL: https://scikit-learn.org/stable/auto_examples/linear_model/plot_logistic.html (visited on 04/22/2020).

[17] Chandramouli, S., Dutt, S. and Kumar Das, A. Machine Learning. eng. Pearson Education India, 2018. Chap. 10. ISBN: 93-89588-13-8.

[18] Vanneschi, L. and Castelli, M. Encyclopedia of Bioinformatics and Computational Biology. 2019, 612–620.

[19] Manaswi, N. K. Deep Learning with Applications Using Python: Chatbots and Face, Object, and Speech Recognition with TensorFlow and Keras. eng. 1st ed. Berkeley, CA: Apress, 2018. Chap. 3. ISBN: 1-4842-3516-9.

[20] Hassan, H., Negm, A., Zahran, M. and Saavedra, O. Assessment of artificial neural network for bathymetry estimation using high resolution satellite imagery in shallow lakes: case study El Burullus Lake. International Water Technology Journal 5 (Dec. 2015).

[21] Scikit-learn. Ensemble methods. URL: https://scikit-learn.org/stable/modules/ensemble.html (visited on 04/24/2020).

[22] National Cancer Institute. The Cancer Genome Atlas Program. URL: https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga.

[23] Buenrostro, J. D., Wu, B., Chang, H. Y. and Greenleaf, W. J. ATAC-seq: A Method for Assaying Chromatin Accessibility Genome-Wide. Current Protocols in Molecular Biology 109 (Jan. 2015), 21.29.1–21.29.9. ISSN: 1934-3647. DOI: 10.1002/0471142727.mb2129s109. URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4374986/ (visited on 04/24/2020).


[25] Scikit-learn. Random forest documentation. URL: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html (visited on 04/29/2020).

[26] Scikit-learn. Multilayer perceptron documentation. URL: https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html (visited on 04/29/2020).

[27] Heinz, S., Benner, C., Spann, N., Bertolino, E., Lin, Y. C., Laslo, P., Cheng, J. X., Murre, C., Singh, H. and Glass, C. K. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. eng. Molecular Cell 38.4 (May 2010), 576–589. ISSN: 1097-4164. DOI: 10.1016/j.molcel.2010.05.004. URL: https://doi.org/10.1016/j.molcel.2010.05.004.

[28] Vigneswaran, K., Neill, S. and Hadjipanayis, C. G. Beyond the World Health Organization grading of infiltrating gliomas: advances in the molecular genetics of glioma classification. eng. Annals of Translational Medicine 3.7 (May 2015), 95. ISSN: 2305-5839. DOI: 10.3978/j.issn.2305-5839.2015.03.57. URL: https://doi.org/10.3978/j.issn.2305-5839.2015.03.57.
