• Ei tuloksia

Disentangling the characteristics of chromatin state and effects of novel therapeutics in acute lymphoblastic leukemia subtypes

N/A
N/A
Info
Lataa
Protected

Academic year: 2022

Jaa "Disentangling the characteristics of chromatin state and effects of novel therapeutics in acute lymphoblastic leukemia subtypes"

Copied!
65
0
0

Kokoteksti

(1)

DISENTANGLING THE CHARACTERISTICS OF CHROMATIN STATE AND EFFECTS OF NOVEL THERAPEUTICS IN ACUTE LYMPHOBLASTIC LEUKEMIA SUBTYPES

Sini Hakkola Master of Science thesis Master’s Degree Programme in Biomedicine University of Eastern Finland

Faculty of Health Sciences School of Medicine 4.8.2021

(2)

University of Eastern Finland, Faculty of Health Sciences, School of Medicine Master’s Degree Programme in Biomedicine

Sini Hakkola: Title of the Thesis

Master of Science thesis; 47 pages, 16 pages of supplementary material

Supervisors: Merja Heinäniemi, Mari Lahnalampi and Einari Niskanen, Institute of Biomedicine in University of Eastern Finland

4.8.2021

Keywords: single cell ATAC sequencing, bioinformatics, acute lymphoblastic leukemia, chromatin accessibility, pre-processing, machine learning

Abstract

Acute lymphoblastic leukemia is the most common pediatric malignancy and genetically a highly heterogenous disease. Single-cell next generation sequencing assays have potential to advance the development of novel ALL therapies through high-resolution view on individual cancer cells. Single-cell assay for transposase-accessible chromatin with high throughput sequencing (scATAC-seq) is a method for investigating chromatin accessibility genome-wide.

This project aimed to characterize the cell-to-cell heterogeneity and control of gene expression in two leukemic subtypes and two cell lines treated with promising therapeutics. A data analysis was conducted with Cell Ranger ATAC, cisTopic, Homer and GREAT. In RS4;11 cell line treated with a promising new therapeutic, Wee1 kinase inhibitor AZD1775, the activation of p53 was recognized in many cells. Differences between RUNX1 motif accessibility between cells was also detected. Similar results have been obtained in our group by analysing the same data with an another scATAC-seq analysis tool. These results have been experimentally validated, which indicates that the workflow used in this project was able to produce reasonable results. This project was also able to characterize intra-sample heterogeneity in bone marrow mononuclear cells in DUX4/ERG B-ALL subtype, including cell clusters representing different mononuclear cell types of the bone marrow with corresponding enrichment in transcription factor motif activity. These results may have relevance in disentangling new diagnostic and treatment approaches for this subtype in the future.

(3)

1 Introduction

Cancer is a disease characterized by high heterogeneity, which refers to subpopulations of cells with varying genotypes and phenotypes. Heterogeneity leads to distinct biological behaviour among individual cancer patients and even within the primary cancer and its metastases which is the result of distinct cancer cell subpopulations (Fisher et al, 2013;

Dagogo-Jack & Shaw, 2018). Moreover, due to their dynamic nature, cancers frequently become more heterogenous as the disease progresses (Dagogo-Jack & Shaw, 2018). Cancer heterogeneity significantly contributes to the variation in the response to therapies (Zhang et al, 2018). Therefore, the assessment of the heterogeneity is essential when developing new cancer therapies (Dagogo-Jack & Shaw, 2018).

Acute lymphoblastic leukemia (ALL) is a hematopoietic cancer resulting from a malignant transformation of B or T lymphoid progenitor cells involving abnormal differentiation and proliferation of the cells (Terwilliger & Abdul-Hay, 2017). Most cases develop from B lymphoid progenitors and the transformation of the cells causes the accumulation of malignant, poorly differentiated lymphoid cells in the bone marrow and blood stream, from which they can spread to other body parts, including spleen, lymph nodes and liver. ALL is the most common in pediatric patients and approximately 60% of ALLs are diagnosed before the age of 20, but it can manifest at any age (Terwilliger & Abdul-Hay, 2017; Malard

& Mohty, 2020). Most ALL cases arise in healthy individuals, but environmental risk factors during pregnancy and childhood such as infections, ionizing radiation and pesticide exposure as well as several inherited genetic syndromes have been identified to predispose individuals to ALL (Malard & Mohty, 2020). Predisposing genetic syndromes include Down’s syndrome, Fanconi anaemia and several inherited gene variants (e.g. CEBPE, ETV6, IKZF1, CDKN2A, CDKN2B), to name a few (Malard & Mohty, 2020). In the majority of childhood ALLs, the leukemogenic process starts in utero through gain or deletion of genetic material or formation of chromosomal translocations (Malouf & Ottersbach, 2018).

ALL is genetically a highly heterogenous disease, which compromises multiple subtypes with distinct genomic alterations, including single nucleotide variants (SNVs), chromosomal translocations and changes in chromosome number (Roberts, 2018; Rehn et al, 2020;

Terwilliger & Abdul-Hay, 2017). The targets of the aberrations include tumour suppressor genes, cell cycle regulators, lymphoid transcription factors (TFs) and chromatin modifiers

(4)

2 (Roberts, 2018). The prognosis of ALL is affected by the altered gene and the type and prevalence of the alteration, which vary between ALL subtypes and which also affect the treatment options of the disease (Roberts, 2018). Most pediatric B-ALL–cases, approximately 60%, comprise of hypodiploidy (< 45 chromosomes), high hyperdiploidy (51-67 chromosomes), translocations t(9;22) BCR-ABL1, t(1;19) TCF3-PBX1, Philadelphia chromosome-like (Ph-like) ALL, t(12;21) [ETV6-RUNX1] and rearrangement of mixed lineage leukemia gene KMT2A/MLL (MLLr) (Hunger & Mullighan, 2015; Tasian et al, 2015). The treatment outcome in pediatric ALL cases has improved significantly during the past decades, however, the dose intensity of conventional chemotherapy has reached its limits and it will be essential to replace chemotherapy with less toxic and more precise therapeutics in order to improve the therapy outcome of certain patients (Pui, 2020).

When occurring in adolescents, young adults and adult population, ALL is a challenging disease to treat due to increase in genotypes associated with poor outcome (e.g. BCR-ABL1) and decrease in phenotypes associated with favourable outcome (e.g. ETV-RUNX1, high hyperdiploidy) (Roberts, 2018). The long-term prognosis and survival in adults and adolescents is dismal and the prognosis and cure rates decline with increasing age (Roberts, 2018).

Relapse is the main reason for the failure of pediatric ALL treatment and the survival after relapse is relatively poor, approximately 40 – 70% depending on the risk groups and follow- up time (Oskarsson et al, 2016). Relapse of ALL occurs in approximately 20% of pediatric cases, and more than 50% of the adult patients and increases the risk of chemoresistance and death (Coccaro et al, 2019). Relapsed ALL is one of the most common causes of cancer mortality in pediatric patients (Heikamp & Pui, 2018). Risk stratification of genetic factors and other features (e.g., age of < 1 year or >10 years is considered a high risk) of pediatric ALL cases is used to optimize the type, duration, and the intensity of treatment. The risk (low, intermediate, high) reflects the patient’s response and toxicity of chemotherapy agents and the overall survival (Heikamp & Pui, 2018). For instance, TP53 mutations, MLL rearrangements and hypodiploidy (< 44 chromosomes) are considered as high-risk alterations and TCF3-PBX1 fusion and DUX4/ERG deregulation as intermediate-risk alterations in B-ALL (Heikamp & Pui, 2018). DUX4/ERG subtype frequently carries TP53 mutations, which are associated with poor prognosis (Ueno et al, 2020). Conventional cancer therapies are also suggested to pose a risk for relapse in a subset of pediatric ALLs.

(5)

3 Chemotherapy can facilitate the relapse in certain pediatric ALL cases by inducing the development of mutations causing resistance towards drugs (Li et al, 2020). Certain ALL subtypes have been noticed to undergo a switch towards myeloid lineage and loss of the immunophenotype of B-cells at early stage of treatment (Novakova et al, 2020). According to Novakova et al, 2020 the monocytic switch does not affect the risk of relapse or the prognosis of ALL with standard treatment. However, the gradual loss of B lymphocyte antigen CD19 expression accompanied by the switch may reduce the efficacy of CD19- directed chimeric antigen receptor T cells (CAR-T) leading to declined response to treatment (Jacoby et al, 2016; Novakova et al, 2020).

Next-generation sequencing (NGS) techniques are genome-wide sequencing methodologies increasingly used to uncover the aspects of cancer heterogeneity (Zhang et al, 2018). In the clinic, NGS can be used for detecting causative genomic alterations in hereditary cancers and other pathogenic mutations and variants, including translocations, inversions, insertions, deletions, and copy number variants. The information can then be used to guide the medical intervention of the disease (Neu et al, 2019). A number of NGS assays have been developed based on the complexity of the analysis as well as on the information to be obtained from the sample genome-wide, for instance on the transcriptome, chromatin accessibility, chromatin organization, and TF binding and histone modifications, whole genome and exome (Table 1) (Buenrostro et al, 2013; Thermes, 2014; Coccaro et al, 2019). The development of NGS methods has increased the knowledge regarding genomic heterogeneity of ALL and defining both somatic alterations and the role of inherited gene variants, with an increasing focus on the selection of molecular biomarkers, development of new therapeutics as well as clinical decision making (Coccaro et al, 2019; Heikamp & Pui, 2018).

(6)

4

Table 1. Examples of genome-wide next generation techniques. Large number of NGS applications have developed in last two decades and they vary based on the complexity of the analysis and the information that is obtained from the cells.

Genomic DNA of eukaryotes is packed as chromatin, and the basic structural unit of DNA packaging is a nucleosome, in which 145-147 base pairs (bp) of DNA is wound around eight histone proteins, which allows tight packaging of DNA in the nucleus but also regulates the transcription by sterically inhibiting the TFs and basal transcription machinery binding to DNA. TFs are proteins that are able to bind DNA in on specific sequences, motifs, at control elements (cis-regulatory elements, CREs), and control the transcription of the genetic information from DNA to RNA (Lambert et al, 2018; Vaquerizas et al, 2009; Ihn Lee &

Young, 2013). Cell function and response to the cellular environment is determined by the activity of TFs (Vaquerizas et al, 2009). Many TFs act as master regulators, which are expressed at relatively high levels compared to other TFs, and positively regulate the transcription of cell-type-specific genes and inhibit the transcription of genes specifying other cell types (Ihn Lee & Young, 2013). TFs regulate gene expression either by directly binding to gene promoters or more often, by binding to genomic regulatory elements called enhancers, and either activate or repress the transcription from core promoters by looping of the DNA between the enhancer and core promoters and recruiting coregulatory proteins and RNA polymerase II to target genes (Ihn Lee & Young, 2013; Reiter et al, 2017). Thus, TFs can regulate both nearby and distal genes (Ihn Lee & Young, 2013). As genetical aberrations of ALL often target lymphoid TFs, the identification of TFs impacting the disease development and maintenance of leukemia-specific cell states is necessary. Analysis of open chromatin and TF binding could enable the development of improved ALL therapeutics, as ATAC-seq data can be used to observe multiple aspects of transcriptional regulation (Roberts, 2018; Mehtonen et al, 2020). For instance, inhibition of CBFβ in leukemias driven

Method Information

Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq)

Chromatin accessibility across the genome Chromatin immunoprecipitation followed by sequencing

(ChIP-seq)

Protein interactions with DNA Chromatin conformation capture protocol using proximity

ligation (Hi-C)

Chromatin organization and interactions

Global run-on sequecing (GRO-seq) Binding sites of transcriptionally active RNA polymerase II RNA-sequencing (RNA-seq) Presence and quantity of RNA, i.e. the transcriptome Whole genome sequencing (WGS) The complete nucleotide sequence of genomic DNA Whole exome sequencing (WES) Protein-coding regions of a genome

(7)

5 by RUNX TF (Illendula et al, 2016) and inhibition of interaction between Menin protein and a regulator of transcription, the histone methyltransferase MLL1 (KMT2A) in MLL- rearranged leukemias (Krivtsov et al, 2019) have been suggested as potential therapeutic strategies in ALL.

Assay For Transposase Accessible Chromatin with high-throughput sequencing (ATAC- seq) is an in vitro deep sequencing technology for assaying open chromatin genome-wide.

ATAC-seq incorporates hyperactive prokaryotic Tn5 transposase to simultaneously tag and fragment open genomic regions with sequencing adaptors (Buenrostro et al, 2013). ATAC- seq protocol consists of three key steps: cell lysis, transposing the nuclei with hyperactive Tn5 transposase and amplification with PCR (Fig. 1) (Pott & Lieb, 2015). ATAC-seq allows identification of multiple genomic features, including promoters and enhancers. In the resulting signal profile open chromatin is represented as peaks (Baek & Lee, 2020). The ATAC-seq data can be used to identify active TFs and to recognize TFs that are important for certain cell states and consequently, which TFs could be potential drug targets.

When examining heterogenous tissues with numerous cell states, single cell level can capture the heterogeneity of the cells better than bulk data, which represents only the average of the cell population. In addition to capturing cell-to-cell heterogeneity, single-cell resolution enables the characterization of the molecular regulatory programs responsible for the variation between cells (Chen et al, 2019). The development of single-cell ATAC-seq (scATAC-seq) has made it possible to study chromatin accessibility at single cell resolution (Buenrostro et al, 2015; Satpathy et al, 2019). Two major approaches to obtain single cell resolution in ATAC-seq have been developed: microfluidics-based approach and split-pool cellular indexing (Satpathy et al, 2019; Baek & Lee, 2020). The sample preparation and library generation for single-cell ATAC-seq starts with nuclei isolation, after which the nuclei are transposed in bulk with Tn5 transposase (Fig. 1). In 10 x Genomics Chromium system, the nuclei are encapsuled in gel droplets or GEMs (gel bead on emulsion) containing oligomers with a unique barcode, which are added to DNA fragments on a microfluidic chip so that each barcode sequence represents a unique cell. After the droplet generation, the gel beads are dissolved, emulsion is broken and the tagged DNA fragments are PCR amplified and sequenced with high-throughput next generation sequencing platforms (Buenrostro et al, 2013; Satpathy et al, 2019; Baek & Lee, 2020). The active regulatory elements are highly cell type specific, which makes scATAC-seq an attractive method for understanding cell-to-

(8)

6 cell heterogeneity and control of gene expression (Granja et al, 2020). However, due to low copy numbers of DNA (two in diploid organism), scATAC-seq data is intrinsically noisy and sparse. Also, there is no best practice in scATAC-seq data analysis, regardless of numerous new methods and software tools that have been developed since scATAC-seq methodology was presented in 2015 (Baek & Lee, 2020; Buenrostro et al, 2015).

Figure 1. From averages to high-resolution. Comparison of sample preparation and library construction for ATAC-seq and scATAC-seq. ATAC-seq protocol involves transposing the cell nuclei with hyperactive Tn5 transposase, which simultaneously tags and fragments the open chromatin. The tagged DNA fragments are PCR amplified and sequenced with next generation sequencing platforms. Single-cell ATAC-seq starts with nuclei isolation, after which the nuclei are transposed with Tn5 transposase. In 10 x Genomics Chromium system, oligomers with a unique barcode are added to DNA fragments on a microfluidic chip so that each barcode sequence represents a unique cell. Fragments are amplified with PCR and sequenced. (Adapted from: Satpathy et al, 2019; Pott & Lieb, 2015).

Before downstream analysis, the scATAC-seq data must go through pre-processing of the reads, which involves demultiplexing of the data, adaptor trimming and alignment to reference genome (Baek & Lee, 2020). The 10 x Genomics Chromium Single Cell ATAC solution is a droplet-based technology, which encapsules single nuclei transposed by Tn5 transposase in gel droplets. The droplets contain oligomers with a unique barcode, which

(9)

7 are added to DNA fragments (Baek & Lee, 2020; Chen et al, 2019). All accessible DNA fragments from individual nuclei share a common barcode (Yan et al, 2020). Cell Ranger is a set of analysis tools provided by 10x Genomics for processing single cell data, and it includes workflows developed for scATAC-seq (What is Cell Ranger ATAC?, n.b.). Once the single cell data is available from the sequencer, Cell Ranger ATAC pipeline performs the above-mentioned steps and platform-specific cell-barcode assignment. Peak calling, cell-barcode detection, initial cell clustering and visualization are performed with cellranger- atac count algorithm. Peak calling is performed to distinguish the features of interest, the signal from open regions of genome, from the background noise generated by random Tn5 activity. A mathematical model is used to capture signal and noise, and to separate the barcodes associated with real cells from non-cells. ATAC produces a set of output files (Table 2), which can be uploaded in third-party analysis tools for downstream analysis (Single-Library Analysis with cellranger-atac count, n.b.)

Table 2. Description of Cell Ranger output and the file names. cisTopic can use as an input either singlecell.csv, peaks.bed and fragments.tsv.gz or possorted_bam.bam and filtered_peak_bc_matrix.

Quality control (QC) of scATAC-seq data includes correcting batch effects and removing barcodes with low or too high count-depth corresponding low-quality cells or doublets, respectively, which is often based on count depth and feature counts per barcode. Cell Ranger ATAC performs initial QC and downstream analysis tools also offer different steps for performing QC. Therefore, QC-metrics can also include enrichment of transcription start sites (TSS), ratio of reads in blacklist regions and nucleosome banding pattern. High-quality

Cell Ranger ATAC output description and file names singlecell.csv:

per-barcode fragment counts & metrics

web_summary.html:

HTML file summarizing data & analysis possorted_bam.bam:

position sorted BAM file representing the aligned

possorted_bam.bam.bai:

position sorted BAM file index representing the aligned reads peaks.bed:

BED file of all called peak locations

summary.json:

summary of all data metrics filtered_peak_bc_matrix:

filtered peak barcode matrix in mex format

filtered_peak_bc_matrix.h5:

filtered peak barcode matrix in hdf5 format filtered_tf_bc_matrix:

filtered tf barcode matrix in mex format

filtered_tf_bc_matrix.h5:

filtered tf barcode matrix in hdf5 format analysis:

directory of analysis files

cloupe.cloupe:

loupe browser input file fragments.tsv.gz:

barcoded and aligned fragment file

fragments.tsv.gz.tbi:

fragment file index peak_annotation.tsv:

peak-gene associations based on genomic proximity

peak_motif_mapping.bed:

peak-motif associations raw_peak_bc_matrix.h5:

Raw peak barcode matrix in hdf5 format

raw_peak_bc_matrix:

raw peak barcode matrix in mex format summary.csv:

summary metrics in CSV format

(10)

8 ATAC-seq data should be enriched by nucleosome-free fragments and around TSSs of the genes (Yan et al, 2020). Therefore, a successful ATAC-seq data produces a fragment size distribution plot with peaks at <100 bp corresponding to nucleosome-free regions and smaller peaks at approximately 200, 400 and 600 bp corresponding to mono-, di- and tri- nucleosomes, respectively. (Yan et al, 2020). Nucleosome-free regions should be enriched around the TSSs of genes. TSS enrichment score is a measure of signal-to-background in ATAC-seq data, and it refers to the relative enrichment of Tn5 insertions at TSS of genes compared to local background (Granja et al, 2021). TSS enrichment score is used as a QC metrics since the ATAC-seq data is generally more enriched around TSS regions compared to other genomic regions (Granja et al, 2021). Fragments are created by two transposition events by Tn5, and they are the DNA molecules measured by the ATAC-seq assays (Granja et al, 2021). The cells that pass the QC step are used to generate a feature matrix, which consists of the read counts per each cell (columns) and in each open chromatin peak feature (rows) (Chen et al, 2019). Data transformation and dimensionality reduction are then performed on the feature matrix (Chen et al, 2019).

cisTopic was selected as the main downstream analysis tool of the datasets. It was presented as one of the best performing computational methods for analysing scATAC-seq data in 2019 (Chen et al, 2019). cisTopic is a probabilistic Bayesian framework which discovers stable cell states and co-accessible enhancers in scATAC-seq data by co-optimized assignment of regulatory regions and cells into latent topics (Bravo González-Blas et al, 2019). In simplified terms, the open chromatin regions are first classified into regulatory topics and the topic contribution is then used to cluster the cells (Bravo González-Blas et al, 2019). The basic principle of the methodologies and input data processing are further outlined below.

cisTopic defines the regions based on single-cell BAM files representing the aligned read sequences and peak calling of defined regulatory regions from aggregated scATAC-seq profile or reference bulk profile. cisTopic can also utilize a precomputed matrix with regions as rows, cells as columns and read counts as values or a BED file with potential regulatory regions and fragments file from Cell Ranger ATAC count (Bravo González-Blas et al, 2019;

Chen et al, 2019). For data transformation cisTopic uses binarization of read counts, which assumes that in diploid genomes, each site is present twice at most and a region is considered accessible if at least one read is found per cell (Chen et al, 2019; Bravo González-Blas et al,

(11)

9 2019). The binarized matrix is then used to cluster the cells and regions and for further analyses, e.g., epigenomic enrichment, motif and pathway enrichment analyses (Buenrostro et al, 2015).

The basis of the statistical methods used for analysing scATAC-seq data by several tools are in text mining. Classical text analysis methods for data transformation, dimensionality reduction and clustering are applicable for analysing scATAC-seq data due to its binary nature, enabling the use of methodologies such as binarization for feature transformation, which makes the data analysis more efficient.

Term frequency-inverse document frequency (TF-IDF) is a data transformation technique used for scATAC-seq data analysis. It is designed for information retrieval from text documents by evaluating the relevancy of a word to a document in a collection of documents.

Similarly, it can be used to transform a scATAC-seq feature matrix by giving more weight for more uncommon features (peaks) in a dataset, which enables choosing the more variable and therefore, more informative features for downstream analysis (Pervolarakis et al, 2020).

Dimensionality reduction refers to the process of reducing the number of input variables in a dataset while keeping as much of the variation present in the original dataset (Nguyenid &

Holmesid, 2019). In the context of scATAC-seq, dimensionality reduction aims to simplify and denoise the high-dimensional datasets while keeping the signals of interest, i.e., the accessible genomic regions (Chen et al, 2019). Each cell in an scATAC-seq data set is made up of a set of accessible regions, like a text document consists of a set of words. Latent Semantic Indexing (LSI) is an example of a mathematical model used for reducing the dimensionality in scATAC-seq originally developed for text mining and it was first introduced for scATAC-seq data by Cusanovich et al, 2015 and is also utilized by scATAC- seq analysis package Signac (Stuart et al, 2019). LSI uses the TF-IDF transformed data as an input for singular value decomposition (SVD) to search for related features to deeply understand the content on a document (or a cell) (Chen et al, 2019). It relies to the distributional hypothesis, according to which features with similar meaning frequently appear together, which can also be applied to scATAC-seq data. The SVD generates a score matrix, based on which the dataset can be reduced to smaller subsets of features that best predict the data, which reduces its dimensionality. Dimensionality can be further reduced by e.g., uniform manifold approximation and projection (UMAP) (Chen et al, 2019).

(12)

10 The non-ordered scATAC-seq data also fulfils the “bag of words” assumption behind the technique used in cisTopic called Latent Dirichlet Allocation (LDA). In LDA, the order of the words or their grammatical role is disregarded, and each document is processed as a collection of features. LDA is a Bayesian unsupervised statistical model used for discovering abstract, natural groups of topics in a process called topic modelling (Blei et al, 2003;

duVerle et al, 2016). The origins of LDA are also in text mining and it is applicable to any collections of discrete data (Blei et al, 2003; duVerle et al, 2016). Briefly, LDA assumes a fixed number of topics representing a set of words in a group of documents (Fig. 2). The purpose is to map all the topics into documents based on the idea that topics represent a more abstract representation of similar words. In terms of scATAC-seq data, the cells play the role of documents, regions represent the words and the occurrence of a particular accessible region in the dataset corresponds to the number of times a single word is used in a document.

Each word (in scATAC-seq, a region) is represented by its probability to belong to a topic (Blei et al, 2003), i.e. a vector with length that corresponds to number of topics. A similar probability vector is also estimated for each document (in scATAC-seq, a cell), that represents the probability of the topic in the document. The topics represent abstract groups based on which the single cells and regions are grouped into subtypes. The dimensionality of the data is reduced as the data is modelled on the level of topics (regions-topic matrix or cell-topic matrix) instead of the original number of features (word-document or region-cell matrix). Mathematically, these probabilities are represented as two sets of Dirichlet distributions that best explain the data. The characteristics of scATAC-seq data, for instance, the non-ordered features, fulfil the assumptions LDA has for the data (Chen et al, 2019;

González-Blas & Aerts, 2020). The sparsity of the topic model can be controlled by the values of hyperparameters α and β that correspond to the scale and base parameters of the Dirichlet distribution (Blei et al, 2003; duVerle et al, 2016). In cisTopic v3, the topic modelling is based on WarpLDA, which is a fast, improved algorithm for LDA Allocation (Chen et al, 2016). As the memory capacity and the performance of computational processors have evolved, WarpLDA has been developed based on LDA to optimize the performance of the algorithm to better utilize the memory capacity of the processors.

(13)

11

Figure 2. Schematic overview of Latent Dirichlet Allocation (LDA). LDA assumes a fixed number of topics representing a set of words in documents. These imaginary topics cluster most words in each document based on their similarity, after which each word is represented by its topic proportions. Based on the topic contribution of the words in each document, the topics are allocated to documentsas illustrated by showing the prevalence (y-axis, figure on the left) of each topic in the documents. The topics are not known before the analysis, and they are latent unobserved variables in the mathematical model. There are no restrictions for the characteristics of the data LDA uses to determine the topics. Downstream analysis steps can be used to assign labels to characterize what type of words each topic represents i.e., the interpretation of the topics is required.

Clustering an unsupervised machine learning approach for grouping and classifying a given group of objects into several groups, i.e. clusters, with a wide range of machine learning and data analysis applications for various purposes in, for instance, engineering, medical sciences and technology (Saxena et al, 2017; Jiang et al, 2019). Clustering of binary cell- topic matrix in cisTopic is performed with density-based clustering algorithm. The algorithm assumes, that points that represent cluster centres with large local density ρ (rho) are surrounded by neighbouring points with lower density and are from large distance δ (delta) from other points with higher density (Fig. 3A and B). The first step of the algorithm is to assign the cluster centres as points with both large local density ρ and large distance δ from its nearest neighbour with higher density, after which the data is grouped into clusters.

Therefore, the value of ρ indicates the local density of a cluster centre and its minimum distance δ from other clusters (Rodriguez & Laio, 2014). The number of clusters is manually decided by the user by setting the values for ρ and δ and evaluating their suitability for the data in question based on a decision graph (Fig. 3A). Cluster core consists of the points with a density higher than ρ. In addition, the cluster borders are defined as points, which are assigned to a certain cluster and are within a distance of δ from belonging to other clusters.

The rest of the points are considered a cluster halo and can be interpreted as background noise. In large datasets the clustering results are robust to the choice of δ and sensitive to the magnitude of ρ, which is relative to the dataset (Rodriguez & Laio, 2014). Therefore, it is essential test and adjust the values of ρ and δ separately for each dataset, and evaluate their suitability based on a decision graph (Fig. 3B). As the choice of values also affects the

(14)

12 clustering, the clustering performance could be assessed by how compact they are in describing the data. Also, there are several metrics that can be used to validate the clusters, performance, including Silhouette (Rousseeuw, 1987) and Davies-Bouldin (Davies &

Bouldin, 1979).

Figure 3. Schematic overview of density peak clustering. (A) Point distribution of sample points, which are colored by red to describe assignment to a cluster. Black points represent cluster halos. (B) Decision graph in which the points with larger δ and ρ are considered cluster centres (colored with red). Adapted from Rodriguez

& Laio, 2014.

As scATAC -seq data analysis utilizes numerous computational methods with several versions of different algorithms (e.g. WarpLDA for LDA), it is often necessary to test the effect of parameter values to the results in each step of the analysis. Thus, benchmarking of new analysis approaches is important. Also, when utilizing tools in research, it may be necessary to test different analysis templates for different types of sample sets (e.g. cell line vs. primary samples) and experimental designs (e.g. effects of treatments vs. intra sample heterogeneity) in order to evaluate, what effects the choices made in the analysis steps have in relation to the biological interpretability.

In this thesis project, a total of seven scATAC-seq datasets consisting of replicates from leukemic cell models Nalm-6 and RS4;11 cells and two primary samples of mononuclear cells from the bone marrow of B-ALL patients representing subtypes of DUX4/ERG and high-hyperdiploidy (HeH) were first pre-processed with Cell Ranger ATAC pipeline. The pre-processing step was optimized to obtain only human cells from the data initially consisting of both human and mouse (not a part of this project) cells, to make it compatible with the downstream analysis tool cisTopic. Initially, the cell line data was selected as the

(15)

13 primary dataset for this thesis project. Specifically, previously existing bulk data from the Nalm-6 cells was deemed helpful to benchmark the new single cell data analyses. However, tools such as cisTopic were developed for characterizing intra-sample heterogeneity rather than for the across sample treatment effect comparison. Therefore, the rest of the samples were added to the analysis.

Materials and methods Data pre-processing

The scATAC-seq data produced with Chromium Single Cell ATAC was used in the data analysis. The nuclei isolation of two of the samples, T4 HeH B-ALL and T5 DUX4/ERG B-ALL, had been performed according to Global Run-On sequencing (GRO-seq) protocol instead of scATAC-seq. Sequencing had been performed in the FIMM Technology Center Sequencing Laboratory, Biomedicum, Helsinki, Finland using NovaSeq S2 sequencer aiming depth of 50 000 reads per cell. Initially, scATAC was performed as multi-species experiment (Barnyard), the scATAC-seq libraries used for the analysis consisting of 1:1 of both human (hg19) and mouse (mm10) cells (Fig. 4A) and was initially aligned with cellranger-atac count multiple species algorithm, which was designed for 1:1 human to mouse ratio and to detect barcode multiplets. Cell Ranger ATAC algorithm assumes that there is always background noise consisting of empty droplets and barcode multiplets (Gene Expression Algorithms Overview, n.d.; Cell Ranger ATAC Algorithms Overview, n.d.) In order to use only hg19 data for downstream analysis with cisTopic, the output from previous cellranger-atac count needed to be separated to consist of hg19 cells only (Fig. 4B).

(16)

14

Figure 4. The starting point of the data pre-processing. (A) Each scATAC-seq dataset was constructed of approximately 1:1 mouse and human cells. (B) Description of the human cells and conditions in the sample sets T1-T8 and the number of human and mouse cells associated with barcodes in Barnyard analysis. For the purposes of this project, the mouse cells needed to be excluded from the Cell Ranger output.

As there is no guidance provided by 10x Genomics for separating a subset of the Barnyard output, three different re-analysis approaches were attempted to pre-process the datasets.

First, only cells recognized as human cells in the Barnyard analysis were re-analysed and aligned to hg19 reference genome. Barnyard singlecell.csv files with per-barcode counts and metrics were used to create position sorted BAM files containing only hg19 cell information using bedtools2/2.27.1 and samtools/1.9. Using Cell Ranger’s bamtofastq (v. 1.3.2) conversion tool, the BAM files were converted to FASTQ format. Cell Ranger ATAC’s (v2.0) count pipeline was used for data pre-processing using the hg19 reference genome.

Secondly, all the cells recognized in the Barnyard analysis were aligned to the hg19 reference genome. This was performed by aligning the Barnyard analysis FASTQ files to hg19 reference genome with cellranger-atac count. Third, mm10 cells including multiplets with both mm10 and hg19 barcodes, were subsetted from Barnyard singlecell.csv output file. The subsetted CSV file was used to create position sorted BAM files using bedtools2/2.27.1 and samtools/1.9. Using CellRanger’s bamtofastq (v. 1.3.2) conversion tool, the BAM files were converted to FASTQ format. FASTQ files were then pre-processed using hg19 reference genome.

(17)

15 Downstream analysis with cisTopic

The workflow for the downstream analysis is described in figure 5. Cell Ranger ATAC output was used to generate a binary accessibility matrix with cisTopic. (Bravo González- Blas et al, 2019). cisTopic uses an accessibility matrix as an input, which was created from CellRanger output files singlecell.csv, peaks.bed and fragments.tsv.gz. Dimensionality reduction by topic modelling was performed with WarpLDA using the default values for hyperparameters α and β and 2:15, 20, 25, 35, 40, 45 and 50 topics for testing the model (Chen et al, 2016). The best topic model for each dataset was selected based on the second derivative on log-likelihood curve. Based on the best topic model, cisTopic represents the cellular resolution chromatin accessibility data as a matrix with topics as rows, cells as columns and topic contributions as values.

The normalized topic-cell distributions were retrieved for clustering and dimensionality reduction. Data was normalized using probabilities i.e., by dividing the assignment to each topic by the total number of assignments in that cell. For T1 Nalm-6 DMSO dataset, the alternative z-score (a numerical measure of value’s relation to the mean in a normally distributed data) normalization was tested. Clustering was performed with peak density algorithm (Rodriguez & Laio, 2014). Further dimensionality reduction to obtain a two- dimensional map of cells separated based on their topic profile and visualization was performed by UMAP setting the perplexity to a value of 200. All the samples were analysed up to this point, but the rest of the analysis was performed on samples T5 and T7 only, as the optimization of the workflow is required to analyse rest of the samples reliably.

The next step was to explore the region-topic distributions of the samples T5 and T7 using enrichment analysis, for which the top regions needed to be obtained from the probability distributions to work with a set of regions instead of the probability rankings. For selecting the top regions, cisTopic provides GammaFit method, for which a threshold is given to select regions above a certain probability. The probability threshold was set to a value of 0.975.

Based on the top regions (~ 2000-3500 per topic), enrichment of epigenomic features was performed on the T5 dataset to compare the topics to known open chromatin regions in mononuclear blood cells. Bulk ATAC signatures were obtained from 10 x Genomics (6k peripheral blood mononuclear cell PBMC dataset from a healthy donor), and they were used to assess which cells in the single cell data were enriched for certain cell-type–specific

(18)

16 signatures. The epigenomic regions in the bulk data were first intersected and mapped to the regions in the dataset with at least 40% overlap, after which the regions were ranked based on their probability (x-axis). In cisTopic’s implementation, one step increase in the y-axis is added if the region is present within a cell. Finally, the area under the curve (AUC) was used to assess the importance of signatures in cells. The cluster mean was used to assess the topic contribution to each cluster in datasets T5 DUX4/ERG B-ALL and T7 RS4;11 Wee1i. The cell-topic score normalized by probability distribution was used to calculate the cluster mean per topic.

Pathway enrichment analysis

Pathway enrichment analysis was performed using Genomic Regions Enrichment of Annotations Tool v4.2.0 (GREAT) (McLean et al, 2010). Pathway analysis was conducted on the top regions of the datasets T5 DUX4/ERG B-ALL and T7 RS4;11 Wee1i. The USCS hg19 assembly was used. First, the region coordinates are associated with TSS regions of genes, and this list of genes is used for hypergeometric test. Binomial testing is an alternative to hypergeometric test, and it takes into consideration, which portion of the genome is covered by the input to calculate the enrichment The user can decide, whether to use both tests or either one to conduct the analysis. Each test has a bias, which can be compensated by using both tests (McLean et al, 2010). According to BEDtools intersect, most of the top regions fall on TSS regions of genes, therefore the genomic regions were associated to single nearest gene within 1000 kb to observe the chromatin accessibility at and near genes and which GO terms are enriched in that setting. Also, the default option basal plus extension gave only a small number of terms per topic. Whole genome was used as background. False discovery rate (FDR) threshold of both binomial and hypergeometric tests was set to a value of 0.05. Based on both statistical tests, the top five GO Biological Process terms per topic as well as some topic-specific terms were used for visualization.

Motif enrichment analysis

Homer v4.10 was used to conduct topic-specific motif enrichment analysis on the top regions of the datasets T5 DUX4/ERG B-ALL and T7 RS4;11 Wee1i. Topic-specific BED- files, 25 in total, were converted to Homer peak files with bed2pos.pl. The genomic positions were analysed using findMotifsGenome.pl on the peak files to identify, which regulatory elements are enriched compared to other based on zero or one occurrence per sequence

(19)

17 (ZOOPS) scoring and binomial distribution (HOMER Motif Analysis, n.d.). In Homer output, there are often motifs, that can be recognized by several TFs in a same family of proteins. Therefore, the TFs whose own gene was active in the open chromatin were selected to narrow down the list of motifs. All hg19 RefSeq genes were downloaded from the USCS Table Browser, information on the TSS coordinates and gene names was extracted and converted to BED format. BEDtools intersect was used to overlap the TSS coordinates and the peaks.bed files of samples T5 and T7 to check the TSS signal of TF genes, whose motifs had p-values ≤ 1e-40 as well as some topic-specific motifs. The motifs, whose TSS was not present, were discarded and the rest were used for visualization. The Jupyter Notebooks and scripts used for conducting all the analyses are available in GitHub.

Figure 5. Downstream analysis workflow with cisTopic. (A) To begin with, the Cell Ranger ATAC count output was used to generate a binary accessibility matrix with cisTopic. (B) Next, WarpLDA implementation was used for to retrieve two probability distributions: the topic contributions per cell and regions contribution to a topic, producing matrices which enables clustering and representing the data with reduced dimensionality more easily than the original binarized matrix. (C) Cell-topic distribution was then used for clustering and visualization. Clustering was performed with peak density algorithm and visualization by UMAP. (D) Lastly,

(20)

18

topics were explored based on the region-topic distribution. Enrichment of epigenomic features was performed on sample T5. Bulk ATAC signatures were usedto compare the topics to known open chromatin regions in mononuclear blood cells. Top regions of samples T5 and T7 were used also to conduct motif analysis with Homer and pathway analysis with GREAT. (Adapted from Bravo González-Blas et al, 2019).

Results

Generating input matrices from mouse-human cell mix libraries

The original Barnyard analysis output contained approximately 1:1 mouse and human cells.

which only human cells were needed for the purposes of this project. Table 4 describes the number of human cells recognized by Cell Ranger ATAC count in each of the three approaches attempted to separate only human cells and remove empty droplets and doublets for the downstream analysis. The first approach was to align the cells recognized as human cells in the Barnyard analysis to the hg19 reference genome. Only 673 human cells were recognized, which was 23% of the human cells in the Barnyard analysis. On the second re- analysis approach, all cells were aligned to human reference. When all the Barnyard analysis cells were re-analysed and aligned to hg19 reference, the yield of hg19 cells was 4,499, nearly twice as large number as in the initial analysis. In the third approach, mouse cells and mouse/human doublets were removed from the re-analysis step. This time, the number of cells was very close to the initial number of human cells. The barcode knee plot (Supplementary Fig. 1A-7A) shows the fragments that overlap the peaks and barcodes that are associated with cells in the last pre-processing approach. In each dataset the separation between non-cells and cells has been good, indicated by a steep drop-off in the knee plot, yet the separation has been best in datasets T5-T8. In comparison, the separation was remarkably worse in Nalm-6 datasets, when only hg19 cells or all cells were used (Supplementary Fig. S8-S10). The output from the last pre-processing attempt was used for the downstream analyses.

Table 3. Results from single cell library analysis with Cell Ranger ATAC. Barnyard analysis results were the starting point for other three analyses. When cells recognized as hg19 in Barnyard analysis were aligned to hg19 reference, only a small amount of hg19 cells was recognized. Aligning all the cells to hg19 reference resulted in almost twice as large number of hg19 cells as the Barnyard analysis. When the mm10 cells and

(21)

19

mm10/hg19 doublets were subsetted, and the rest of the cells aligned to hg19 reference, result was close to the initial number of hg19 cells in the Barnyard output.

Evaluation of metadata quality based on QC metrics and WarpLDA

Selected quality control metrics visualisations from Cell Ranger pipeline are presented in the supplements (Supplementary Fig. 1-7). The fragment size distribution (Supplementary Fig. 1C-7C) of the datasets generated periodical peaks corresponding to mono-, and di- nucleosomes (~ 200 and 400 bp, respectively) and nucleosome-free regions (NFRs) (< 100 bp) as is typical for an ATAC-seq experiment (Yan et al, 2020). The fragments from NFRs should be enriched around the TSS regions with an enrichment of nucleosome-bound regions flanking the TSS regions. Indeed, the NFR fragments are mostly centred around TSS regions (Supplementary Fig. 1D-7D), yet there is a smaller peak corresponding to a +1- nucleosome positioning at ~ 200 bp from the centre corresponding to mono-nucleosomes.

Also, in datasets T1-T4, there is a slight peak at around +400 bp from the centre corresponding to di-nucleosomes. Therefore, according to the quality metrics provided by Cell Ranger ATAC, the quality of the datasets seems good.

Best topic model (number of topics) for each dataset was selected based on WarpLDA and the second derivative on the log-likelihood curve, which is the default method for selecting the best model. It measures the curvature between points on the curve and a higher value is an indicator that next model will not improve the log likelihood much (Bravo González-Blas

& Aerts, 2020). The best model varied from 3 to 13 topics between datasets. The value of ρ (local density) was linear with the number of cells in a dataset (Supplementary table 1).

Therefore, as the number of clusters was aimed to 5-7 depending on a dataset, the values of ρ for vary greatly between datasets, while the values of δ remain less than ten. Notable intra- and inter-sample variation in sequencing depth was observed. The sequencing depth between samples varies from 7,267 median counts per cell in sample T2 Nalm-6 DEX to 21,886 in T7 RS4;11, yet there is a large difference between min. counts per cell (hundreds

Only hg19 cells¹ All cells¹ mm10 cells subsetted¹ ²

Sample ID Human cells Human condition hg19 mm10 hg19 hg19 hg19

T1 Nalm-6 DMSO ctrl 24 h 2,969 2,985 673 4,499 2,619

T2 Nalm-6 dexamethasone 24 h 4,563 3,097 1,021 5,786 3,841

T3 Nalm-6 dexamethasone and ML-792 24 h 2,249 3,352 1,451 3,990 1,880

T4 B-ALL diagnosis 2,350 4,974 1,934

T5 B-ALL diagnosis 5,309 4,459 4,547

T7 RS4;11 AZD1775 24 h 5,288 5,119 3,825

T8 RS4;11 DMSO ctrl 24 h 6,658 3,27 5,724

¹ aligned to hg19 reference

² including hg19 and mm10 doublets

Barnyard analysis

(22)

20 in most datasets) and max. counts (tens or hundreds of thousands) per cell in each dataset.

Nevertheless, no filtering of the cells was performed based on the sequencing depth.

Supplementary figure S11 provides a UMAP visualization of T1 Nalm-6 dataset normalized using Z-score instead of probability, which resulted in ill-defined worm-shaped clusters.

Therefore, the probability was used for normalization instead. Density peak clustering of cells in each dataset normalized with probability as well as the UMAP visualizations of the metrics provided in the supplementary table 1 are shown in figures 7A and 12A and supplementary figures S12-S18. According to WarpLDA, the best model in T1 Nalm-6 DMSO only consisted of three topics, which is curious, as the number of topics is usually slightly larger than the number of potential cell states in a sample (Bravo González-Blas &

Aerts, 2020) but based on the sequencing depth and Cell Ranger ATAC quality metrics, T1 Nalm-6 DMSO dataset was not different from rest of the Nalm-6 datasets. Therefore, the Nalm-6 datasets were not analysed further in this thesis project. The separation of the cells and non-cells was best in datasets T5-T8, and the rest of the analysis workflow was conducted on them.

Promoters, intergenic and intronic regions represent a set of regions accessible in most Wee1i-treated RS4;11 cells.

The purpose of Wee1i treatment in RS4;11 cell line was to block the Wee1 kinase, which has a role in the control of cell cycle. According to experiments carried out by our group (Maljukova et al, manuscript in preparation) the treatment inhibits the proliferation of MLLr leukemic cells. Based on density-based clustering of the data, cells were divided into seven clusters and visualized as UMAP (Fig. 6A). The topic-cell distribution was used to calculate the cluster mean to define which topic contributed to which cluster. Based on the topic- cluster heatmap, topics 6 and 12 appear as general topics which contribute to most clusters (Fig 6B) The region-topic distribution was used to perform an annotation to genomic regions. In T7 RS4;11 Wee1i dataset promoters represent a set of regions accessible in most cells, yet topic 1 and 11 stand out from this pattern by representing mostly intronic and distal intergenic regions (Fig. 6C).

(23)

21

Figure 6. Topics 6 and 12 appear as general topics in RS4;11 Wee1i dataset and represent mostly proximal promoters. (A) Density peak clustering of the cells in RS4;11 Wee1i dataset. (B) Cluster-topic heatmap based on calculated cluster mean. The range of colours shows the range of the probability distribution. (C) Heatmap of the annotation to the genomic regions. The range of colours stands for the normalized AUC score.

Pathways related to regulation of cell cycle phase and cell damage are highly enriched in Wee1i-treated RS4;11 cells

The pathway analysis results of T7 RS4;11 Wee1i dataset are given as dot plots in figures 7 and 8. Topics in which the GO terms were not enriched are excluded from the dot plots. The dot plot of RS4;11 Wee1i sample pathway analysis shows an enrichment of a large set of pathways related to the negative regulation of cell cycle phase transition, particularly the G2/M phase, in several topics but mostly in topics 4 and 12 (Fig. 7A). Also, the cell damage- related pathways are highly enriched, yet these pathways are only enriched in topic 12, apart from regulation of signal transduction by class p53 mediator, which is enriched in topic 8 (Fig. 7B). Interestingly, topic 12 was also the topic that contributed to most cells (Fig. 6B).

(24)

22

Figure 7. Cell cycle (A) and DNA damage (B) related GO terms in RS4;11 Wee1i dataset.The range of colour red (dark red, red, light pink, grey) reflects the range of the -10 log10 converted FDR q-values. Fold Enrichment represents the Binomial Fold Enrichment of the pathway in a topic.

Immune response and homeostasis related pathways are mostly enriched in topics, that did not show enrichment in cell damage and cell cycle related GO terms (Fig. 8). Topic 5 is enriched with the terms related to regulation of immune response and topics 1 and 10 with the terms related to regulation of lymphoid cells (Fig. 8A). Homeostasis-related terms are concentrated in topics 6, 9 and 13 (Fig. 8B).

(25)

23

Figure 8. Immune response (A) and homeostasis-related (B) GO terms in RS4;11 Wee1i dataset. The range of colour red reflects the range of the -10 log10 converted FDR q-values. Fold Enrichment represents the Binomial Fold Enrichment of the pathway in a topic.

The accessibility of p53 and RUNX motifs is concentrated on specific topics in RS4;11 Wee1i cells

The motif analysis results of T7 RS4;11 are shown in Figures 9 and 10. RUNX-family motifs are highly accessible in the RS;411 Wee1i dataset, especially in topics 1, 11 and 7 (Fig. 9).

The tumour suppressor p53 motif is enriched in topic 2 along with its homolog p73 and NFκB-p65 motif in topic 5. The ETS TF-family is also highly accessible in RS;411 Wee1i cells (Fig. 9).

(26)

24

Figure 9. RUNX and ETS family, p53 homolog and NFκB-p65 motifs in the RS4;11 Wee1i dataset. The range of colours reflects the -10 log10 of the p-values of each motif and the size of the dot the percent of target sequences in a topic with the motif.

Topic 11 of RS;411 Wee1i sample shows an enrichment of basic leucine zipper (bZIP) AP-1 TF complex motifs of AP-1, JunB and Fos (Fosl2, Fra1, Fra2) family, and the closely related activating TF subfamily (Atf2 and BATF) with extremely low p-values (< 1e120) (Figure 10A). The bHLH motifs TCF4 and Tcf12 show accessibility in topics 1 and 7, in which also the RUNX motifs were observed to be highly accessible (Figure 10B).

Figure 10. Common motifs in RS4;11 Wee1i dataset. (A) bHLH and Zf motifs (B) bZIP AP-1 TF complex subunit motifs AP-1, Fra1, Fra2 and Fosl2 as well as closely related activating ATF family motifs BATF and Atf3 are highly enriched in topic 11 with p-values < 1e120. The range of colours reflects the -10 log10 of the p-values of each motif and the size of the dot the percent of target sequences in a topic with the motif.

Monocytic signatures show notable enrichment in the DUX4/ERG B-ALL dataset

The T5 DUX4/ERG B-ALL dataset consists of mononuclear cells from the bone marrow of a patient with B-ALL DUX4/ERG subtype. In addition to the same analyses performed on T7 RS4;11 Wee1i dataset, an epigenomic enrichment analysis was conducted on T5

(27)

25 DUX4/ERG dataset (Figure 11C). Based on density-based clustering of the data, cells were divided into six clusters and visualized as UMAP (Fig. 11A). Topic 7 appears as a general cluster contributing to each topic in T5 DUX4/ERG B-ALL (Fig. 11B) and it represents mostly proximal promoters (≤ 1) (Fig. 11D). Also, topics 11, 12 and 4 contribute to most clusters (Fig. 11B). Topics 11 and 12 constitute of several types of genomic regions whereas topic 4 constitutes of proximal promoters (Fig 11D). Topic 5 and topic 12 are enriched with stem cell–related signatures. Topic 2 is specific to cluster 5 and it represents especially exons, 3’ and 5’ untranslated region (UTR) sequences and distal promoters (1-3 kb). Topic 1 is highly enriched with monocyte related signatures (Fig. 11C). Topic 3 and 10 on the other hand are enriched with lymphocyte-related signatures and topic 6 specifically B-cell -related signatures. Topics 2,4,7 and 9 are not enriched with any cell-type-specific signatures and they mostly represent promoters, both proximal and distal. In fact, promoters represent a set of regions accessible in most cells.

Figure 11. T5 DUX4/ERG B-ALL dataset shows distinction in cell type specific epigenetic signatures and promoter accessibility between topics. (A) Density peak clustering of the cells. (B) Cluster-topic heatmap based on calculated cluster mean. The range of colours shows the range of the probability distribution. (C) Heatmap of the annotation to the genomic regions. The range of colours stands for the normalized AUC score.

(28)

26 DUX4/ERG B-ALL is characterized by enrichment of myeloid and lymphoid pathways The pathway analysis results of T5 DUX4/ERG B-ALL dataset are shown as dot plots in figures 12 and 13. Again, topics in which the GO terms were not enriched are excluded from the dot plots.The pathway analysis of DUX4/ERG B-ALL shows distinctive division of myeloid and lymphoid pathways between topics (Fig. 12). Topic 1, which is enriched with monocyte-related signatures based on the epigenomic enrichment analysis (Fig. 11C) is also enriched with monocyte-related pathways (Fig. 12A), whereas topics 6, 11 and 12 are clearly enriched with pathways related to lymphoid and other immune response related processes (Fig. 12B).

Figure 12. Pathways related to (A) myeloid and (B) lymphoid and immune response related processes in DUX4/ERG B-ALL dataset. The range of colours represent the range of the -10 log10 converted FDR q- values. Fold Enrichment shows the binomial fold enrichment of the pathway in a topic.

The general topic 7, which contributed to each cluster (Fig. 11B), is enriched by general GO terms contributing the basic maintenance functions of cells, for example chromatin organization, post-transcriptional regulation of gene expression and epigenetic regulation of gene expression (Fig 13A). Some apoptosis and cell stress related terms emerge, mostly in topics 5 and 11, yet their representation is much weaker than in T7 RS4;11 Wee1i dataset (Fig. 13B).

(29)

27

Figure 13. (A) Homeostasis and (B) cell damage related go terms in T5 DUX4/ERG B-ALL dataset. The range of colours represent the range of the -10 log10 converted FDR q-values. Fold Enrichment shows the binomial fold enrichment of the pathway in a topic.

The ETS family motifs are highly enriched in DUX4/ERG B-ALL

The motif analysis results as a dot plot for DUX4/ERG B-ALL are presented in Figures 14 and 15. The Erythroblast Transformation Specific (ETS) TF family members PU.1, GABPA, Fli1, Etc2, ETV1, ETS1, ETS, ERG, Elk4, Elk1 and Elf3 show remarkable accessibility in most topics (Fig. 14). The ETS family was also enriched in the T7 RS4;11 Wee1i dataset, but seemingly less. Also, PU.1, Fli1, Etc2, ETV1 and Elf3 did not show any notable enrichment in Wee1i-treated RS4;11 cells. PU.1 and RUNX motifs are especially enriched in topic 1 of DUX4/ERG B-ALL (Fig. 14B), which was enriched by the myeloid pathways (Fig. 12A). Two zinc finger motifs, Sp/XKLF TFs Sp1 and Sp2 show accessibility across all topics, Sp1 less than Sp2, and this is especially in topics 4 and 7 (Fig. 15A).

Interestingly, there is an anti-correlation between their accessibility and basic helix-loop- helix (bHLH) motifs Tcf12, EA2 and HEB, which are accessible in larger portion of cells in topics, where Sp1 and Sp2 are less accessible. Tumour suppressor p53 motif and its homolog p63 are non-existent (Fig 15B).

(30)

28

Figure 14. Enrichment of ETS TF motifs in DUX4/ERG B-ALL. The range of colours reflects the -10 log10 of the p-values of each motif and the size of the dot the percent of target sequences in a topic with the motif.

Figure 15. Common motifs and RUNX, p53 and IRF family motifs in DUX4/ERG B-ALL. (A) The enrichment of basic helix-loop-helix, zinc finger and basic leucine zipper motifs (B) The enrichment of RUNX, p53 and IRF family motifs. The range of colours reflects the -10 log10 of the p-values of each motif and the size of the dot the percent of target sequences in a topic with the motif.

Discussion

In this project, a data analysis was conducted on seven single-cell ATAC-seq samples of B- cell acute lymphoblastic leukemia with an aim to set up a data analysis workflow for scATAC-seq that would contribute towards understanding the cell-to-cell heterogeneity and control of gene expression in the samples. These seven samples consisted of two cell models:

three replicates of Nalm6 (DUX4/ERG) cells and two replicates of RS4;11 (MLLr) cells and two B-ALL diagnostic bone marrow samples of DUX4/ERG and HeH subtypes. The purpose of the cell line experiments was to study the mechanisms of actions of two promising B-ALL therapeutics. We also aimed to characterise the single-cell level heterogeneity of two B-ALL subtypes in a way, that would aid in disentangling their possible

(31)

29 diagnostic and treatment approaches in the future. HeH is the most common subtype of B- ALL and DUX4/ERG a newly identified subtype, which is currently difficult to diagnose with standard techniques (Inaba & Pui, 2021; Rehn et al, 2020), making these subtypes attractive subjects for research. In addition, to our knowledge, they have not been characterized by scATAC-seq assay yet. cisTopic, one of the best performing computational methods for analysing scATAC-seq data according to Chen et al, 2019, was used for the downstream analysis. Motif and pathway analyses were performed with Homer (Homer Software and Data Download) and GREAT (McLean et al, 2010), respectively. Based on this thesis work, cisTopic was able to recognize similar characteristics of the T7 RS4;11 Wee1i as the previous analyses conducted by our group and characterize intra-sample heterogeneity in bone marrow mononuclear cells in DUX4/ERG B-ALL subtype, including the enrichment of ETS family TFs and myeloid characteristics, that may aid channelling the study of the new treatment and diagnostic approaches for this subtype in the future. Indeed, cisTopic has shown to be successful in the analysis of scATAC-seq data from mononuclear cells (Chen et al, 2019). As cisTopic was experienced useful in the analysis of the DUX4/ERG subtype, the next step is to continue with the analysis of the T4 HeH B-ALL dataset in a similar way. Improving the workflow for comparing across sample treatment effects is still needed.

The pre-processing of the datasets produced with 10 x Genomics Chromium Single Cell ATAC was performed with Cell Ranger ATAC by 10 x Genomics. According to the quality metrics, the cell metadata of each sample was of good quality. As there is no guidance provided by 10x Genomics for separating a subset of the Barnyard output, three different re- analysis approaches were attempted to separate only human cells and regions from the original output consisting of both human and mouse cells and regions as features. When only the cells recognized as human in the Barnyard analysis were used for re-analysis, only approximately 20% of the initial number of human cells resulted as an output. Cell Ranger algorithm is based on the EmptyDrops method which uses the molecule profile of the barcodes to determine whether a droplet is empty or contains a cell (Lun et al, 2019) and to recognize barcode multiplets (human and mouse) and thus, estimate the total number of multiplets including mouse-mouse and human-human (Gene Expression Algorithms Overview, n.d.). Therefore, as the real background is removed nearly completely before alignment, cells with low or high signal might have been recognized as background and

(32)

30 filtered out, resulting in the low number of cells as the output. Also, as the algorithm estimates the number of fragments per barcode by subtracting a fixed, depth-dependent number of counts from all of the barcodes, the counts that are associated with barcodes may end up being too low in this approach (Cell Ranger ATAC Algorithms Overview, n.d) . The alignment of all the cells to hg19 reference genome resulted in too large number of cells, simply because some counts from mouse cells also align to the hg19 reference. Also, as the total counts is the sum of mm10 and hg19 cells, the level of remaining mm10 counts will likely be left too low. According to Waterston et al, 2002 approximately 40% of human and mouse genomes are alignable to each other. Therefore, hg19 alignment results in a too large number of hg19 cells as a proportion of mouse cells are aligned to human genome, and too few mm10 cells that are not aligned. As the mouse cells and mouse-human doublets were removed from the reanalysis, the output consisted of a cell number close to the original Barnyard analysis. Most likely there was still some background noise left from empty droplets and human-human multiplets, which allowed the Cell Ranger ATAC algorithm work as it was designed. The output files from the third re-analysis were used for downstream analyses.

As stated previously, the basis of machine learning approaches including clustering, dimensionality reduction and data transformation methodologies used for analysing scATAC-seq data are in text mining. None of the current methods in frequent use have been developed specifically for analysing scATAC-seq data. The analysis of scATAC-seq poses methodological challenges stemming from the data sparsity, which results from the high noise-to-signal ratio and low copy number of DNA (two in diploid organism) (Chen et al, 2019; Baek & Lee, 2020). A genomic locus can be accessible in both alleles, one allele or neither of the alleles. As most accessible regions are not transposed even in high-quality scATAC-seq data, many loci end up having no accessible alleles (Granja et al, 2021). What comes to the dimensionality reduction method used in this study, LDA has been assumed to process sparse data quite well (Durham et al, 2021), yet in the context of text analysis, LDA has challenges to handle unconventional datasets (e.g. posts in social media) and short documents, datasets of too few documents or documents with a large number of topics, i.e.

the length of the documents and number of topics have a crucial role in the performance of the model (Tang et al, 2014). Therefore, it can be assumed that the low or high number of features per cell in scATAC-seq data can also lead to mixed results. Also, there are only

(33)

31 theories to explicate the behaviour of LDA and no restrictions for the characteristics it uses to determine the topics, therefore there are no clear standards for the data properties affecting the performance of the model (Tang et al, 2014). On that account, the interpretation of the topics is challenging.

As there is no precise concept for “cluster”, several clustering methodologies have been developed with differences in the principles based on which the data features are included in clusters (Saxena et al, 2017). Density peak clustering has been proven to be highly efficient for assigning and elimination of noises, yet it has its disadvantages (Li & Zhang, 2020; Jiang et al, 2019). To begin with, the number of clusters needs to be manually decided by the user and making the method susceptible affected for subjective factors (Jiang et al, 2019). Additionally, the model lacks an uniform measure for density (Li & Zhang, 2020).

Lastly, as the method is highly sensitive to local density and distance, the model cannot adapt to differences in data structure and the cluster centres can be incorrectly selected (Zhuang et al, 2020). These pitfalls can affect the clustering results, which makes finding objective and reasonable cluster centres difficult. Still, the algorithm has been considered to have several advantages compared to other clustering methods (Zhuang et al, 2020) and is considerably fast and simple to use.

The RS4;11 cell model aimed to assess the mechanisms of action the of Wee1 inhibitor AZD1775 treatment on leukemic cells. The scATAC-seq data of the RS4;11 samples have already been analysed in our group before with another scATAC-seq analysis tool, Signac.

The basis of Signac is in LSI, an another common computational method used for scATAC- seq data analysis in addition to LDA, which utilizes combination of TF-IDF and SVD (Cusanovich et al, 2015; Stuart et al, 2019),. In this project the RS4;11 datasets (mainly T7 Wee1i) were used as control to assess the functionality of the workflow and the performance of the analysis tools used. Wee1 is a tyrosine kinase, which acts as a gatekeeper of the G2/M cell cycle check point by preventing entry into mitosis in response to DNA damage. Wee1 acts by phosphorylation of CDK1 (cyclin dependent kinase 1) which maintains CDK1 in inactive state, thus protecting nucleus from CDK1 and preventing cell entering M phase.

Overexpression of Wee1 has been observed in several cancers, including leukemia, depending on a functional G2/M cell cycle phase for DNA repair. Wee1 helps the cells avoid delayed mitosis-linked cell death also known as mitotic catastrophe (Matheson et al, 2016).

Wee1 is highly expressed in cancers dependent on G2/M checkpoint for DNA repair as they

Viittaukset

LIITTYVÄT TIEDOSTOT

nustekijänä laskentatoimessaan ja hinnoittelussaan vaihtoehtoisen kustannuksen hintaa (esim. päästöoikeuden myyntihinta markkinoilla), jolloin myös ilmaiseksi saatujen

Ydinvoimateollisuudessa on aina käytetty alihankkijoita ja urakoitsijoita. Esimerkiksi laitosten rakentamisen aikana suuri osa työstä tehdään urakoitsijoiden, erityisesti

Hä- tähinaukseen kykenevien alusten ja niiden sijoituspaikkojen selvittämi- seksi tulee keskustella myös Itäme- ren ympärysvaltioiden merenkulku- viranomaisten kanssa.. ■

Mansikan kauppakestävyyden parantaminen -tutkimushankkeessa kesän 1995 kokeissa erot jäähdytettyjen ja jäähdyttämättömien mansikoiden vaurioitumisessa kuljetusta

Jätevesien ja käytettyjen prosessikylpyjen sisältämä syanidi voidaan hapettaa kemikaa- lien lisäksi myös esimerkiksi otsonilla.. Otsoni on vahva hapetin (ks. taulukko 11),

Tornin värähtelyt ovat kasvaneet jäätyneessä tilanteessa sekä ominaistaajuudella että 1P- taajuudella erittäin voimakkaiksi 1P muutos aiheutunee roottorin massaepätasapainosta,

Länsi-Euroopan maiden, Japanin, Yhdysvaltojen ja Kanadan paperin ja kartongin tuotantomäärät, kerätyn paperin määrä ja kulutus, keräyspaperin tuonti ja vienti sekä keräys-

29 With the help of an inducible E/R cell model and GRO-seq, we explored dynamics of gene expression and the activity of their regulatory elements simultaneously, exposing