2.2.1 Genome analysis

The most typical application of NGS is the characterisation of genomes, which can be divided into two main categories: de novo sequencing and re-sequencing. De novo sequencing is typically used to sequence a previously uncharacterised organism, or one with a small genome, in order to assemble its genome. Re-sequencing is commonly used to sequence an organism with a known reference genome in order to characterise variation in that genome. Genome sequencing methods can be further categorised based on whether the full genome is sequenced (whole genome sequencing, WGS) or specific portions of the genome are captured for analysis (targeted sequencing and whole exome sequencing).

2.2.1.1 Targeted sequencing

Targeted sequencing is an application of NGS in which only selected regions of the genome are sequenced. The target regions are first captured and then fragmented for library preparation. There are several methods for target capture, which can be divided into three main categories: hybrid capture, selective circularisation and PCR amplification. In hybrid capture, the target regions are captured by hybridisation with complementary nucleic acid sequences, also known as probes, either in solution or on a solid support. Selective circularisation involves single-stranded probes that contain a stretch of universal sequence flanked by target-specific sequences. The target-specific sequences are complementary to the sequences flanking the target genomic site and hybridise with these regions during the capture. Subsequently, the gap between the target-specific sequences is closed by a gap-filling reaction, and ligation of the loose ends finally results in circular nucleic acid molecules containing the regions of interest. In PCR amplification capture, PCR is used to selectively amplify the target regions using primers complementary to the flanking regions of the target (Mertes et al. 2011).

Targeted sequencing is more cost effective in comparison to WGS and thus used for studies in which only specific regions are of interest. The capturing methods differ in many respects such as the maximum size of the target region which can be captured, required amount of input DNA, the enrichment of reads obtained from

based on the above-mentioned parameters enables usage of targeted sequencing in a wide variety of applications, targeted sequencing has its shortcomings. The major issue is the relative unevenness of read coverage across the target regions, which can cause difficulties in downstream bioinformatics analyses such as variant calling (Mertes et al. 2011).
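The unevenness of coverage mentioned above can be quantified in several ways. The following is a minimal sketch of one simple measure, the share of target bases covered at a given fraction of the mean depth. The function name, the 0.2 threshold and the example depths are illustrative assumptions and do not come from Mertes et al. (2011).

```python
from statistics import mean


def coverage_uniformity(depths, fraction=0.2):
    """Share of target bases with depth >= fraction * mean depth.

    `depths` is a list of per-base read depths over the target regions.
    The 0.2 default is an illustrative assumption, not a standard value.
    """
    if not depths:
        raise ValueError("depths must not be empty")
    avg = mean(depths)
    if avg == 0:
        return 0.0
    threshold = fraction * avg
    return sum(d >= threshold for d in depths) / len(depths)


# Two hypothetical targets with the same mean depth (30x).
even = [30, 28, 32, 31, 29, 30]
uneven = [90, 85, 2, 1, 1, 1]

print(coverage_uniformity(even))    # all bases pass the threshold
print(coverage_uniformity(uneven))  # only the two deeply covered bases pass
```

Both example targets have the same mean depth, yet most bases of the second one fall below the threshold. Positions like these are where a variant caller would have too few reads to make a reliable call.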

2.2.1.2 Whole-exome sequencing

Whole exome sequencing (WES) is a special case of targeted sequencing in which the target consists of exonic regions. The exonic regions are captured using the hybrid capture approach described previously. The most widely used methods for exome capture are sold as commercial kits, including SureSelect (Agilent), TruSeq Capture (Illumina) and SeqCap EZ (Roche NimbleGen), but custom methods have also been developed and applied (García-García et al. 2016).

WES is faster and more cost effective than WGS, although the price gap has narrowed drastically in recent years (Hayden 2014). WES is therefore more scalable, which allows more samples to be sequenced and thus improves statistical power. In addition, the amount of data produced is much smaller, which makes the data more manageable and further reduces expenses by limiting the computational infrastructure required for the analysis.

Whole exome sequencing is, however, limited in that it requires a well-annotated organism in order to design the probes for the exonic regions. Other disadvantages of WES include less uniform coverage and a more pronounced allele distribution bias compared to WGS, resulting in less accurate variant and genotype calls (Lelieveld et al. 2015). Finally, the most obvious disadvantage of WES compared to WGS is that regions outside the exons, such as regulatory regions, cannot be studied.

Despite its shortcomings, WES is a widely used sequencing application because of its cost effectiveness. In general, WES is best suited for large-scale population studies of the exonic regions of well-characterised species. Studies of human Mendelian diseases are one such example, since the variants associated with the disease phenotype are known to occur mostly in the exonic regions (Bamshad et al. 2011).

2.2.2 Transcriptome analysis

Transcriptome analysis involves the characterisation of the sequences being transcribed by an organism including both coding transcripts, such as mRNAs, and non-coding transcripts such as lincRNAs, snoRNAs and miRNAs to name a few.

Perhaps one of the most commonly utilised techniques for studying the transcriptome is RNA-seq. It can be used to study sufficiently long transcripts while short transcripts (< 200 bp) can be studied using other specialised methods such as small RNA-seq.

2.2.2.1 RNA-seq

RNA-seq enables the profiling of the entire transcriptome including protein coding genes as well as non-coding transcripts such as lincRNAs and repetitive elements.

Compared to the earlier array-based methods no prior knowledge of the transcriptome is required, which allows detection of novel isoforms and non-coding transcripts which are transcribed only in specific tissues or conditions. Furthermore, it is possible to assemble the whole transcriptome for organisms for which a reference genome has not yet been constructed. Other advantages over the previous array-based technologies include a higher dynamic range of detected expression levels as well as more accurate estimation of the abundance of different transcript isoforms (Wang, Gerstein, and Snyder 2009).

In RNA-seq the total RNA content is first extracted from the sample followed by removal of ribosomal RNA (rRNA) using either poly-A capture or rRNA depletion. The purified RNA is then converted to cDNA. Sequencing templates are then prepared by adding adaptor sequences to either one or both ends of the fragments depending on the library preparation strategy. Subsequently, the templates undergo the standard sequencing steps required by the sequencing platform being used (Wang, Gerstein, and Snyder 2009).

RNA-seq is one of the most popular sequencing applications, as it provides a very comprehensive view of the transcriptome. However, a major drawback of traditional RNA-seq is that it cannot reveal heterogeneity within a sequenced sample, which can contain multiple cell types. Instead, the abundance estimates obtained for the transcripts reflect the average abundances over the whole population of cells. Recent developments in sequencing technologies have made it possible to study the transcriptomic profile at the level of a single cell. This technology is referred to as single cell RNA-seq (scRNA-seq), and it is now widely used to address research questions that are beyond the capabilities of traditional bulk RNA-seq (Saliba et al. 2014).
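The averaging effect described above can be illustrated with a small toy example. The sketch below uses made-up expression values and hypothetical gene names, and it assumes that a bulk measurement behaves like a plain mean over cells, which ignores differences in RNA content and capture efficiency between cells.

```python
def bulk_average(cells):
    """Average per-gene expression over a list of cells.

    Each cell is a dict mapping a gene name to an expression value.
    This is a simplified model of what a bulk measurement reports.
    """
    genes = cells[0].keys()
    return {g: sum(c[g] for c in cells) / len(cells) for g in genes}


# Hypothetical sample A: two distinct cell types in equal proportions.
mixed = [{"geneA": 100, "geneB": 0}] * 5 + [{"geneA": 0, "geneB": 100}] * 5

# Hypothetical sample B: one homogeneous cell type.
uniform = [{"geneA": 50, "geneB": 50}] * 10

print(bulk_average(mixed))
print(bulk_average(uniform))
```

Under this simplification the two samples give identical bulk profiles, although the first contains two clearly different cell populations. Only per-cell measurements can tell them apart.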

2.2.3 Epigenome analysis

The epigenome comprises all chemical modifications of DNA and histones, which together make up the chromatin. Numerous sequencing techniques have been developed to study the different aspects of epigenetic modification in the genome. For studying methylation, techniques such as bisulfite sequencing (BS-seq) and methylated DNA immunoprecipitation sequencing (MeDIP-seq) have been developed. Histone modifications can be studied using techniques such as ChIP-seq, whereas the overall accessibility of the genome can be investigated using DNase-seq and ATAC-seq.

2.2.3.1 DNase-seq

Nuclear DNA is organised into a structure called chromatin. Its basic unit is the nucleosome, which consists of approximately 146 bp of DNA wrapped around a histone octamer. The organisation of DNA into nucleosomes primarily serves to condense the chromatin so that it fits inside the nucleus. In addition, the density of the packaging, which varies across the genome, plays a significant role in gene regulation. The densely packed regions, referred to as heterochromatin, are generally inaccessible to the transcriptional machinery, and thus genes located within these regions are not expressed. In contrast, less densely packed genomic regions are known as euchromatin. These regions are more accessible to proteins involved in gene regulation and transcription, and therefore genes located within them are often actively expressed. The chromatin structure is dynamic and is regulated by changes in the histone composition of the octamers and by different post-translational modifications of the histone tails (Valencia and Kadoch 2019).

DNase I is an endonuclease that digests double-stranded DNA by preferentially cleaving the phosphodiester bonds adjacent to pyrimidine nucleotides.

Nucleosome-depleted genomic regions are sensitive to digestion because the chromatin is exposed and allows the binding of DNase I (Weintraub and Groudine 1976). These regions are generally referred to as DNase I hypersensitive sites (DHS). In contrast, regions tightly packed into nucleosomes and other higher order structures are highly resistant to digestion (Elgin 1981). Because open chromatin regions are accessible to various regulatory proteins, these regions are likely to harbour active genetic regulatory sites, including promoters, enhancers, silencers, insulators and locus control regions. Since these regions often coincide with DHS sites, methods for capturing these sites have been developed since the late 1970s (Song and Crawford 2010).

The early methods used to study DHS sites suffered from low throughput, and their application was therefore limited. Since the development of NGS technologies, however, the old methods for capturing DHS sites have been coupled with the new sequencing techniques to create novel sequencing applications. One of these is DNase-seq, which can be used to characterise DHS sites across the whole genome (Song and Crawford 2010).

Moreover, this technique allows the detection of transcription factor binding sites (TFBS) within DHS regions. This is possible because transcription factors (TF), similarly to nucleosomes, protect the DNA from digestion at the genomic site where they are bound. In the data, this appears as locally lowered accessibility within DHS sites, also referred to as TF footprints (Kaplan et al. 2009).
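The idea of a footprint as a local dip in cleavage within an otherwise accessible region can be sketched as follows. This is a deliberately naive score under stated assumptions: the function name, the flank width and the cut counts are hypothetical, and the score is not the method of Kaplan et al. (2009) or of any published footprinting tool.

```python
def footprint_score(cuts, start, end, flank=3):
    """Mean DNase I cut count in the flanks minus the mean in [start, end).

    `cuts` holds per-base cut counts across a DHS. A large positive score
    suggests a protected stretch surrounded by accessible DNA.
    """
    if start >= end or start - flank < 0 or end + flank > len(cuts):
        raise ValueError("window and flanks must fit inside the region")
    center = cuts[start:end]
    flanks = cuts[start - flank:start] + cuts[end:end + flank]
    return sum(flanks) / len(flanks) - sum(center) / len(center)


# Hypothetical per-base cut counts: high on both sides, low in the middle.
cuts = [8, 9, 10, 1, 0, 1, 0, 9, 10, 8]

print(footprint_score(cuts, start=3, end=7, flank=3))  # 8.5
```

Dedicated footprinting methods additionally have to account for the sequence preference of DNase I and for differences in sequencing depth, which this sketch ignores.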

There are two widely used protocols for DNase-seq. In both protocols the cells are first lysed to release the nuclei, after which the genomic DNA is digested using the endonuclease DNase I. In the “double hit” protocol, small fragments of 50–100 bp are selected using gel electrophoresis, whereas in the “end-capture” protocol the ends of all DNA fragments are ligated to specially designed linkers, followed by MmeI digestion yielding 20 bp tags. Depending on the protocol, the fragments or tags are then sequenced using standard NGS protocols (Sabo et al. 2006; Song and Crawford 2010).
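The difference between the two protocols can be mimicked in silico. The sketch below is only an illustration of what each protocol retains from a digested fragment: the function names and toy sequences are hypothetical, and the real selection happens at the bench through gel electrophoresis and MmeI digestion rather than in software.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def select_double_hit(fragments, min_len=50, max_len=100):
    """Keep fragments in the 50-100 bp range, as in the double hit protocol."""
    return [f for f in fragments if min_len <= len(f) <= max_len]


def end_capture_tags(fragment, tag_len=20):
    """Return the 20 bp tag adjacent to each DNase I cut end.

    The second tag is reported on the reverse strand, reading inwards
    from the right-hand end of the fragment.
    """
    if len(fragment) < tag_len:
        raise ValueError("fragment is shorter than the tag length")
    left = fragment[:tag_len]
    right = fragment[-tag_len:].translate(COMPLEMENT)[::-1]
    return left, right


# Toy fragments of 60 bp, 20 bp and 150 bp.
frag = "A" * 30 + "C" * 30
fragments = [frag, "ACGT" * 5, "A" * 150]

print(len(select_double_hit(fragments)))  # only the 60 bp fragment remains
print(end_capture_tags(frag))
```

In the double hit case the whole short fragment is sequenced, while in the end-capture case only the tags next to the cut sites carry information, so each tag marks the position of a single cleavage event.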

DNase-seq has proven to be a valuable tool in the ENCODE project for characterising the regulatory elements in various cell lines (Dunham et al. 2012).

However, this technique requires large amounts of input DNA, because much of the DNA is lost during the purification steps. This limits its usefulness, especially when studying clinical samples (Sabo et al. 2006; Song and Crawford 2010). Recently, single cell DNase-seq, also known as Pico-seq, has been developed. It requires less input DNA and can be used to study heterogeneity within samples (Jin et al. 2015).

Moreover, during the past decade, other sequencing technologies designed for mapping DHS sites have been developed, including ATAC-seq, FAIRE-seq, MNase-seq and NicE-seq. ATAC-seq in particular has gained popularity because of the low amount of input DNA required and the possibility to study nucleosome displacement at high resolution (Chang et al. 2018).