
Gene expression data analysis and putative transcription

In document Pattern Discovery from Biosequences (pages 95-98)


6.1.2 Gene expression data analysis and putative transcription

Microarray technologies for measuring mRNA abundances in cells allow monitoring of gene expression levels for tens of thousands of genes in parallel. By measuring expression responses across hundreds of different conditions or time points, a relatively detailed gene expression map starts to emerge.

A collection of gene expression level measurements taken under various experimental conditions, by microarray or any other technology, defines the expression profiles of the respective genes. There are many surveys of the technology and its analysis in general (The Chipping Forecast 1999; Brazma & Vilo 2000; Celis et al. 2000; Hegde et al. 2000; Dopazo et al. 2001; Quackenbush 2001). The simple PubMed query “review microarray analysis” returned 72 Medline articles in April 2001 and 117 in August 2001.

Figure 6.2: The distribution of all patterns (of unrestricted length) with at most one wildcard symbol in the regions -250..-150 (upstream from the ORFs) and in randomly chosen genomic regions of length 100 bp. Dots in the left column correspond to patterns that occur in x sequences from the random regions (horizontal axis) and y sequences from the upstream regions (vertical axis). In the right column the upstream regions are replaced by another set of random regions, so these plots show the expected statistics if the regions are chosen at random. Top row: all patterns with at least 10 occurrences. Second row: the subset of the patterns in the top row containing at least two characters C or G and not containing any of the substrings AAAA, TTTT, ATAT, or TATA. Bottom row: the same plot as in the second row, but including only patterns with at most 200 occurrences in upstream or random regions (i.e., zoomed to the lower left corner).

It seems reasonable to hypothesize that genes with similar expression profiles, i.e., genes that are coexpressed, may share something in common in their regulatory mechanisms, i.e., may be coregulated. Therefore, by clustering together genes with similar expression profiles we find groups of potentially coregulated genes, which allows one to search for putative regulatory signals.
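The idea of grouping coexpressed genes can be illustrated with a minimal sketch. It assumes Pearson correlation as the similarity measure and a simple single-linkage grouping with a fixed threshold; neither choice is taken from the text, and the gene names and expression values are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a flat profile carries no shape information
    return cov / (sx * sy)

def coexpressed_groups(profiles, threshold=0.9):
    """Single-linkage grouping: two genes end up in the same group when a
    chain of pairwise correlations >= threshold connects them.
    The threshold value is an assumption made for this illustration."""
    genes = sorted(profiles)
    parent = {g: g for g in genes}

    def find(g):
        # Union-find lookup with path halving.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for i, a in enumerate(genes):
        for b in genes[i + 1:]:
            if pearson(profiles[a], profiles[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for g in genes:
        groups.setdefault(find(g), []).append(g)
    return sorted(groups.values())

# Hypothetical expression levels at four time points.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.0, 6.0, 4.0, 1.5],
}
groups = coexpressed_groups(profiles)
print(groups)
```

Each resulting group is a candidate set of coregulated genes whose upstream regions could then be searched for shared patterns.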

The first whole-genome microarray gene expression data set published was a diauxic shift experiment performed on the yeast Saccharomyces cerevisiae, where expression levels for all genes during a metabolic shift from fermentation to respiration due to glucose starvation were measured in two-hour intervals (DeRisi, Iyer, & Brown 1997). The authors identified several distinct clusters in the gene expression profiles, and were able to show the presence of several previously characterized transcription factor binding sites (for example the stress responsive element CCCCT) located upstream of many of the genes in those clusters. Fascinated by these results, many researchers asked the following quite natural question: “can we identify novel putative binding sites automatically by combining gene expression data clustering and sequence pattern discovery methods?”

The same diauxic shift data set was soon analyzed by several other groups (van Helden, André, & Collado-Vides 1998; Brazma et al. 1998b). Van Helden et al. (van Helden, André, & Collado-Vides 1998) searched for oligonucleotides overrepresented upstream of potentially coregulated genes (the clusters from the paper of DeRisi, Iyer, & Brown 1997) and showed that potential new transcription factor binding sites can be found in this way.
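A search for overrepresented oligonucleotides can be sketched roughly as follows. This is not the statistic used by van Helden et al.; it assumes a plain ratio of the fractions of sequences containing each k-mer, with a pseudocount on the background side, and the sequences are made up for the illustration.

```python
from collections import Counter

def kmer_sequence_counts(sequences, k):
    """For each k-mer, count the sequences that contain it at least once."""
    counts = Counter()
    for seq in sequences:
        # A set ensures each sequence contributes at most once per k-mer.
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts

def overrepresented(cluster, background, k, min_ratio=2.0):
    """Return (k-mer, ratio) pairs whose cluster fraction exceeds the
    background fraction by at least min_ratio (an assumed cutoff)."""
    in_cluster = kmer_sequence_counts(cluster, k)
    in_background = kmer_sequence_counts(background, k)
    hits = []
    for kmer, n in in_cluster.items():
        frac_cluster = n / len(cluster)
        # Pseudocount avoids division by zero for k-mers absent from the background.
        frac_background = (in_background[kmer] + 1) / (len(background) + 1)
        ratio = frac_cluster / frac_background
        if ratio >= min_ratio:
            hits.append((kmer, round(ratio, 2)))
    return sorted(hits, key=lambda h: (-h[1], h[0]))

# Toy upstream sequences (hypothetical).
cluster = ["ACCCCTA", "TTCCCCT", "GCCCCTG"]
background = ["ACGTACG", "TTGACCA", "GGATCCA", "CATGCAT"]
hits = overrepresented(cluster, background, k=5)
print(hits)
```

In real data the background would be the upstream regions of all other genes, and a proper significance test would replace the fixed ratio cutoff.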

We used the information about the gene expression profiles to extract smaller sets of genes that potentially share similar regulation mechanisms and maybe also transcription factor binding sites (Brazma et al. 1998b). We used the data from the yeast gene expression studies reported in (DeRisi, Iyer, & Brown 1997)¹ and clustered the genes by similarities in their expression profiles in several alternative ways. These clusters were used for the discovery of patterns “characteristic” to the upstream regions of these clusters, i.e., patterns $\pi$ with a high rating $R_t(\pi, S_+, S_-)$, where $S_+$ are the sequences from the cluster and $S_-$ are the other upstream sequences. Some of the patterns with a high rating $R_t(\pi, S_+, S_-)$ were already known transcription factor binding sites. Some examples of the discovered patterns are CCCCT, matching 64% (35 out of 55) of the sequences in the respective cluster and 21% (1280 out of 5921) of the remaining upstream regions (and thus getting a score of 2.95), C..CCC.T (score 2.88), T.C..CCC (score 2.85), and T.AGGG (score 2.27). Moreover, it was shown that many of these binding sites could not be discovered from the global comparison of all upstream sequences against random genomic regions.

More expression studies have since been carried out under various conditions (e.g. sporulation (Chu et al. 1998) and cell cycle studies (Cho et al. 1998; Spellman et al. 1998)), and the amount of expression data is increasing rapidly. Simultaneously, pattern discovery methods have been developed and applied. Surveys of these studies have appeared (Zhang 1999; Vilo & Kivinen 2001; Ohler & Niemann 2001; Werner 2001).

¹ J. F. DeRisi et al. studied the relative expression rate changes of all (over 6000) genes of yeast during the diauxic shift from anaerobic (fermentation) to aerobic (respiration) metabolism.

As noted by several authors, the task of identifying promoter sequences can be very difficult (Vanet, Marsan, & Sagot 1999; Sinha & Tompa 2000; Vilo et al. 2000). The reasons include the uncertainty in promoter region prediction, the noise level in microarray expression measurements, the question about what the appropriate motif description language is, and the algorithmic problems of identifying subtle signals from sets of sequences that do not even need to share the same motifs. Algorithms used for transcription factor binding site prediction may have to detect only marginally overrepresented patterns in sets of hundreds or thousands of sequences of length of thousands of nucleotides (Vilo et al. 2000).
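To see why marginally overrepresented patterns are hard to detect, one can compare how surprising different occurrence counts are under a simple null model. The sketch below assumes that each cluster sequence matches a pattern independently with the background probability, so the number of matches follows a binomial distribution; this model is an assumption for illustration, not the method of the cited works. The count of 15 matches is hypothetical.

```python
from math import comb

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the chance of seeing at least k
    matching sequences among n if each matches with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Background match probability for CCCCT, from the counts quoted earlier.
p_background = 1280 / 5921

# Strong signal: 35 of 55 cluster sequences match.
strong = binomial_tail(35, 55, p_background)

# Marginal signal (hypothetical): only 15 of 55 cluster sequences match.
marginal = binomial_tail(15, 55, p_background)

print(strong, marginal)
```

Under this model the strong signal is extremely unlikely to arise by chance, while the marginal one is quite compatible with the background, and with thousands of candidate patterns tested at once such marginal cases are easily lost among chance fluctuations.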

Transcription factor binding sites usually do not act alone. It is assumed that genome-wide control is achieved by the combinatorial use of multiple sequence elements (Werner 1999).

The aim of our studies is to mine automatically for new, statistically significant patterns in putative regulatory regions of genes. Such data mining experiments are not a substitute for “conventional single-gene dissections” (Zhang 1999). Their aim is instead to explore simultaneously thousands of genes in silico (which cannot be done by conventional methods) to generate targets for conventional studies in vitro.
