EPCLUST - Simultaneous discovery of patterns and subfamilies

6.3 Simultaneous discovery of patterns and subfamilies

7.1.2 EPCLUST

EPCLUST (Expression Profile data CLUSTering and analysis) is the module for gene expression data matrix analysis, including data selection and filtering, clus-tering, visualization, and similarity searches. Additionally, it supports missing data imputation, data randomization, and data rescaling.

EPCLUST provides a folder-based analysis environment to operate with data sets in server-side folders. The folders serve as secure data authoring, analysis and publishing environments with multi-user access capability over the web, allowing distributed collaboration.

Raw data may be uploaded to EPCLUST in a number of different formats, either from the users’ own computers or from gene expression databases available on the web. These raw data then should be filtered to create an analysis data set.

Upon the transfer of a large data set to an EPCLUST server, the user will often select only part of the data for a particular purpose. For example, after uploading the data set of the expression of an entire genome in a time-course experiment

122 7 SOFTWARE TOOLS

with numerous measurements, the user may desire to focus on specific time-points or experiments, and only on those genes that have shown significant differential expression under certain specified conditions.

The rows and columns of data have tags by which they can be identified. The row annotations are read from a file that provides a description for every row ID.

These descriptions may contain HTML markup, to link to corresponding entries in a database, or to highlight the key biological roles known for the genes. The annotations will be shown together with the clustering results, facilitating data interpretation, if the annotations were created carefully and with detail.

The prepared data files are stored in data folders, where various analysis can be performed. The intermediate (e.g. the all against all distance matrices) as well as the final analysis results (clustering results and visualizations) are also stored in the same folders, where they can be viewed later without having to redo the same analysis steps.

Distance measures

Cluster analysis is a type of classification algorithm that is intended to orga-nize a collection of objects into meaningful structures. This meaning is given by the relative proximity of objects within one cluster in comparison to objects in other clusters. There exist many different distance measures for the comparison of expression profiles. These measures are grouped in EPCLUST on the basis of their general properties.

The first group consists of the standard metric distances: the Euclidean dis-tance, the Manhattan disdis-tance, and the average distance (Euclidean distance that has been normalized by the vector length).

The second group consists of distances that essentially measure the correlation (or angle) between vectors. Note that these distances capture the idea that the direction and relative intensity of the values are what is important, and not the absolute values and magnitudes of the change. This corresponds well with the tasks of expression analysis where the absolute change can be meaningless, while the direction and relative scale of the change are crucial.

The third group is based on measures that work on ranked expression vectors instead of on the original values (rank correlation), or measures that are applicable only to discrete (e.g. binary vectors) data.

Clustering methods

We implemented fast versions (see Table 7.3) of standard agglomerative (bot-tom up) hierarchical clustering methods and partitioning-based K-means methods, and incorporated them into the web interface with extensive visualization options.

The hierarchical clustering method requires that all pairwise distances are cal-culated first. For longer vectors, e.g. for many hybridizations analyzed simulta-neously, this can be the most time-consuming step of the analysis. The distance

7.1 Expression Profiler 123

Vector length 10 10 100 100 1000 1000

Number clustering distances clustering distances clustering distances

10 0.00 0.00 0.00 0.00 0.00 0.00

100 0.00 0.00 0.00 0.00 0.00 0.03

500 0.05 0.02 0.05 0.09 0.05 0.78

1000 0.19 0.08 0.20 0.35 0.21 3.12

2000 0.82 0.31 0.87 1.40 0.91 12.57

3000 2.41 0.71 2.60 3.17 2.67 28.46

4000 4.77 1.26 5.15 5.65 5.34 50.61

5000 8.02 1.95 8.70 8.81 9.04 79.58

6000 12.28 2.79 13.23 13.42 13.85 114.64

7000 17.08 3.80 18.45 17.94 19.43 156.53

8000 22.78 5.31 24.76 23.75 26.29 204.28

9000 31.52 6.53 34.87 31.26 34.97 270.30

10000 40.03 8.16 43.02 35.80 43.74 319.52

15000 96.46 27.03 101.33 82.23 108.99 719.68 20000 195.09 51.92 194.81 148.97 201.31 1279.05

Table 7.3: Times of all-against-all distance comparisons and hierarchical cluster-ing for up to 20,000 vectors of length 10, 100, and 1000 numeric attributes. Times are in seconds, as measured on an Alpha ES40 ev64 server on a single processor.

matrices are reused by different clustering methods. The hierarchical clustering methods include:

complete linkage using maximum distance between members of two clusters for defining the distance between clusters;

average linkage using weighted and unweighted group-wise average methods WPGMA and UPGMA;

single linkage using minimum distance;

The K-means clustering method requires users to provide the number of clus-ters as well as a method for choosing the initial cluster cenclus-ters. Starting from po-tential cluster centers the K-means procedure assigns all genes to their respective clusters (the cluster to whose center they are closest) and refines cluster centers to be the geometric centers of gravity for each defined cluster. This process is repeated until the clustering stabilizes or the number of allowed cycles is reached.

Other clustering methods are currently being added to EPCLUST. These in-clude the linear-time top-down divisive hierarchical clustering method SOTA (Self Organizing Trees (Herrero, Valencia, & Dopazo 2001)), and some variants of the standard Self Organising Map (SOM) algorithms (Kohonen 1997). For extensive coverage of clustering methods see, for example, (Legendre & Legendre 1998;

Everitt, Landau, & Leese 2001).

124 7 SOFTWARE TOOLS

Similarity searches

Similarity searches provide users with alternative straightforward ways to study gene expression without requiring to cluster the whole data set first.

The user can select genes of interest by typing in their IDs or by querying from data set annotations and then use these genes to query the entire data set for similar ones.

Genes whose expression profiles are similar under a chosen distance measure (a variety are available) are reported. To reduce the size of the meaningful an-swer, users can set cut-off thresholds for the number of genes to be displayed or for the maximal distance that is relevant to their query (e.g. the estimate of the significance threshold for a particular data set (Kruglyak & Tang 2001)).

There exists also an “opposite” search, in the sense that only the farthest dis-tance genes are displayed in the result set. This way the user can study the genes that are anti-correlated with the gene of interest.

The similarity search can be performed either on one gene at a time, or on a set of genes simultaneously. If many genes are used in the search together, the results can be displayed either for each one individually, or as a combined result merging all the individual results. This way, for example, starting from a tight cluster of co-expressed genes, one can search for other genes that have similar profiles to any of the original ones. This can be seen as generating a “supercluster”.

Visualization methods

The results of the cluster analysis and similarity searches are presented visu-ally in the heat-map format, made popular in the microarray community by Mike Eisen (Eisen et al. 1998). The results are presented in PNG (Portable Network Graphics) or GIF formats, with clickable regions for drilling into the expression data - e.g. exploring subtrees in a hierarchical clustering.

In document Pattern Discovery from Biosequences (sivua 128-131)