Bioinformatics refers to the study of biological data using methods from mathematics, statistics, and computer science. In particular, functional genomics experiments produce massive amounts of high-dimensional data that need to be analyzed and understood [7, 8]. Recently developed methods such as oligonucleotide arrays and DNA chips (microarrays) can be used to measure the activity or expression level of each gene in a genome as a function of time in various experimental settings.
The need for data-driven methods for generating hypotheses about gene function is already acknowledged in the functional genomics community (see for example [11]). Exploratory methods for information visualization, developed in the Neural Networks Research Centre, would provide precisely the required tools in this first stage of analysis. Methods based on the learning metrics principle (Section 12), for their part, are needed to focus the analyses on the important parts of the very high-dimensional and noisy data, by finding the dependencies between the gene expression data, known functional classes of genes, and other kinds of biological databases including the gene sequences and properties of the corresponding proteins.
The project is carried out in collaboration with experts on the biological problem, and with the other bioinformatics group of the laboratory led by Prof. Mannila and Prof. Hollmén. The studies were started in late 2000 in collaboration with a group at the University of Kuopio led by Prof. Eero Castrén.
We started with exploratory data analysis of gene expression [3, 4, 9]. The Self-Organizing Map (SOM) is particularly useful in this first stage of data analysis. The SOM constructs a nonlinear projection of the data to a map display which can be used for visualizing similarity relationships and cluster structures, with methods developed in the Neural Networks Research Centre. Such a combination of non-parametric clustering and visualization distinguishes the SOM from the many clustering methods commonly applied to gene data.
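As a rough illustration of the projection described above, the following is a minimal sketch of online SOM training in plain Python. It is not the implementation used in these studies: the rectangular grid, the Gaussian neighbourhood, the linear decay schedules, and the names `train_som`, `best_matching_unit`, and `toy` are all assumptions made for the example.

```python
import math
import random


def best_matching_unit(weights, x):
    """Index of the map unit whose model vector is closest to x (Euclidean)."""
    return min(
        range(len(weights)),
        key=lambda k: sum((wi - xi) ** 2 for wi, xi in zip(weights[k], x)),
    )


def train_som(data, rows, cols, n_iter=2000, lr0=0.5, sigma0=None, seed=0):
    """Train a small SOM on a rectangular grid with the online update rule."""
    rng = random.Random(seed)
    dim = len(data[0])
    if sigma0 is None:
        sigma0 = max(rows, cols) / 2.0
    # Grid coordinates of the map units, in row-major order.
    units = [(r, c) for r in range(rows) for c in range(cols)]
    # Initialise each model vector as a copy of a randomly chosen sample.
    weights = [list(rng.choice(data)) for _ in units]
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                  # decaying learning rate
        sigma = max(sigma0 * (1.0 - frac), 0.5)  # shrinking neighbourhood width
        x = rng.choice(data)
        br, bc = units[best_matching_unit(weights, x)]
        for k, (r, c) in enumerate(units):
            grid_d2 = (r - br) ** 2 + (c - bc) ** 2
            h = math.exp(-grid_d2 / (2.0 * sigma ** 2))  # Gaussian neighbourhood
            w = weights[k]
            for j in range(dim):
                # Move the model vector towards the sample, scaled by lr and h.
                w[j] += lr * h * (x[j] - w[j])
    return units, weights


# Toy "expression profiles" (made-up numbers): two well-separated groups.
toy = [[0.1 * i, 0.0, 0.1 * i] for i in range(5)] + \
      [[5.0 + 0.1 * i, 5.0, 5.0 - 0.1 * i] for i in range(5)]
units, weights = train_som(toy, rows=3, cols=3, n_iter=500)
```

After training, each profile is assigned to its best-matching unit, and profiles that land on the same or neighbouring units can be read as similar on the map display.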
The same SOM display can be used for visualizing the relationships between data sets,
such as the gene expression and the functional classes of the genes below.
The genes of the yeast Saccharomyces cerevisiae were first clustered based on their expression in a set of different conditions and treatments such as the diauxic shift and a heat shock (in a public-domain data set). Visualizations of the cluster structure and the relationships of the clusters to the functional classes of the genes were constructed, and the clusters were interpreted in terms of the functional classes of the genes and their activity in the treatments. Figure 11.1 shows, for instance, that most genes related to cytoplasmic degradation form a cluster and hence are expressed similarly in the set of treatments. We demonstrated how to use SOMs in the exploratory task and proposed new hypotheses on the relationships between some functional classes.
Figure 11.1: a: SOM-based visualization of the cluster structure in yeast gene expression data (from [1]). Light shades: clusters; dark shades: sparser areas or gaps in between clusters. The dots denote map units. Note that in this display there is a hexagon in between each pair of map units, whereas in panels b and c only the map units themselves are shown. The same SOM is shown in all panels. b: Distribution of the functional class 'cytoplasmic degradation' (87 genes) on the SOM. c: Distribution of the functional class 'sugar and carbohydrate transporters' (32 genes).
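The class distributions in panels b and c amount to counting, for each map unit, how many genes of a given functional class are mapped to it. A minimal sketch of that bookkeeping follows; the function name `class_distribution`, the gene identifiers, and the toy assignments are hypothetical and are not taken from the yeast data.

```python
from collections import Counter


def class_distribution(bmu_of_gene, classes_of_gene, target_class, n_units):
    """Count, for each map unit, the genes of target_class mapped to it."""
    hits = Counter(
        unit
        for gene, unit in bmu_of_gene.items()
        if target_class in classes_of_gene.get(gene, ())
    )
    return [hits.get(u, 0) for u in range(n_units)]


# Hypothetical example: best-matching unit index per gene on a 4-unit map.
bmu_of_gene = {"geneA": 0, "geneB": 0, "geneC": 3, "geneD": 1}
# A gene may belong to several functional classes, or to none that are known.
classes_of_gene = {
    "geneA": {"class1"},
    "geneB": {"class1", "class2"},
    "geneC": {"class2"},
}
class1_counts = class_distribution(bmu_of_gene, classes_of_gene, "class1", 4)
```

A class whose counts concentrate on a few neighbouring units forms a cluster on the map, which is the pattern the caption describes for cytoplasmic degradation.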
One of the ultimate goals in functional genetics is understanding the regulatory pathways of the genes. This is the key to understanding the dependencies between the genes and to controlling the processes within the cell. The pathways have been studied by "knocking out" one gene at a time by mutations and inspecting the effects on gene expression. If the mutated gene is a vital part of some pathway, then the whole pathway will be blocked. We would thus expect to see all the critical genes in one pathway cluster together because they have similar effects on the expression of all the other genes in the cell.
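The expectation above can be made concrete by comparing the expression profiles of mutated strains pairwise. The sketch below uses Pearson correlation as the similarity measure, which is an assumption for illustration (the text does not name the measure used); the strain names and numbers are made up.

```python
import math


def pearson(u, v):
    """Pearson correlation between two equally long expression profiles."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)


# Hypothetical log-ratio profiles: one value per measured gene.
# strain_a and strain_b stand for knockouts in the same pathway,
# strain_c for a knockout in an unrelated pathway.
strain_a = [2.0, -1.0, 0.1, 1.5]
strain_b = [1.8, -1.2, 0.0, 1.4]
strain_c = [-0.2, 1.9, -1.5, 0.1]

same_pathway = pearson(strain_a, strain_b)
different_pathway = pearson(strain_a, strain_c)
```

With these toy values the two same-pathway strains correlate strongly and the unrelated strain does not, so any similarity-based clustering would place the first two together.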
We analyzed the similarities of mutated yeast strains with Self-Organizing Maps [9]. The clusters that had earlier been found by hierarchical clustering [2] were found by the SOM as well, verifying the viability of the method (Fig. 11.2). We were additionally able to propose some new groupings for the mutated yeast strains.
The conclusion from these first studies is that the SOM is a valuable addition to the
Figure 11.2: The Self-Organizing Map of the mutated yeast strains. Light shades denote dense areas in the expression space and dark shades sparse areas. The strains located near each other on the map are also nearby in the expression space. The manually drawn lines group the clusters derived earlier by hierarchical clustering [2] (small circles and boxes denote exceptions). As can be seen, approximately the same clusters can also be found based on this SOM display. The SOM display additionally visualizes the similarities of the clusters and the data between them.
Proper selection of the metric of the gene expression space is a well-acknowledged problem, since the data is high-dimensional, noisy, and contains uninteresting variation due to the several biological processes going on simultaneously within the cells. Our solution is to derive the metric from other biological data sets such as the functional classification (for details see Section 12).
More generally, there often exist several data sets derived from proteomics, gene expression, and genetic sequences that could be combined to yield a more accurate picture of cell function.
Discriminative clustering (DC) [10] is a principled way to derive the metric used in clustering from auxiliary information. The first preliminary results of this approach applied to gene expression data were presented in [5, 10]. DC was able to form clusters in the gene expression space that were more homogeneous with respect to the distribution of functional classes than the clusters produced by the other methods. More detailed biological analysis and interpretation of these results is in progress.
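One way to quantify how homogeneous a clustering is with respect to functional classes is the empirical mutual information between cluster assignments and class labels. The sketch below computes that quantity only; it is not the DC algorithm itself, and the function name and toy labels are assumptions made for illustration.

```python
import math
from collections import Counter


def mutual_information(clusters, classes):
    """Empirical mutual information (in bits) between two label sequences."""
    n = len(clusters)
    joint = Counter(zip(clusters, classes))
    cluster_counts = Counter(clusters)
    class_counts = Counter(classes)
    mi = 0.0
    for (c, y), n_cy in joint.items():
        p_cy = n_cy / n
        p_c = cluster_counts[c] / n
        p_y = class_counts[y] / n
        mi += p_cy * math.log2(p_cy / (p_c * p_y))
    return mi


# Toy labels: four genes, two functional classes.
classes = ["a", "a", "b", "b"]
homogeneous = mutual_information([0, 0, 1, 1], classes)  # clusters match classes
mixed = mutual_information([0, 1, 0, 1], classes)        # clusters ignore classes
```

A clustering whose clusters each contain a single class attains the highest value, while one that is statistically independent of the classes gives zero, so higher values indicate clusters that are more homogeneous in the sense used above.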
The learning metrics principle can be applied to Self-Organizing Maps as well [6]. An
application to gene expression data is in progress.
[1] Michael B. Eisen, Paul T. Spellman, Patrick O. Brown, and David Botstein. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences, USA, 95:14863–14868, 1998.
[2] Timothy R. Hughes, Matthew J. Marton, Allan R. Jones, et al. Functional discovery
via a compendium of expression profiles. Cell, 102:109–126, 2000.
[3] Samuel Kaski, Janne Nikkilä, Petri Törönen, Eero Castrén, and Garry Wong. Analysis and visualization of gene expression data using self-organizing maps. In Proceedings of NSIP-01, IEEE-EURASIP Workshop on Nonlinear Signal and Image Processing. 2001.
[4] Samuel Kaski and Janne Sinkkonen. A topography-preserving latent variable model with learning metrics. In Nigel Allinson, Hujun Yin, Lesley Allinson, and Jon Slack, editors, Advances in Self-Organizing Maps, pages 224–229. Springer, London, 2001.
[5] Samuel Kaski, Janne Sinkkonen, and Janne Nikkilä. Clustering gene expression data by mutual information with gene function. In Georg Dorffner, Horst Bischof, and Kurt Hornik, editors, Artificial Neural Networks—ICANN 2001, pages 81–86. Springer, Berlin, 2001.
[6] Samuel Kaski, Janne Sinkkonen, and Jaakko Peltonen. Bankruptcy analysis with self-organizing maps in learning metrics. IEEE Transactions on Neural Networks, 12:936–947, 2001.
[7] Nature Genetics Supplement. The chipping forecast. Nature Genetics, 21(1):1–60,
[8] Nature Insight. Functional genomics. Nature, 405:819–846, 2000.
[9] Merja Oja, Janne Nikkilä, Petri Törönen, Garry Wong, Eero Castrén, and Samuel Kaski. Exploratory clustering of gene expression profiles of mutated yeast strains. In Wei Zhang and Ilya Shmulevich, editors, Computational And Statistical Approaches To Genomics. Kluwer, 2002. In press.
[10] Janne Sinkkonen and Samuel Kaski. Clustering based on conditional distributions in
an auxiliary space. Neural Computation, 14:217–239, 2002.
[11] Ognjenka Goga Vukmirovic and Shirley M. Tilghman. Exploring genome space. Na-