ment 32, 2002). The best of the markers that have emerged from this research have already shown promise for both diagnostic and prognostic clinical use.

Applications to tumor classification have attracted particular interest because it has been estimated that over 40,000 cancer cases per year in the United States present major classification challenges for existing clinical and histological approaches. Gene-expression microarrays for the first time offer the possibility of basing diagnoses on the global gene-expression profile of the tumor cells. Moreover, the discovery of gene-expression patterns that are significantly correlated with tumor phenotype can clarify molecular mechanisms of pathogenesis and potentially identify new strategies for treatment (Shipp et al., 2002). Similar opportunities exist for many other poorly understood diseases. For example, the recent discovery that a set of genes in the oxidative-phosphorylation pathway is more highly expressed in the muscle biopsies of normal controls than in those of patients with Type 2 diabetes has opened new avenues in diabetes research (Mootha et al., 2003). Closer study of the most highly correlated genes in this set led to the hypothesis that PGC-1α might regulate this subset of genes, a hypothesis that was then confirmed by further laboratory study. By this route, an aberration in PGC-1α expression has become a prime candidate for being a step in disease progression.

The pattern-recognition techniques required for analyzing gene-expression data and other large biological data sets are often called supervised and unsupervised learning. Machine-learning tools based on these techniques, designed in collaborations between bioscientists and mathematical scientists, have already come into widespread use. Pattern recognition via supervised and unsupervised learning is based on quantitative, stochastic descriptions of the data, sometimes referred to as associative models. These models typically incorporate few or no assumptions about the mechanistic basis for the patterns that they seek to discover.

Unsupervised learning techniques elucidate the structure in a data set without using any a priori labeling of the data, which makes them useful during exploratory analysis. Supervised techniques create models for classifying data by training on labeled members of the classes that are to be distinguished, for example, invasive and noninvasive tumors. Supervised techniques have an advantage over unsupervised techniques because they are less easily misled by structure that is not directly relevant to the distinction of interest, such as differences between the laboratories in which the data were collected. Unfortunately, labeled training sets are not available in many biological situations.
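The contrast between the two approaches can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and none of it comes from the studies cited above: the synthetic two-gene "profiles", the size of the class and laboratory effects, and the choice of two-means clustering and a nearest-centroid classifier as simple stand-ins for unsupervised and supervised methods. Gene 0 shifts slightly with tumor class, while gene 1 shifts strongly with the laboratory of origin.

```python
import random

def make_samples(seed, n_per_group=10):
    """Synthetic (profile, tumor, lab) triples; all effect sizes are invented."""
    rng = random.Random(seed)
    samples = []
    for tumor in (0, 1):
        for lab in (0, 1):
            for _ in range(n_per_group):
                profile = (2.0 * tumor + rng.gauss(0, 0.1),   # weak class signal
                           10.0 * lab + rng.gauss(0, 0.1))    # strong lab effect
                samples.append((profile, tumor, lab))
    return samples

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    return tuple(sum(col) / len(points) for col in zip(*points))

def two_means(profiles, n_iter=20):
    """Unsupervised: k-means with k=2; no labels are used.
    Centers start at the first profile and the profile farthest from it."""
    centers = [profiles[0],
               max(profiles, key=lambda p: sq_dist(p, profiles[0]))]
    for _ in range(n_iter):
        assign = [min((0, 1), key=lambda k: sq_dist(p, centers[k]))
                  for p in profiles]
        centers = [centroid([p for p, a in zip(profiles, assign) if a == k])
                   for k in (0, 1)]
    return assign

def nearest_centroid_fit(profiles, labels):
    """Supervised: one centroid per labeled class."""
    return {c: centroid([p for p, l in zip(profiles, labels) if l == c])
            for c in set(labels)}

def nearest_centroid_predict(model, profile):
    return min(model, key=lambda c: sq_dist(profile, model[c]))

def agreement(a, b):
    """Fraction of matching binary labels, ignoring which group is called 0 or 1."""
    match = sum(x == y for x, y in zip(a, b)) / len(a)
    return max(match, 1.0 - match)

train = make_samples(seed=0)
test = make_samples(seed=1)
profiles = [p for p, _, _ in train]
tumor = [t for _, t, _ in train]
lab = [l for _, _, l in train]

# Unsupervised clustering follows the dominant structure, here the lab.
clusters = two_means(profiles)
cluster_vs_lab = agreement(clusters, lab)
cluster_vs_tumor = agreement(clusters, tumor)

# Supervised training on tumor labels targets the distinction of interest.
model = nearest_centroid_fit(profiles, tumor)
predicted = [nearest_centroid_predict(model, p) for p, _, _ in test]
accuracy = sum(y == t for y, (_, t, _) in zip(predicted, test)) / len(test)

print("clusters vs. lab:  ", cluster_vs_lab)
print("clusters vs. tumor:", cluster_vs_tumor)
print("supervised accuracy on held-out samples:", accuracy)
```

In this contrived setting the clusters line up with the laboratory rather than with tumor class, whereas the classifier trained on labels recovers the class distinction. Real expression data involve thousands of genes and far noisier signals, so the sketch shows only the direction of the effect, not its size.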

General machine-learning algorithms that are potentially useful in this area of research stem from fields such as psychology and systematics. The