small P, large n contexts, estimates of the parameter set are expected to improve as the number of data points increases, and much of the machinery of classical statistics addresses trade-offs between the reliability of parameter estimates and the number of samples analyzed.
In many biological research settings, the statistical challenge is quite different. Individual experiments (for example, a microarray-based measurement of the levels of thousands of messenger RNAs in a single RNA sample extracted from a particular tumor) are often information-rich. However, the number of independent measurements from which a biologist seeks to draw conclusions (e.g., the number of tumors analyzed) may be quite small. Similarly, geneticists are now contemplating measuring the genetic variants present at ~10⁵ sites across the genomes of individual research subjects, even though the number of individuals (for example, the number of cases and controls in a disease-susceptibility study) is under severe practical constraints. The challenge in these situations is analogous to attempting to reach conclusions from a moderate number of photographs of, for example, profitable and unprofitable restaurants. Although increasing the number of restaurants photographed would certainly improve the reliability of the study, presuming that the sampling strategy was well considered, success would depend even more heavily on strategies for representing and modeling the immense amount of information in each photograph. Interest among biologists in problems with similar statistical properties has grown dramatically in recent years. The committee examines this phenomenon here as a prime example of a crosscutting mathematical theme on the interface between mathematics and contemporary biology. It illustrates both the progress that has been made during the past decade and the challenges that lie ahead.
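One reason the small n, large P setting is statistically hazardous can be seen in a short simulation. The sketch below is illustrative only and is not drawn from any study cited here: it generates pure-noise "expression" values for P features, assigns half of the samples to each of two arbitrary groups, and reports the largest absolute correlation between any feature and the group label. The function name, the sample sizes, and the use of Gaussian noise are all assumptions made for the illustration.

```python
import numpy as np

def max_chance_correlation(n_samples, n_features, seed=0):
    """Largest |Pearson r| between a two-group label and pure-noise features.

    Hypothetical helper for illustration; n_samples must be even.
    """
    rng = np.random.default_rng(seed)
    # Noise only: no feature is truly associated with the label.
    X = rng.standard_normal((n_samples, n_features))
    # First half of the samples labeled 0, second half labeled 1.
    y = np.repeat([0.0, 1.0], n_samples // 2)
    # Center, then compute the correlation of every column with y at once.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return float(np.abs(r).max())

# Small n, large P versus a larger n with the same P.
small_n = max_chance_correlation(n_samples=20, n_features=10_000)
large_n = max_chance_correlation(n_samples=500, n_features=10_000)
print(f"n=20,  P=10^4: max |r| = {small_n:.2f}")
print(f"n=500, P=10^4: max |r| = {large_n:.2f}")
```

With only 20 samples, some noise feature will typically appear strongly associated with the label purely by chance, whereas with 500 samples the largest chance correlation is much weaker. This is one way of seeing why adding samples helps, and also why careful modeling of the many features matters when samples are scarce.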
Although the small n, large P problem is encountered in many biological contexts, the challenges of interpreting gene-expression data provide a prototypical example that is of substantial current interest. The development of microarray technology, which can yield the transcription profiles of >10⁴ genes in a single experiment, has enabled global approaches to understanding regulatory processes in normal or disease states. Substantial work has been done on the selection and analysis of differentially expressed genes for purposes ranging from the discovery of new gene functions and the classification of cell types to the prediction of clinically important biological phenotypes (Nature Genetics Supplement 21, 1999; Golub et al., 1999; Tamayo et al., 1999; Nature Genetics Supple-