Brook, and math/physics initiatives at various universities, such as Duke University, the University of Pennsylvania, and the University of California at Santa Barbara.

NEW FRONTIERS IN STATISTICAL INFERENCE

We live in a new age for statistical inference, in which technologies routinely produce high-dimensional data sets, often with huge numbers of measurements on each of a comparatively small number of experimental units. Examples include gene expression microarrays that monitor the expression levels of tens of thousands of genes at once and functional magnetic resonance imaging machines that monitor blood flow in various parts of the brain. The breathtaking increase in data-acquisition capability is such that millions of parallel data sets are routinely produced, each with its own estimation or testing problem. This era of scientific mass production calls for novel developments in statistical inference and has inspired a tremendous burst of statistical methodology. More important, the data flood has transformed the questions that need to be answered, and the field of statistics has accordingly changed profoundly over the last 15 years. The shift is so pronounced that the subjects of contemporary research bear little resemblance to the topics that dominated discussion in the early 1990s.

High dimensionality refers to an estimation or testing problem in which the number of parameters about which we seek inference is comparable to, or much larger than, the number of observations or samples at our disposal. Such problems are everywhere. In medical research, we may be interested in determining which genes might be associated with prostate cancer. A typical study may record expression levels of thousands of genes for perhaps 100 men, of whom half have prostate cancer and half serve as controls. Here, one has to test thousands of hypotheses simultaneously in order to discover a small set of genes that could then be investigated for a causal link to cancer development. Another example is genome-wide association studies, where the goal is to test whether a variant in the genome is associated with a particular phenotype. Here the subjects in a study typically number in the tens of thousands, and the number of hypotheses may be anywhere from 500,000 to 2.5 million. If we are interested in a number of phenotypes, the number of hypotheses can easily rise into the billions and trillions, something the early literature on multiple testing never had in mind.
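To make the scale of the problem concrete, the sketch below runs one two-sample t-test per gene on simulated expression data. The study dimensions, effect size, and per-test cutoff are illustrative assumptions, not values taken from any particular study.

```python
# Illustrative sketch: large-scale simultaneous testing on simulated
# gene-expression data (all numbers are assumptions for illustration).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_genes, n_cases, n_controls = 6000, 50, 50   # assumed study dimensions

# Null expression levels for every gene in both groups ...
cases = rng.normal(size=(n_genes, n_cases))
controls = rng.normal(size=(n_genes, n_controls))
# ... with a small subset of genes truly differentially expressed.
cases[:100] += 0.8

# One two-sample t-test per gene: thousands of hypotheses tested at once.
t_stat, p_values = stats.ttest_ind(cases, controls, axis=1)

# Naive per-test thresholding at 0.05 flags many genes by chance alone.
print("genes flagged at p < 0.05:", int((p_values < 0.05).sum()))
```

Even when almost all genes are null, roughly 5 percent of them clear a per-test 0.05 threshold purely by chance, which is precisely why the multiple-testing machinery discussed next is needed.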

In response, the statistical community has developed groundbreaking techniques such as the false discovery rate (FDR) of Benjamini and Hochberg, which proposes a new paradigm for multiple comparisons and has had a tremendous impact not only on statistical science but also in the
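The Benjamini-Hochberg step-up procedure itself is short. The following is a minimal sketch, assuming a vector of p-values such as the one computed above; the function name and the target FDR level q = 0.1 are illustrative choices.

```python
# Minimal sketch of the Benjamini-Hochberg step-up procedure.
import numpy as np

def benjamini_hochberg(p_values, q=0.1):
    """Return a boolean mask of rejected hypotheses, targeting FDR level q."""
    p = np.asarray(p_values)
    m = p.size
    order = np.argsort(p)                  # sort the m p-values
    thresholds = q * np.arange(1, m + 1) / m
    below = p[order] <= thresholds         # compare p_(i) with i * q / m
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])   # largest i with p_(i) <= i * q / m
        rejected[order[: k + 1]] = True    # reject the k smallest p-values
    return rejected

# Continuing the simulated example above:
# discoveries = benjamini_hochberg(p_values, q=0.1)
```

Under standard independence assumptions, the procedure keeps the expected proportion of false discoveries among the rejected hypotheses at or below q, a far less stringent and far more useful criterion at this scale than controlling the probability of even a single false positive.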


