medical sciences and beyond.16 In a nutshell, the FDR procedure controls the expected ratio between the number of false rejections and the total number of rejections. Returning to our example above, this allows the statistician to return a list of genes to the medical researcher assuring her that she should expect at most a known fraction of these genes, say 10 percent, to be “false discoveries.” This new paradigm has been extremely successful, for it enjoys increased power (the ability of making true discoveries) while simultaneously safeguarding against false discoveries. The FDR methodology assumes that the hypotheses being tested are statistically independent and that the data distribution under the null hypothesis is known. These assumptions may not always be valid in practice, and much of statistical research is concerned with extending statistical methodology to these challenging setups. In this direction, recent years have seen a resurgence of empirical Bayes techniques, made possible by the onslaught of data and providing a powerful framework and new methodologies to deal with some of these issues.
Estimation problems are also routinely high-dimensional. In a genetic association study, n subjects are sampled and one or more quantitative traits, such as cholesterol level, are recorded. Each subject is also measured at p locations on the chromosomes. For instance, one may record a value (0, 1, or 2) indicating the number of copies of the less-common allele observed. To find genes exhibiting a detectable association with the trait, one can cast the problem as a high-dimensional regression problem. That is to say, one seeks to express the response of interest (cholesterol level) as a linear combination of the measured genetic covariates; those covariates with significant coefficients are linked with the trait.
The issue is that the number n of samples (equations) is in the thousands while the number p of covariates is in the hundreds of thousands. Hence, we have far fewer equations than unknowns, so what shall we do? This is a burning issue because such underdetermined systems arise everywhere in science and engineering. In magnetic resonance imaging, for example, one would like to infer a large number of pixels from just a small number of linear measurements. In many problems, however, the solution is assumed to be sparse. In the example above, it is known that only a small number of genes can potentially be associated with a trait. In medical imaging, the image we wish to form typically has a concise description in a carefully chosen representation.
In recent years, statisticians and applied mathematicians have developed a flurry of highly practical methods for such sparse regression problems. Most of these methods rely on convex optimization, a field that has
16 As of January 15, 2012, Google Scholar reported 12,861 scientific papers citing the original article of Benjamini and Hochberg.