ILLUSTRATIVE DEMANDS FROM ASTRONOMY AND PHYSICS
The mathematical sciences continue to encounter productive challenges from physics, and there may be new opportunities in the coming years. For example, there are still open mathematical problems stemming from general relativity and from the mathematical descriptions of black holes and their rotation. The Laser Interferometer Gravitational-Wave Observatory (LIGO) project may well produce data that stimulate work in these directions. In another direction, recent mathematical advances in understanding convergence properties of the Boltzmann equation might open the door to progress on some fundamental issues.
For some specific demands, the committee examined the 2003 NRC report The Sun to the Earth—and Beyond: A Decadal Research Strategy in Solar and Space Physics. This report poses several major challenges for that field that in turn pose associated challenges for the mathematical sciences. For example,
Challenge 1: Understanding the structure and dynamics of the Sun’s interior, the generation of solar magnetic fields, the origin of the solar cycle, the causes of solar activity, and the structure and dynamics of the corona.1
This challenge will require advances in multiscale methods and complex simulations involving turbulence.
1 National Research Council, 2003. The Sun to the Earth—and Beyond: A Decadal Research Strategy in Solar and Space Physics. The National Academies Press, Washington, D.C., p. 2.
Later, that same report identified the following additional challenges that can only be addressed through related advances in the mathematical sciences:
In the coming decade, the deployment of clusters of satellites and large arrays of ground-based instruments will provide a wealth of data over a very broad range of spatial scales. Theory and computational models will play a central role, hand in hand with data analysis, in integrating these data into first-principles models of plasma behavior. . . . The Coupling Complexity research initiative will address multiprocess coupling, nonlinearity, and multiscale and multiregional feedback in space physics. The program advocates both the development of coupled global models and the synergistic investigation of well-chosen, distinct theoretical problems. For major advances to be made in understanding coupling complexity in space physics, sophisticated computational tools, fundamental theoretical analysis, and state-of-the-art data analysis must all be integrated under a single umbrella program.2
The coming decade will see the availability of large space physics databases that will have to be integrated into physics-based numerical models. . . . The solar and space physics community has not until recently had to address the issue of data assimilation as seriously as have the meteorologists. However, this situation is changing rapidly, particularly in the ionospheric arena.3
Another example comes from the 2008 NRC report The Potential Impact of High-End Capability Computing on Four Illustrative Fields of Science and Engineering. This report identified the major research challenges in four disparate fields and the subset that depend on advances in computing. Those advances in computing are innately tied to research in the mathematical sciences. In the case of astrophysics, the report identified the following essential needs:
• N-body codes. Required to investigate the dynamics of collisionless dark matter, or to study stellar or planetary dynamics. The mathematical model is a set of first-order ODEs for each particle, with acceleration computed from the gravitational interaction of each particle with all the others. Integrating particle orbits requires standard methods for ODEs, with variable time stepping for close encounters. For the gravitational acceleration (the major computational challenge), direct summation, tree algorithms, and grid-based methods are all used to compute the gravitational potential from Poisson’s equation.
2 Ibid., pp. 64-66.
3 Ibid., pp. 89-90.
• PIC codes. Required to study the dynamics of weakly collisional, dilute plasmas. The mathematical model consists of the relativistic equations of motion for particles, plus Maxwell’s equations for the electric and magnetic fields they induce (a set of coupled first-order PDEs). Standard techniques are based on particle-in-cell (PIC) algorithms, in which Maxwell’s equations are solved on a grid using finite-difference methods and the particle motion is calculated by standard ODE integrators.
• Fluid dynamics. Required for strongly collisional plasmas. The mathematical model comprises the standard equations of compressible fluid dynamics (the Euler equations, a set of hyperbolic PDEs), supplemented by Poisson’s equation for self-gravity (an elliptic PDE), Maxwell’s equations for magnetic fields (an additional set of hyperbolic PDEs), and the radiative transfer equation for photon or neutrino transport (a high-dimensional parabolic PDE). A wide variety of algorithms for fluid dynamics are used, including finite-difference, finite-volume, and operator-splitting methods on orthogonal grids, as well as particle methods that are unique to astrophysics—for example, smoothed particle hydrodynamics (SPH). To improve resolution across a broad range of length scales, grid-based methods often rely on static and adaptive mesh refinement (AMR). The AMR methods greatly increase the complexity of the algorithm, reduce the scalability, and complicate effective load-balancing, yet they are absolutely essential for some problems.
• Transport problems. Required to calculate the effect of transport of energy and momentum by photons or neutrinos in a plasma. The mathematical model is a parabolic PDE in seven dimensions. Both grid-based (characteristic) and particle-based (Monte Carlo) methods are used. The high dimensionality of the problem makes first-principles calculations difficult, and so simplifying assumptions (for example, frequency-independent transport, or the diffusion approximation) are usually required.
• Microphysics. Necessary to incorporate nuclear reactions, chemistry, and ionization/recombination reactions into fluid and plasma simulations. The mathematical model is a set of coupled nonlinear, stiff ODEs (or algebraic equations if steady-state abundances are assumed) representing the reaction network. Implicit methods are generally required if the ODEs are solved. Implicit finite-difference methods for integrating realistic networks with dozens of constituent species are extremely costly.4
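The gravitational-acceleration kernel named in the first bullet as the major computational challenge can be made concrete. The following is a minimal sketch of the O(N²) direct-summation approach—the baseline that tree algorithms and grid-based Poisson solvers are designed to accelerate. The softening length, unit masses, and two-particle setup are illustrative assumptions, not values from the report.

```python
import numpy as np

def accelerations(pos, mass, G=1.0, eps=1e-3):
    """Direct O(N^2) summation of gravitational accelerations.

    pos  : (N, 3) array of particle positions
    mass : (N,)   array of particle masses
    eps  : softening length, used to regularize close encounters
    """
    # Pairwise separation vectors r_ij = x_j - x_i
    dx = pos[None, :, :] - pos[:, None, :]           # shape (N, N, 3)
    r2 = (dx ** 2).sum(axis=-1) + eps ** 2           # softened |r_ij|^2
    inv_r3 = r2 ** -1.5
    np.fill_diagonal(inv_r3, 0.0)                    # remove self-interaction
    # a_i = G * sum_j m_j * r_ij / |r_ij|^3
    return G * (dx * (mass[None, :, None] * inv_r3[:, :, None])).sum(axis=1)

# Two equal unit masses on the x axis attract each other symmetrically.
pos = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
acc = accelerations(pos, np.array([1.0, 1.0]))
```

Tree algorithms reduce the cost to roughly O(N log N) by approximating the force from distant groups of particles; the direct sum remains the reference against which such approximations are checked.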
In its look at atmospheric sciences, that same report identified the following necessary computational advances:
[Advancing the state of atmospheric science research requires] the development of (1) scalable implementations of uniform-grid methods aimed at
4 NRC, 2008. The Potential Impact of High-End Capability Computing on Four Illustrative Fields of Science and Engineering. The National Academies Press, Washington, D.C., p. 31.
the very highest performance and (2) a new generation of local refinement methods and codes for atmospheric, oceanic, and land modeling. . . . The propagation of uncertainty through a coupled model is particularly problematic, because nonlinear interactions can amplify the forced response of a system. In addition, it is often the case that we are interested in bounding the uncertainties in predictions of extreme, and hence rare, events, requiring a rather different set of statistical tools than those to study means and variances of large ensembles. New systematic theories about multiscale, multiphysics couplings are needed to quantify relationships better. This will be important as atmospheric modeling results are coupled with economic and impact models. Building a better understanding of coupling and the quantification of uncertainties through coupled systems is necessary groundwork for supporting the decisions that will be made based on modeling results.5
ILLUSTRATIVE DEMANDS FROM ENGINEERING
The 2008 National Academy of Engineering report Grand Challenges for Engineering6 identified 14 major challenges. Following is a list of 11 of those challenges that depend on corresponding advances in the mathematical sciences, along with thoughts about the particular mathematical science research that will be needed.
• Make solar energy economical. This will require multiscale modeling of heterogeneous materials and better algorithms for modeling quantum-scale behaviors, and the mathematical sciences will contribute to both.
• Provide energy from fusion. This will require better methods for simulating multiscale, complex behavior, including turbulent flows, a topic challenging both mathematical scientists and domain scientists and engineers.
• Develop carbon sequestration methods. This will require better models of porous media and methods for modeling very large-scale heterogeneous and multiphysics systems.
• Advance health informatics. Requires statistical research to enable more precise and tailored inferences from increasing amounts of data.
• Engineer better medicines. Requires tools for bioinformatics and simulation tools for modeling molecular interactions and cellular machinery.
• Reverse-engineer the brain. Requires tools for network analysis, models of cognition and learning, and signal analysis.
5 Ibid., p. 58.
• Prevent nuclear terror. Network analysis can contribute to this, as can cryptology, data mining, and other intelligence tools.
• Secure cyberspace. Requires advances in cryptography and theoretical computer science.
• Enhance virtual reality. Requires improved algorithms for scene rendering and simulation.
• Advance personalized learning. Requires advances in machine learning.
• Engineer the tools of scientific discovery. Requires advances to enable multiscale simulations, including improved algorithms, and improved methods of data analysis.
Engineering in general is dependent on the mathematical sciences, and that dependency is strengthening as engineers push toward ever greater precision. As just one illustration of the pervasiveness of the mathematical sciences in engineering, note the following examples of major improvements in manufacturing that were provided at the 2011 annual meeting of the National Academy of Engineering by Lawrence Burns, retired vice president for R&D and strategic planning at General Motors Corporation:
1. Simultaneous engineering,
2. Design for manufacturing,
3. Math-based design and engineering (computer-aided design),
4. Six-sigma quality,
5. Supply chain management,
6. The Toyota production system, and
7. Life-cycle analysis.7
It is striking that six of the seven items are inextricably linked with the mathematical sciences. While it is obvious that Item 3 depends on mathematical advances and Item 4 relies on statistical concepts and analyses, Items 1, 2, 5, and 7 all depend on advances in simulation capabilities that make it possible to represent processes of ever-increasing complexity with ever-increasing fidelity. Research in the mathematical sciences is necessary to create those capabilities.
ILLUSTRATIVE DEMANDS FROM NETWORKING AND INFORMATION TECHNOLOGY
The 2010 report from the President’s Council of Advisors on Science and Technology (PCAST) Designing a Digital Future: Federally Funded
Research and Development in Networking and Information Technology8 identified four major recommendations for “initiatives and investments” in networking and information technology (NIT) to “achieve America’s priorities and advance key NIT research frontiers.” Three of those initiatives are dependent on ongoing research in areas of the mathematical sciences. First, the mathematical sciences are fundamental to advances in simulation and modeling:
The federal government should invest in a national, long-term, multi-agency, multi-faceted research initiative on NIT for energy and transportation. . . . Current research in the computer simulation of physical systems should be expanded to include the simulation and modeling of proposed energy-saving technologies, as well as advances in the basic techniques of simulation and modeling.9
Second, the mathematical sciences underpin cryptography and provide tools for systems analyses:
The federal government should invest in a national, long-term, multi-agency research initiative on NIT that assures both the security and the robustness of cyber-infrastructure
—to discover more effective ways to build trustworthy computing and communications systems,
—to continue to develop new NIT defense mechanisms for today’s infrastructure, and most importantly,
—to develop fundamentally new approaches for the design of the underlying architecture of our cyber-infrastructure so that it can be made truly resilient to cyber-attack, natural disaster, and inadvertent failure.10
Third, the mathematical sciences are strongly entwined with R&D for privacy protection, large-scale data analysis, high-performance computing, and algorithms:
The federal government must increase investment in those fundamental NIT research frontiers that will accelerate progress across a broad range of priorities . . . [including] research program on the fundamentals of privacy protection and protected disclosure of confidential data . . . fundamental
8 PCAST, 2010, Report to the President and Congress: Designing a Digital Future: Federally Funded Research and Development in Networking and Information Technology, p. xi. Available at http://www.whitehouse.gov/sites/default/files/microsites/ostp/pcast-nitrd-report-2010.pdf.
9 Ibid., p. xi.
10 Ibid., p. xii.
research in data collection, storage, management, and automated large-scale analysis based on modeling and machine learning . . . [and] continued investment in important core areas such as high performance computing, scalable systems and networking, software creation and evolution, and algorithms.11
The report goes on to point out that five broad themes cut across its recommendations, including the need for capabilities to exploit ever-increasing amounts of data, to improve cybersecurity, and to better ensure privacy. As noted above, progress on these fronts will require strong mathematical sciences research.
More generally, the mathematical sciences play a critical role in any science or engineering field that involves network structures. In the early years of U.S. telecommunications, very few networks existed, and each was under the control of a single entity (such as AT&T). For more than 50 years in this environment, relatively simple mathematical models of call traffic were extremely useful for network management and planning. The world is completely different today: It is full of diverse and overlapping networks, including the Internet, the World Wide Web, wireless networks operated by multiple service providers, and social networks, as well as networks arising in scientific and engineering applications, such as networks that describe intra- and intercellular processes in biology. Modern technologies built around the Internet are consumers of mathematics: for example, many innovations at Google, such as search, learning, and trend discovery, are based on the mathematical sciences.
In settings ranging from the Internet to transportation networks to global financial markets, interactions happen in the context of a complex network.12 The most striking feature of these networks is their size and global reach: They are built and operated by people and agents with diverse goals and interests. Much of today’s technology depends on our ability to build and maintain systems used by such a diverse set of users and to ensure that participants cooperate despite their competing interests. Such large and decentralized networks provide remarkable new opportunities for cooperation, but they also present great challenges. Despite the enormous amount of data collected by and about today’s networks, fundamental mathematical science questions about the nature, structure, evolution, and security of networks remain open, and they are of great interest to the government, to innovation-driven businesses, and to the general public.
11 Ibid., p. xiii.
12 The remainder of this section draws from pages 5, 6, and 10 of Institute for Pure and Applied Mathematics, 2012, Report from the Workshop on Future Directions in Mathematics. IPAM, Los Angeles.
For example, the sheer size of these networks makes it difficult to study them. The interconnectedness of financial networks makes financial interactions easy, but the financial crisis of 2008 provides a good example of the dangers of that interconnectivity. Understanding to what extent networks are susceptible to such cascading failures is an important area of study. From a very different domain, the network structure of the Web is what makes it most useful: Links provide the Web with a network structure and help us navigate it. But linking is also an endorsement of some sort. This network of implicit endorsements is what allows search companies such as Google to effectively find useful pages. Understanding how to harness such a large network of recommendations continues to be a challenge.
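The idea of harnessing the Web’s network of implicit endorsements can be illustrated with a sketch of PageRank-style power iteration, the kind of computation that underlies link-based search ranking. The tiny three-page link graph and the damping factor of 0.85 below are illustrative assumptions, not data from the report.

```python
import numpy as np

def pagerank(links, d=0.85, tol=1e-10):
    """Power iteration for PageRank on a dict {page: [pages it links to]}."""
    pages = sorted(links)
    n = len(pages)
    idx = {p: i for i, p in enumerate(pages)}
    # Column-stochastic link matrix: M[i, j] = 1/outdegree(j) if j links to i
    M = np.zeros((n, n))
    for src, outs in links.items():
        for dst in outs:
            M[idx[dst], idx[src]] = 1.0 / len(outs)
    r = np.full(n, 1.0 / n)              # start from the uniform distribution
    while True:
        r_new = (1 - d) / n + d * M @ r  # random jump plus link-following
        if np.abs(r_new - r).sum() < tol:
            return dict(zip(pages, r_new))
        r = r_new

# A tiny illustrative web: pages B and C both endorse A, so A ranks highest.
ranks = pagerank({"A": ["B"], "B": ["A"], "C": ["A"]})
```

The iteration converges because the damped link matrix defines an ergodic Markov chain; the resulting stationary distribution is the ranking.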
Game theory provides a mathematical framework that helps us understand the expected effects of interactions and develop good design principles for building and operating such networks. In this framework we think of each participant as a player in a noncooperative game wherein each player selects a strategy, selfishly trying to optimize his or her own objective function. The outcome of the game for each participant depends not only on that player’s own strategy but also on the strategies chosen by all the other players. This emerging area combines tools from many mathematical areas, including game theory, optimization, and theoretical computer science. The emergence of the Web and online social systems also gives graph theory an important new application domain.
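A minimal sketch of this noncooperative framework is a toy selfish-routing game: players choose between two parallel links whose cost grows with load, and each player repeatedly switches to whichever link is cheaper given everyone else’s choice. All details (player count, cost functions) are hypothetical; because such a game is a potential game, this best-response dynamic converges to a pure Nash equilibrium.

```python
def best_response_dynamics(n_players, cost_a, cost_b, max_rounds=100):
    """Each player repeatedly best-responds to the others' link choices."""
    choice = ["a"] * n_players               # start with everyone on link a
    for _ in range(max_rounds):
        changed = False
        for i in range(n_players):
            load_a = choice.count("a")
            load_b = n_players - load_a
            if choice[i] == "a":
                # cost of staying vs. cost player i would face after moving
                if cost_b(load_b + 1) < cost_a(load_a):
                    choice[i] = "b"; changed = True
            else:
                if cost_a(load_a + 1) < cost_b(load_b):
                    choice[i] = "a"; changed = True
        if not changed:                      # no one wants to deviate: Nash
            return choice
    return choice

# Ten players, two identical links with cost equal to their load:
# selfish play splits the load evenly, 5 on each link.
eq = best_response_dynamics(10, cost_a=lambda x: x, cost_b=lambda x: x)
```

The same machinery, scaled up, is used to reason about equilibria and the "price of anarchy" in real routing and network-formation settings.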
ILLUSTRATIVE DEMANDS FROM BIOLOGY
The 2009 NRC report A New Biology for the 21st Century13 had a great deal to say about emerging opportunities that will rely on advances in the mathematical sciences. Following are selected quotations:
• Fundamental understanding will require . . . computational modeling of their growth and development at the molecular and cellular levels. . . . The New Biology––integrating life science research with physical science, engineering, computational science, and mathematics–– will enable the development of models of plant growth in cellular and molecular detail. Such predictive models, combined with a comprehensive approach to cataloguing and appreciating plant biodiversity and the evolutionary relationships among plants, will allow scientific plant breeding of a new type, in which genetic changes can be targeted in a way that will predictably result in novel crops and crops adapted to their conditions of growth. . . . New quantitative methods—the methods of the New Biology—are being developed that use next-generation DNA sequencing
13 National Research Council, 2009, A New Biology for the 21st Century. The National Academies Press, Washington, D.C.
to identify the differences in the genomes of parental varieties, and to identify which genes of the parents are associated with particular desired traits through quantitative trait mapping.14
• Fundamental advances in knowledge and a new generation of tools and technologies are needed to understand how ecosystems function, measure ecosystem services, allow restoration of damaged ecosystems, and minimize harmful impacts of human activities and climate change. What is needed is the New Biology, combining the knowledge base of ecology with those of organismal biology, evolutionary and comparative biology, climatology, hydrology, soil science, and environmental, civil, and systems engineering, through the unifying languages of mathematics, modeling, and computational science. This integration has the potential to generate breakthroughs in our ability to monitor ecosystem function, identify ecosystems at risk, and develop effective interventions to protect and restore ecosystem function.15
• Recent advances are enabling biomedical researchers to begin to study humans more comprehensively, as individuals whose health is determined by the interactions between these complex structural and metabolic networks. On the path from genotype to phenotype, each network is interlocked with many others through intricate interfaces, such as feedback loops. Study of the complex networks that monitor, report, and react to changes in human health is an area of biology that is poised for exponential development. . . . Computational and modeling approaches are beginning to allow analysis of these complex systems, with the ultimate goal of predicting how variations in individual components affect the function of the overall system. Many of the pieces are identified, and some circuits and interactions have been described, but true understanding is still well beyond reach. Combining fundamental knowledge with physical and computational analysis, modeling and engineering, in other words, the New Biology approach, is going to be the only way to bring understanding of these complex networks to a useful level of predictability.16
• Such complex events as how embryos develop or how cells of the immune system differentiate . . . must be viewed from a global yet detailed perspective because they are composed of a collection of molecular mechanisms that include junctions that interconnect vast networks of genes. It is essential to take a broader view and analyze entire gene regulatory networks, and the circuitry of events underlying complex biological systems. . . . Analysis of developing and differentiating systems at a network level will be critical for understanding complex events of how tissues and organs are assembled. . . . Similarly, networks of proteins interact at a biochemical level to form complex metabolic machines that produce distinct cellular products. Understanding these
14 Ibid., pp. 22-23.
15 Ibid., p. 25.
16 Ibid., pp. 34-35.
and other complex networks from a holistic perspective offers the possibility of diagnosing human diseases that arise from subtle changes in network components.17
• Perhaps the most complex, fascinating, and least understood networks involve circuits of nerve cells that act in a coordinated fashion to produce learning, memory, movement, and cognition. . . . Understanding networks will require increasingly sophisticated, quantitative technologies to measure intermediates and output, which in turn may demand conceptual and technical advances in mathematical and computational approaches to the study of networks.18
Later, the report discusses some specific ways in which the mathematical sciences are enabling the New Biology:
[M]athematical underpinnings of the field . . . embrace probabilistic and combinatorial methods. Combinatorial algorithms are essential for solving the puzzles of genome assembly, sequence alignment, and phylogeny construction based on molecular data. Probabilistic models such as Hidden Markov models and Bayesian networks are now applied to gene finding and comparative genomics. Algorithms from statistics and machine learning are applied to genome-wide association studies and to problems of classification, clustering and feature selection arising in the analysis of large-scale gene expression data.19
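As a sketch of how hidden Markov models are applied to gene finding, the following toy two-state HMM labels GC-rich stretches of a DNA sequence as "coding" using the Viterbi algorithm. All probabilities below are illustrative assumptions, not trained values from any real gene finder.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable state path of an HMM, computed in log space."""
    V = [{s: (math.log(start_p[s]) + math.log(emit_p[s][obs[0]]), [s])
          for s in states}]
    for o in obs[1:]:
        layer = {}
        for s in states:
            # best predecessor state for s, by accumulated log-probability
            lp, prev = max(
                (V[-1][p][0] + math.log(trans_p[p][s]), p) for p in states)
            layer[s] = (lp + math.log(emit_p[s][o]), V[-1][prev][1] + [s])
        V.append(layer)
    return max(V[-1].values())[1]

# Toy gene finder: "coding" regions are GC-rich, "noncoding" are AT-rich.
states = ("coding", "noncoding")
start = {"coding": 0.5, "noncoding": 0.5}
trans = {"coding": {"coding": 0.9, "noncoding": 0.1},
         "noncoding": {"coding": 0.1, "noncoding": 0.9}}
emit = {"coding":    {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
        "noncoding": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1}}
path = viterbi("ATATGCGCGC", states, start, trans, emit)
```

The sticky transition probabilities (0.9) keep the decoder from switching labels on every base, so it segments the sequence into an AT-rich noncoding run followed by a GC-rich coding run.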
As another illustration, the 2008 NRC report The Potential Impact of High-End Capability Computing on Four Illustrative Fields of Science and Engineering20 identified essential challenges for evolutionary biology, a number of which are strongly mathematical. Following are selected quotations:
• Standard phylogenetic analysis comparing the possible evolutionary relationships between two species can be done using the method of maximum parsimony, which assumes that the simplest answer is the best one, or using a model-based approach. The former entails counting character change on alternative phylogenetic trees in order to find the tree that minimizes the number of character transformations. The latter incorporates specific models of character change and uses a minimization criterion to choose among the sampled trees, which often involves finding the tree with the highest likelihood. Counting, or optimizing, change on a tree, whether in a parsimony or model-based framework, is a computationally efficient problem. But sampling all possible trees to
17 Ibid., p. 35.
19 Ibid., p. 62.
20 NRC, 2008, The Potential Impact of High-End Capability Computing on Four Illustrative Fields of Science and Engineering. The National Academies Press, Washington, D.C.
find the optimal solution scales precipitously with the number of taxa (or sequences) being analyzed. . . . Thus, it has long been appreciated that finding an exact solution to a phylogenetic problem of even moderate size is NP-complete. . . . Accordingly, numerous algorithms have been introduced to search heuristically across tree space and are widely employed by biological investigators using platforms that range from desktop workstations to supercomputers. These algorithms include methods [such as] simulated annealing, genetic (evolutionary) algorithmic searches, and Bayesian Markov chain Monte Carlo (MCMC) approaches.21
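The parsimony step of "counting character change on a tree" can be sketched with the classic Fitch algorithm (a standard method, though not named in the quotation): a post-order pass over a fixed rooted binary tree that counts the minimum number of state changes for one character. The four-taxon example is illustrative.

```python
def fitch(tree, leaf_states):
    """Fitch parsimony: minimum number of character changes on a rooted
    binary tree, given one character state per leaf.

    tree: nested tuples, e.g. (("a", "b"), ("c", "d")); leaves are names.
    """
    changes = 0

    def post_order(node):
        nonlocal changes
        if isinstance(node, str):          # leaf: its observed state set
            return {leaf_states[node]}
        left, right = post_order(node[0]), post_order(node[1])
        if left & right:                   # children agree: intersect
            return left & right
        changes += 1                       # children disagree: one change
        return left | right

    post_order(tree)
    return changes

# Four taxa with states A, A, G, G on the tree ((a,b),(c,d)):
# a single A<->G change on the internal branch suffices.
n = fitch((("a", "b"), ("c", "d")),
          {"a": "A", "b": "A", "c": "G", "d": "G"})
```

Scoring any one tree is cheap (linear in the number of taxa); the computational burden quoted above comes from the super-exponential number of candidate trees to score.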
• The accumulation of vast amounts of DNA sequence data and our expanding understanding of molecular evolution have led to the development of increasingly complex models of molecular evolutionary change. As a consequence, the enlarged parameter space required by these molecular models has increased the computational challenges confronting phylogeneticists, particularly in the case of data sets that combine numerous genes, each with their own molecular dynamics. . . . Phylogeneticists are more and more concerned about having statistically sound measures of estimating branch support. In model-based approaches, in particular, such procedures are computationally intensive, and the model structure scales significantly with the size of the number of taxa and the heterogeneity of the data. In addition, more attention is being paid to statistical models of molecular evolution.22
• A serious major computational challenge is to generate qualitative and quantitative models of development as a necessary prelude to applying sophisticated evolutionary models to understand how developmental processes evolve. Developmental biologists are just beginning to create the algorithms they need for such analyses, based on relatively simple reaction-rate equations, but progress is rapid. . . . Another important breakthrough in the field is the analysis of gene regulatory networks [which] describe the pathways and interactions that guide development.23
• Modern sequencing technologies routinely yield relatively short fragments of a genomic sequence, from 25 to 1,000 [base pairs]. Whole genomes range in size from the typical microbial sequence, which has millions of base pairs, to plant and animal sequences, which often consist of billions of base pairs. Additionally, ‘metagenomic’ sequencing from environmental samples often mixes fragments from dozens to hundreds of different species and/or ecotypes. The challenge is to take these short subsequences and assemble them to reconstruct the genomes of species and/or ecosystems. . . . While the fragment assembly problem is NP-complete, heuristic algorithms have produced high-quality reconstructions of hundreds of genomes. The recent trend is toward
21 Ibid., p. 65.
23 Ibid., p. 72.
methods of sequencing that can inexpensively generate large numbers (hundreds of millions) of ultrashort sequences (25-50 bp). Technical and algorithmic challenges include the following:
—Parallelization of all-against-all fragment alignment computations.
—Development of methods to traverse the resulting graphs of fragment alignments to maximize some feature of the assembly path.
—Heuristic pruning of the fragment alignment graph to eliminate experimentally inconsistent subpaths.
—Signal processing of raw sequencing data to produce higher quality fragment sequences and better characterization of their error profiles.
—Development of new representations of the sequence-assembly problem—for example, string graphs that represent data and assembly in terms of words within the dataset.
—Alignment of error-prone resequencing data from a population of individuals against a reference genome to identify and characterize individual variations in the face of noisy data.
—Demonstration that the new methodologies are feasible, by producing and analyzing suites of simulated data sets.24
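A minimal sketch of the fragment-assembly idea described above: compute all-against-all suffix-prefix overlaps and greedily merge the best-overlapping pair until no usable overlaps remain. The toy error-free reads are illustrative; real assemblers work with noisy data at vastly larger scale and use the graph representations listed above, since the underlying problem is NP-complete.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(frags, min_len=3):
    """Repeatedly merge the fragment pair with the largest overlap."""
    frags = list(frags)
    while len(frags) > 1:
        best = max(((overlap(a, b, min_len), a, b)
                    for a in frags for b in frags if a != b),
                   key=lambda t: t[0])
        k, a, b = best
        if k == 0:
            break                         # no usable overlaps remain
        frags.remove(a)
        frags.remove(b)
        frags.append(a + b[k:])           # merge along the shared overlap
    return frags

# Three reads sampled from the toy "genome" ATTAGACCTG:
contigs = greedy_assemble(["ATTAGACC", "GACCTG", "TAGACCT"])
```

The greedy heuristic reconstructs the toy genome exactly; on real data, repeats and sequencing errors are what force the more sophisticated graph-based formulations.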
• Once we have a reconstructed genomic or metagenomic sequence, a further challenge is to identify and characterize its functional elements: protein-coding genes; noncoding genes, including a variety of small RNAs; and regulatory elements that control gene expression, splicing, and chromatin structure. Algorithms to identify these functional regions use both statistical signals intrinsic to the sequence that are characteristic of a particular type of functional region and comparative analyses of closely and/or distantly related sequences. Signal-detection methods have focused on hidden Markov models and variations on them. Secondary structure calculations take advantage of stochastic, context-free grammars to represent long-range structural correlations.25
• Comparative methods require the development of efficient alignment methods and sophisticated statistical models for sequence evolution that are often intended to quantitatively model the likelihood of a detected alignment given a specific model of evolution. While earlier models treated each position independently, as large data sets became available the trend is now to incorporate correlations between sites. To compare dozens of related sequences, phylogenetic methods must be integrated with signal detection.26
ILLUSTRATIVE DEMANDS FROM MEDICINE
There is increasing awareness in medicine of the benefits from collaboration with mathematical scientists, including in areas where past focus has
24 Ibid., pp. 77-78.
tended to be clinical, with little reference to mathematics. In areas ranging from medical imaging, drug discovery, and the discovery of genes linked to hereditary diseases to personalized medicine, treatment validation, cost-benefit analysis, and robotic surgery, the mathematical sciences are ever-present. The role of the mathematical sciences in medicine has become so pervasive that one can provide only a few examples.
One example is a program of the Office of Physical Sciences-Oncology,27 funded by the National Cancer Institute, to support innovative approaches to understanding and controlling cancer; the program explicitly seeks teams that may include mathematical scientists. Some of the associated mathematical science questions arise in the creation of sophisticated models of cancer, the optimization of cancer chemotherapy regimens, and the rapid incorporation of data into models and treatment. (As shown in Appendix C, the National Institutes of Health overall allocates some $90 million per year for mathematical sciences research.) As another example, looking just at the February 2011 issue of Physical Biology, one sees articles whose titles contain the following mathematics-related phrases: “stochastic dynamics,” “micromechanics,” “an evolutionary game theoretical view,” and “an Ising model of cancer and beyond.” In the same vein, a research highlight published in Cancer Discovery in July 2011 describes development of “a statistical approach that maps out the order in which . . . [cancer-related] abnormalities arise.” It is increasingly common to find such cross-fertilization.
RESTORING REPRODUCIBILITY IN SCIENCE
A recent article in the Wall Street Journal opens with the following story:
Two years ago, a group of Boston researchers published a study describing how they had destroyed cancer tumors by targeting a protein called STK33. Scientists at biotechnology firm Amgen Inc. quickly pounced on the idea and assigned two dozen researchers to try to repeat the experiment with a goal of turning the findings into a drug. It proved to be a waste of time and money. After six months of intensive lab work, Amgen found it couldn’t replicate the results and scrapped the project. ‘I was disappointed but not surprised,’ says Glenn Begley, vice president of research at Amgen of Thousand Oaks, Calif. ‘More often than not, we are unable to reproduce findings’ published by researchers in journals.28
28 Gautam Naik, Scientists’ elusive goal: Reproducing study results. Wall Street Journal, December 2, 2011.
A related paper29 reports on a study done by Bayer Healthcare that examined 67 published research articles for which Bayer had attempted to replicate the findings in-house. Fewer than one-quarter were viewed as having been essentially replicated, while more than two-thirds exhibited major inconsistencies, in most cases leading Bayer to terminate the projects.
One reason for this lack of reproducibility is undoubtedly the pressure to publish and find positive results. Under this pressure, contradictory evidence can be swept under the carpet with rationalizations about why such evidence may safely be ignored. And, of course, some experiments will simply have been done wrong, having undetected sources of bias (e.g., contaminated materials).
Another reason for the lack of reproducibility is unsound statistical analysis. Causes range from scientifically improper selection of data (throwing out data that do not support the claim) to a weak understanding of even the simplest statistical methods. As an example of the latter, Nieuwenhuis et al.30 reviewed 513 behavioral, systems, and cognitive neuroscience articles in five top-ranking journals, looking for articles that specifically compared two experimental effects to see if they differed significantly. Out of the 157 articles that focused on such a comparison, they found that 78 used the correct statistical procedure of testing whether the difference of the two effects is significantly different from 0; 79 used a blatantly incorrect procedure. The incorrect procedure was to separately test each experimental effect against a null hypothesis of no effect and, if one effect was significantly different from zero but the other was not, to declare that there was a significant difference between the effects. Of course, this can happen even when the two effects are essentially identical (e.g., if the p-value for one is 0.06 and the p-value for the other is 0.05).
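The severity of the fallacy described by Nieuwenhuis et al. can be made concrete with a small simulation. The sketch below is illustrative only: the effect size, sample size, and use of t-tests are assumptions chosen for the illustration, not drawn from the articles they reviewed. Two experimental effects are generated to be truly identical, yet the incorrect "one is significant, the other is not" procedure declares a difference a substantial fraction of the time, while the correct procedure of testing the difference directly errs at roughly the nominal 5 percent rate.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, trials = 20, 10_000
true_effect = 0.5  # identical true effect in both conditions

wrong_flags = 0    # incorrect procedure declares a "difference"
correct_flags = 0  # correct procedure declares a difference

for _ in range(trials):
    a = rng.normal(true_effect, 1.0, n)
    b = rng.normal(true_effect, 1.0, n)
    p_a = stats.ttest_1samp(a, 0.0).pvalue
    p_b = stats.ttest_1samp(b, 0.0).pvalue
    # Incorrect: declare a difference when exactly one effect is "significant"
    if (p_a < 0.05) != (p_b < 0.05):
        wrong_flags += 1
    # Correct: test whether the difference of the two effects is nonzero
    if stats.ttest_ind(a, b).pvalue < 0.05:
        correct_flags += 1

print(wrong_flags / trials)    # well above 0.05 despite identical effects
print(correct_flags / trials)  # close to the nominal 0.05 error rate
```

Under these assumed parameters the incorrect procedure flags a spurious difference in a large share of simulated experiments, which is the essence of the error the survey documents.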
Ensuring sound statistical analysis has long been a problem for science, and this problem is only getting worse owing to the increasingly massive data sets available today, which allow ever more thorough exploration of the data for “artifacts.” These artifacts can be exciting new science, or they can simply arise from noise in the data. The worry these days is that most often it is the latter.
In a recent talk about the drug discovery process (which the committee chooses not to cite in order to avoid embarrassing the presenter), the following numbers were given in illustration:
29 Florian Prinz, Thomas Schlange, and Khusru Asadullah, 2011, Believe it or not: How much can we rely on published data on potential drug targets? Nature Reviews Drug Discovery 10:712.
30 Sander Nieuwenhuis, Birte U. Forstmann, and Eric-Jan Wagenmakers, 2011, Erroneous analyses of interactions in neuroscience: A problem of significance. Nature Neuroscience 14:1105-1107.
• Ten thousand relevant compounds were screened for biological activity.
• Five hundred passed the initial screen and were subjected to in vitro experiments.
• Twenty-five passed this screening and were studied in Phase I animal trials.
• One passed this screening and was studied in an expensive Phase II human trial.
These numbers are completely compatible with the presence of nothing but noise, assuming the screening was done based on statistical significance at the 0.05 level, as is common. (Even if none of the 10,000 compounds has any effect, roughly 5 percent of the compounds would appear to have an effect at the first screening; roughly 5 percent of the 500 screened compounds would appear to have an effect at the second screening, etc.)
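The arithmetic behind this observation can be checked directly. In the hypothetical sketch below, every one of the 10,000 compounds is pure noise, and each screening stage passes a compound with probability 0.05 (the nominal significance level); the stage names and counts mirror the numbers reported in the talk, but the simulation itself is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.05          # significance level used at each screen
n_compounds = 10_000  # all assumed to have no real effect

# Each screen "passes" a null compound with probability alpha
survivors = n_compounds
observed = [survivors]
for stage in range(3):
    survivors = rng.binomial(survivors, alpha)
    observed.append(survivors)

# Expected counts if nothing but noise is present
expected = [n_compounds * alpha**k for k in range(4)]
print(observed)  # one random realization, e.g. near 10000, 500, 25, 1
print(expected)  # 10000, 500, 25, 1.25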
This problem, often called the “multiplicity” or “multiple testing” problem, is of central importance not only in drug development, but also in microarray and other bioinformatic analyses, syndromic surveillance, high-energy physics, high-throughput screening, subgroup analysis, and indeed any area of science facing an inundation of data, as most now are. The section of Chapter 2 on high-dimensional data describes the major advances under way in statistics and mathematics to meet the challenge of multiplicity and to help restore the reproducibility of science.
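One standard statistical remedy for multiplicity is to control the false discovery rate rather than testing each hypothesis separately at the 0.05 level. The sketch below implements the Benjamini-Hochberg step-up procedure on synthetic p-values; the particular mixture of null tests and genuine effects is an assumption chosen purely for illustration.

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Return a boolean mask of discoveries, controlling the FDR at level q."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    ranked = p[order]
    # Find the largest k with p_(k) <= (k/m) * q
    below = ranked <= (np.arange(1, m + 1) / m) * q
    keep = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        keep[order[: k + 1]] = True  # reject all hypotheses up to rank k
    return keep

rng = np.random.default_rng(2)
# 950 null tests (uniform p-values) plus 50 genuine effects (tiny p-values)
p_null = rng.uniform(size=950)
p_real = rng.uniform(0.0, 0.001, size=50)
pvals = np.concatenate([p_null, p_real])

naive = pvals < 0.05           # uncorrected: ~5% of the 950 nulls slip through
bh = benjamini_hochberg(pvals)  # corrected: nearly all false alarms removed
print(naive.sum(), bh.sum())
```

In this synthetic setting the uncorrected screen admits several dozen pure-noise “discoveries,” while the Benjamini-Hochberg procedure retains the genuine effects and admits very few of the nulls, which is precisely the behavior needed to keep noise-driven artifacts out of the literature.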