Appendix D

Examples of the Mathematical Sciences Support of Science and Engineering

**ILLUSTRATIVE DEMANDS FROM ASTRONOMY AND PHYSICS**

The mathematical sciences continue to encounter productive challenges from physics, and there may be new opportunities in the coming years. For example, there are still open mathematical problems stemming from general relativity and from the mathematical descriptions of black holes and their rotation. The Laser Interferometer Gravitational-Wave Observatory (LIGO) project might produce data to stimulate these directions. In another direction, recent mathematical advances in understanding convergence properties of the Boltzmann equation might open the door to progress on some fundamental issues.

For some specific demands, the committee examined the 2003 NRC report *The Sun to the Earth—and Beyond: A Decadal Research Strategy in Solar and Space Physics*. This report poses several major challenges for that field that in turn pose associated challenges for the mathematical sciences. For example,

Challenge 1: Understanding the structure and dynamics of the Sun’s interior, the generation of solar magnetic fields, the origin of the solar cycle, the causes of solar activity, and the structure and dynamics of the corona.^{1}

This challenge will require advances in multiscale methods and complex simulations involving turbulence.

______________________

^{1} National Research Council, 2003. *The Sun to the Earth—and Beyond: A Decadal Research Strategy in Solar and Space Physics*. The National Academies Press, Washington, D.C., p. 2.

Later, that same report identified the following additional challenges that can only be addressed through related advances in the mathematical sciences:

In the coming decade, the deployment of clusters of satellites and large arrays of ground-based instruments will provide a wealth of data over a very broad range of spatial scales. Theory and computational models will play a central role, hand in hand with data analysis, in integrating these data into first-principles models of plasma behavior. . . . The Coupling Complexity research initiative will address multiprocess coupling, nonlinearity, and multiscale and multiregional feedback in space physics. The program advocates both the development of coupled global models and the synergistic investigation of well-chosen, distinct theoretical problems. For major advances to be made in understanding coupling complexity in space physics, sophisticated computational tools, fundamental theoretical analysis, and state-of-the-art data analysis must all be integrated under a single umbrella program.^{2}

The coming decade will see the availability of large space physics databases that will have to be integrated into physics-based numerical models. . . . The solar and space physics community has not until recently had to address the issue of data assimilation as seriously as have the meteorologists. However, this situation is changing rapidly, particularly in the ionospheric arena.^{3}

______________________

^{2} Ibid., pp. 64-66.

^{3} Ibid., pp. 89-90.

Another example comes from the 2008 NRC report *The Potential Impact of High-End Capability Computing on Four Illustrative Fields of Science and Engineering*. This report identified the major research challenges in four disparate fields and the subset that depend on advances in computing. Those advances in computing are innately tied to research in the mathematical sciences. In the case of astrophysics, the report identified the following essential needs:

• N-body codes. Required to investigate the dynamics of collisionless dark matter, or to study stellar or planetary dynamics. The mathematical model is a set of first-order ODEs for each particle, with acceleration computed from the gravitational interaction of each particle with all the others. Integrating particle orbits requires standard methods for ODEs, with variable time stepping for close encounters. For the gravitational acceleration (the major computational challenge), direct summation, tree algorithms, and grid-based methods are all used to compute the gravitational potential from Poisson’s equations.
• PIC codes. Required to study the dynamics of weakly collisional, dilute plasmas. The mathematical model consists of the relativistic equations of motion for particles, plus Maxwell’s equations for the electric and magnetic fields they induce (a set of coupled first-order PDEs). Standard techniques are based on particle-in-cell (PIC) algorithms, in which Maxwell’s equations are solved on a grid using finite-difference methods and the particle motion is calculated by standard ODE integrators.
• Fluid dynamics. Required for strongly collisional plasmas. The mathematical model comprises the standard equations of compressible fluid dynamics (the Euler equations, a set of hyperbolic PDEs), supplemented by Poisson’s equation for self-gravity (an elliptic PDE), Maxwell’s equation for magnetic fields (an additional set of hyperbolic PDEs), and the radiative transfer equation for photon or neutrino transport (a high-dimensional parabolic PDE). A wide variety of algorithms for fluid dynamics are used, including finite-difference, finite-volume, and operator-splitting methods on orthogonal grids, as well as particle methods that are unique to astrophysics—for example, SPH. To improve resolution across a broad range of length scales, grid-based methods often rely on static and adaptive mesh refinement (AMR). The AMR methods greatly increase the complexity of the algorithm, reduce the scalability, and complicate effective load-balancing yet are absolutely essential for some problems.
• Transport problems. Required to calculate the effect of transport of energy and momentum by photons or neutrinos in a plasma. The mathematical model is a parabolic PDE in seven dimensions. Both grid-based (characteristic) and particle-based (Monte Carlo) methods are used. The high dimensionality of the problem makes first-principles calculations difficult, and so simplifying assumptions (for example, frequency-independent transport, or the diffusion approximation) are usually required.
• Microphysics. Necessary to incorporate nuclear reactions, chemistry, and ionization/recombination reactions into fluid and plasma simulations. The mathematical model is a set of coupled nonlinear, stiff ODEs (or algebraic equations if steady-state abundances are assumed) representing the reaction network. Implicit methods are generally required if the ODEs are solved. Implicit finite-difference methods for integrating realistic networks with dozens of constituent species are extremely costly.^{4}

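As a rough illustration of the N-body item in the list above, the following Python sketch computes gravitational accelerations by direct summation and advances the particle ODEs with a fixed-step leapfrog (kick-drift-kick) integrator. It is a minimal sketch under stated assumptions, not the method of any production code: the units (`G = 1`), the function names, the two-body test problem, and the fixed time step are all choices made here for illustration, and the variable time stepping, tree algorithms, and grid-based methods mentioned in the quotation are not implemented.

```python
import math

G = 1.0  # gravitational constant in arbitrary code units (assumption)


def accelerations(pos, mass, softening=0.0):
    """Direct-summation acceleration on every particle; cost grows as N^2."""
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = sum(c * c for c in d) + softening ** 2
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            for k in range(3):
                acc[i][k] += G * mass[j] * d[k] * inv_r3
    return acc


def leapfrog_step(pos, vel, mass, dt):
    """One kick-drift-kick leapfrog step; returns the new (pos, vel)."""
    acc = accelerations(pos, mass)
    vel = [[v[k] + 0.5 * dt * a[k] for k in range(3)] for v, a in zip(vel, acc)]
    pos = [[p[k] + dt * v[k] for k in range(3)] for p, v in zip(pos, vel)]
    acc = accelerations(pos, mass)
    vel = [[v[k] + 0.5 * dt * a[k] for k in range(3)] for v, a in zip(vel, acc)]
    return pos, vel


def total_energy(pos, vel, mass):
    """Kinetic plus pairwise gravitational potential energy."""
    kinetic = sum(0.5 * m * sum(c * c for c in v) for m, v in zip(mass, vel))
    potential = 0.0
    for i in range(len(pos)):
        for j in range(i + 1, len(pos)):
            potential -= G * mass[i] * mass[j] / math.dist(pos[i], pos[j])
    return kinetic + potential


def simulate_circular_binary(steps=2000, dt=0.005):
    """Two unit masses on a circular orbit with separation 1 (toy problem)."""
    mass = [1.0, 1.0]
    pos = [[-0.5, 0.0, 0.0], [0.5, 0.0, 0.0]]
    speed = math.sqrt(0.5)  # balances gravity for this configuration
    vel = [[0.0, -speed, 0.0], [0.0, speed, 0.0]]
    e0 = total_energy(pos, vel, mass)
    for _ in range(steps):
        pos, vel = leapfrog_step(pos, vel, mass, dt)
    return e0, total_energy(pos, vel, mass), pos
```

Calling `simulate_circular_binary()` returns the initial energy, the final energy, and the final positions, so energy drift can be inspected directly. The double loop in `accelerations` is the quadratic cost that motivates the tree and grid-based alternatives named in the quotation.
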
______________________

^{4} NRC, 2008. *The Potential Impact of High-End Capability Computing on Four Illustrative Fields of Science and Engineering*. The National Academies Press, Washington, D.C., p. 31.

In its look at atmospheric sciences, that same report identified the following necessary computational advances:

[Advancing the state of atmospheric science research requires] the development of (1) scalable implementations of uniform-grid methods aimed at the very highest performance and (2) a new generation of local refinement methods and codes for atmospheric, oceanic, and land modeling. . . . The propagation of uncertainty through a coupled model is particularly problematic, because nonlinear interactions can amplify the forced response of a system. In addition, it is often the case that we are interested in bounding the uncertainties in predictions of extreme, and hence rare, events, requiring a rather different set of statistical tools than those to study means and variances of large ensembles. New systematic theories about multiscale, multiphysics couplings are needed to quantify relationships better. This will be important as atmospheric modeling results are coupled with economic and impact models. Building a better understanding of coupling and the quantification of uncertainties through coupled systems is necessary groundwork for supporting the decisions that will be made based on modeling results.^{5}

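One way to see why rare events call for different statistical tools than ensemble means is to ask how large a plain Monte Carlo ensemble must be to estimate a small probability. The Python sketch below uses only the standard binomial variance of a sample fraction; the function names are hypothetical, and the assumption of independent ensemble members is an idealization that real coupled-model ensembles may not satisfy.

```python
import math


def mc_relative_error(p, n):
    """Relative standard error of the sample fraction estimating probability p
    from n independent samples: sqrt((1 - p) / (n * p))."""
    return math.sqrt((1.0 - p) / (n * p))


def samples_needed(p, rel_err):
    """Smallest ensemble size whose relative standard error is at most rel_err."""
    return math.ceil((1.0 - p) / (p * rel_err ** 2))
```

Under these assumptions, estimating an event of probability 0.5 to 10 percent relative error needs on the order of a hundred members, while an event of probability 1e-4 needs on the order of a million. That gap is one motivation for specialized approaches to extremes rather than ever larger ensembles.
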
**ILLUSTRATIVE DEMANDS FROM ENGINEERING**

The 2008 National Academy of Engineering report *Grand Challenges for Engineering*^{6} identified 14 major challenges. Following is a list of 11 of those challenges that depend on corresponding advances in the mathematical sciences, along with thoughts about the particular mathematical science research that will be needed.

• Make solar energy economical. This will require multiscale modeling of heterogeneous materials and better algorithms for modeling quantum-scale behaviors, and the mathematical sciences will contribute to both.
• Provide energy from fusion. This will require better methods for simulating multiscale, complex behavior, including turbulent flows, a topic challenging both mathematical scientists and domain scientists and engineers.
• Develop carbon sequestration methods. This will require better models of porous media and methods for modeling very large-scale heterogeneous and multiphysics systems.
• Advance health informatics. Requires statistical research to enable more precise and tailored inferences from increasing amounts of data.
• Engineer better medicines. Requires tools for bioinformatics and simulation tools for modeling molecular interactions and cellular machinery.
• Reverse-engineer the brain. Requires tools for network analysis, models of cognition and learning, and signal analysis.
• Prevent nuclear terror. Network analysis can contribute to this, as can cryptology, data mining, and other intelligence tools.
• Secure cyberspace. Requires advances in cryptography and theoretical computer science.
• Enhance virtual reality. Requires improved algorithms for scene rendering and simulation.
• Advance personalized learning. Requires advances in machine learning.
• Engineer the tools of scientific discovery. Requires advances to enable multiscale simulations, including improved algorithms, and improved methods of data analysis.

______________________

^{5} Ibid., p. 58.

^{6} From https://www.coursera.org/landing/hub.php.

Engineering in general is dependent on the mathematical sciences, and that dependency is strengthening as engineers push toward ever greater precision. As just one illustration of the pervasiveness of the mathematical sciences in engineering, note the following examples of major improvements in manufacturing that were provided at the 2011 annual meeting of the National Academy of Engineering by Lawrence Burns, retired vice president for R&D and strategic planning at General Motors Corporation:

1. Simultaneous engineering,
2. Design for manufacturing,
3. Math-based design and engineering (computer-aided design),
4. Six-sigma quality,
5. Supply chain management,
6. The Toyota production system, and
7. Life-cycle analysis.^{7}

It is striking that six of the seven items are inextricably linked with the mathematical sciences. While it is obvious that Item 3 depends on mathematical advances and Item 4 relies on statistical concepts and analyses, Items 1, 2, 5, and 7 are all dependent on advances in simulation capabilities that have enabled them to represent processes of ever-increasing complexity and fidelity. Research in the mathematical sciences is necessary in order to create those capabilities.

______________________

^{7} From http://www.nae.edu/Activities/Events/35831/51040/53248.aspx.

**ILLUSTRATIVE DEMANDS FROM NETWORKING AND INFORMATION TECHNOLOGY**

The 2010 report from the President’s Council of Advisors on Science and Technology (PCAST) *Designing a Digital Future: Federally Funded Research and Development in Networking and Information Technology*^{8} identified four major recommendations for “initiatives and investments” in networking and information technology (NIT) to “achieve America’s priorities and advance key NIT research frontiers.” Three of those initiatives are dependent on ongoing research in areas of the mathematical sciences. First, the mathematical sciences are fundamental to advances in simulation and modeling:

The federal government should invest in a national, long-term, multi-agency, multi-faceted research initiative on NIT for energy and transportation. . . . Current research in the computer simulation of physical systems should be expanded to include the simulation and modeling of proposed energy-saving technologies, as well as advances in the basic techniques of simulation and modeling.^{9}

Second, the mathematical sciences underpin cryptography and provide tools for systems analyses:

The federal government should invest in a national, long-term, multi-agency research initiative on NIT that assures both the security and the robustness of cyber-infrastructure
—to discover more effective ways to build trustworthy computing and communications systems,
—to continue to develop new NIT defense mechanisms for today’s infrastructure, and most importantly,
—to develop fundamentally new approaches for the design of the underlying architecture of our cyber-infrastructure so that it can be made truly resilient to cyber-attack, natural disaster, and inadvertent failure.^{10}

Third, the mathematical sciences are strongly entwined with R&D for privacy protection, large-scale data analysis, high-performance computing, and algorithms:

The federal government must increase investment in those fundamental NIT research frontiers that will accelerate progress across a broad range of priorities . . . [including] research program on the fundamentals of privacy protection and protected disclosure of confidential data . . . fundamental research in data collection, storage, management, and automated large-scale analysis based on modeling and machine learning . . . [and] continued investment in important core areas such as high performance computing, scalable systems and networking, software creation and evolution, and algorithms.^{11}

______________________

^{8} PCAST, 2010, *Report to the President and Congress: Designing a Digital Future: Federally Funded Research and Development in Networking and Information Technology*, p. xi. Available at http://www.whitehouse.gov/sites/default/files/microsites/ostp/pcast-nitrd-report-2010.pdf.

^{9} Ibid., p. xi.

^{10} Ibid., p. xii.

The report goes on to point out that five broad themes cut across its recommendations, including the need for capabilities to exploit ever-increasing amounts of data, to improve cybersecurity, and to better ensure privacy. As noted above, progress on these fronts will require strong mathematical sciences research.

More generally, the mathematical sciences play a critical role in any science or engineering field that involves network structures. In the early years of U.S. telecommunications, very few networks existed, and each was under the control of a single entity (such as AT&T). For more than 50 years in this environment, relatively simple mathematical models of call traffic were extremely useful for network management and planning. The world is completely different today: It is full of diverse and overlapping networks, including the Internet, the World Wide Web, wireless networks operated by multiple service providers, and social networks, as well as networks arising in scientific and engineering applications, such as networks that describe intra- and intercellular processes in biology. Modern technologies built around the Internet are consumers of mathematics: for example, many innovations at Google, such as search, learning, and trend discovery, are based on the mathematical sciences.

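The text does not say which call-traffic models it has in mind. One classical candidate, offered here only as an assumption, is the Erlang B formula, which gives the probability that a call is blocked when a given offered load (in erlangs) arrives at a fixed number of circuits. The Python sketch below uses the standard numerically stable recurrence; the function names are hypothetical.

```python
def erlang_b(offered_load, circuits):
    """Blocking probability for `offered_load` erlangs offered to `circuits`
    lines, using B(0) = 1 and B(n) = A*B(n-1) / (n + A*B(n-1))."""
    if offered_load < 0 or circuits < 0:
        raise ValueError("offered_load and circuits must be non-negative")
    blocking = 1.0  # with zero circuits every call is blocked
    for n in range(1, circuits + 1):
        blocking = offered_load * blocking / (n + offered_load * blocking)
    return blocking


def circuits_for_target(offered_load, max_blocking):
    """Smallest number of circuits that keeps blocking at or below max_blocking."""
    if not 0.0 < max_blocking <= 1.0:
        raise ValueError("max_blocking must be in (0, 1]")
    n = 0
    while erlang_b(offered_load, n) > max_blocking:
        n += 1
    return n
```

For instance, `erlang_b(2.0, 2)` evaluates to 0.4 up to rounding, and `circuits_for_target(10.0, 0.01)` returns the number of lines a planner would provision for 10 erlangs at a 1 percent blocking target. A model this small was enough for capacity planning when a single operator controlled the whole network.
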
In settings ranging from the Internet to transportation networks to global financial markets, interactions happen in the context of a complex network.^{12} The most striking feature of these networks is their size and global reach: They are built and operated by people and agents of diverse goals and interests. Much of today’s technology depends on our ability to successfully build and maintain systems used by such a diverse set of users, ensuring that participants cooperate despite their diverse goals and interests. Such large and decentralized networks provide amazing new opportunities for cooperation, but they also present large challenges. Despite the enormous amount of data collected by and about today’s networks, fundamental mathematical science questions about the nature, structure, evolution, and security of networks remain that are of great interest for the government, for innovation-driven businesses, and for the general public.

______________________

^{11} Ibid., p. xiii.

^{12} The remainder of this section draws from pages 5, 6, and 10 of Institute for Pure and Applied Mathematics, 2012, *Report from the Workshop on Future Directions in Mathematics*. IPAM, Los Angeles.

For example, the sheer size of these networks makes it difficult to study them. The interconnectedness of financial networks makes financial interactions easy, but the financial crisis of 2008 provides a good example of the dangers of that interconnectivity. Understanding to what extent networks are susceptible to such cascading failures is an important area of study. From a very different domain, the network structure of the Web is what makes it most useful: Links provide the Web with a network structure and help us navigate it. But linking is also an endorsement of some sort. This network of implicit endorsements is what allows search companies such as Google to effectively find useful pages. Understanding how to harness such a large network of recommendations continues to be a challenge.

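The idea of treating links as implicit endorsements can be made concrete with a link-analysis score. The Python sketch below uses a PageRank-style power iteration, which is an assumption made for illustration since the text names no specific algorithm. The damping value of 0.85, the uniform handling of pages without outgoing links, and the toy graph in the usage note are all illustrative choices.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Score pages by the endorsements they receive.

    `links` maps each page to the list of pages it links to.
    Returns a dict of scores that sum to 1.
    """
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Score held by pages with no outgoing links is spread uniformly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            for q in targets:
                # Each page splits its current score among the pages it endorses.
                new[q] += damping * rank[p] / len(targets)
        change = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if change < tol:
            break
    return rank
```

With the toy graph `{"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}`, page `c` receives the most endorsements and ends up with the highest score, followed by `a`, the page that `c` itself endorses. Real Web graphs have billions of nodes, which is where the harnessing challenge described above begins.
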
Game theory provides a mathematical framework that helps us understand the expected effects of interactions and to develop good design principles for building and operating such networks. In this framework we think of each participant as a player in a noncooperative game wherein each player selects a strategy, selfishly trying to optimize his or her own objective function. The outcome of the game for each participant depends, not only on his own strategy, but also on the strategies chosen by all other players. This emerging area is combining tools from many mathematical areas, including game theory, optimization, and theoretical computer science. The emergence of the Web and online social systems also gives graph theory an important new application domain.

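The noncooperative setting described above can be sketched for the smallest case, a two-player game given by payoff tables. The Python function below lists the pure-strategy Nash equilibria, meaning the strategy pairs from which neither player can gain by changing strategy alone. The Nash equilibrium concept and the prisoner's dilemma payoffs in the usage note are standard illustrations chosen here, not examples taken from the text.

```python
from itertools import product


def pure_nash_equilibria(payoff_a, payoff_b):
    """Return all (row, col) pairs where neither player gains by deviating alone.

    payoff_a[r][c] is the row player's payoff and payoff_b[r][c] the column
    player's payoff when the row player picks r and the column player picks c.
    """
    rows, cols = len(payoff_a), len(payoff_a[0])
    equilibria = []
    for r, c in product(range(rows), range(cols)):
        # The row player cannot do better by switching rows against column c.
        row_best = all(payoff_a[r][c] >= payoff_a[alt][c] for alt in range(rows))
        # The column player cannot do better by switching columns against row r.
        col_best = all(payoff_b[r][c] >= payoff_b[r][alt] for alt in range(cols))
        if row_best and col_best:
            equilibria.append((r, c))
    return equilibria
```

For a prisoner's dilemma with strategies 0 (cooperate) and 1 (defect), `payoff_a = [[-1, -3], [0, -2]]` and `payoff_b = [[-1, 0], [-3, -2]]`, the only equilibrium returned is `(1, 1)`: both players defect even though mutual cooperation would leave both better off. That gap between selfish and cooperative outcomes is one reason design principles for large decentralized networks are hard to get right.
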
**ILLUSTRATIVE DEMANDS FROM BIOLOGY**

The 2009 NRC report *A New Biology for the 21st Century*^{13} had a great deal to say about emerging opportunities that will rely on advances in the mathematical sciences. Following are selected quotations:

______________________

^{13} National Research Council, 2009, *A New Biology for the 21st Century*. The National Academies Press, Washington, D.C.

• Fundamental understanding will require . . . computational modeling of their growth and development at the molecular and cellular levels. . . . The New Biology––integrating life science research with physical science, engineering, computational science, and mathematics––will enable the development of models of plant growth in cellular and molecular detail. Such predictive models, combined with a comprehensive approach to cataloguing and appreciating plant biodiversity and the evolutionary relationships among plants, will allow scientific plant breeding of a new type, in which genetic changes can be targeted in a way that will predictably result in novel crops and crops adapted to their conditions of growth. . . . New quantitative methods—the methods of the New Biology—are being developed that use next-generation DNA sequencing to identify the differences in the genomes of parental varieties, and to identify which genes of the parents are associated with particular desired traits through quantitative trait mapping.^{14}
• Fundamental advances in knowledge and a new generation of tools and technologies are needed to understand how ecosystems function, measure ecosystem services, allow restoration of damaged ecosystems, and minimize harmful impacts of human activities and climate change. What is needed is the New Biology, combining the knowledge base of ecology with those of organismal biology, evolutionary and comparative biology, climatology, hydrology, soil science, and environmental, civil, and systems engineering, through the unifying languages of mathematics, modeling, and computational science. This integration has the potential to generate breakthroughs in our ability to monitor ecosystem function, identify ecosystems at risk, and develop effective interventions to protect and restore ecosystem function.^{15}
• Recent advances are enabling biomedical researchers to begin to study humans more comprehensively, as individuals whose health is determined by the interactions between these complex structural and metabolic networks. On the path from genotype to phenotype, each network is interlocked with many others through intricate interfaces, such as feedback loops. Study of the complex networks that monitor, report, and react to changes in human health is an area of biology that is poised for exponential development. . . . Computational and modeling approaches are beginning to allow analysis of these complex systems, with the ultimate goal of predicting how variations in individual components affect the function of the overall system. Many of the pieces are identified, and some circuits and interactions have been described, but true understanding is still well beyond reach. Combining fundamental knowledge with physical and computational analysis, modeling and engineering, in other words, the New Biology approach, is going to be the only way to bring understanding of these complex networks to a useful level of predictability.^{16}
• Such complex events as how embryos develop or how cells of the immune system differentiate . . . must be viewed from a global yet detailed perspective because they are composed of a collection of molecular mechanisms that include junctions that interconnect vast networks of genes. It is essential to take a broader view and analyze entire gene regulatory networks, and the circuitry of events underlying complex biological systems. . . . Analysis of developing and differentiating systems at a network level will be critical for understanding complex events of how tissues and organs are assembled. . . . Similarly, networks of proteins interact at a biochemical level to form complex metabolic machines that produce distinct cellular products. Understanding these and other complex networks from a holistic perspective offers the possibility of diagnosing human diseases that arise from subtle changes in network components.^{17}
• Perhaps the most complex, fascinating, and least understood networks involve circuits of nerve cells that act in a coordinated fashion to produce learning, memory, movement, and cognition. . . . Understanding networks will require increasingly sophisticated, quantitative technologies to measure intermediates and output, which in turn may demand conceptual and technical advances in mathematical and computational approaches to the study of networks.^{18}

______________________

^{14} Ibid., pp. 22-23.

^{15} Ibid., p. 25.

^{16} Ibid., pp. 34-35.

Later, the report discusses some specific ways in which the mathematical sciences are enabling the New Biology:

[M]athematical underpinnings of the field . . . embrace probabilistic and combinatorial methods. Combinatorial algorithms are essential for solving the puzzles of genome assembly, sequence alignment, and phylogeny construction based on molecular data. Probabilistic models such as Hidden Markov models and Bayesian networks are now applied to gene finding and comparative genomics. Algorithms from statistics and machine learning are applied to genome-wide association studies and to problems of classification, clustering and feature selection arising in the analysis of large-scale gene expression data.^{19}

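To give a flavor of how a hidden Markov model can be applied to gene finding, the Python sketch below implements the Viterbi algorithm, which recovers the most probable sequence of hidden states for an observed sequence. The two-state "coding" versus "noncoding" model in the usage note, with its made-up probabilities, is purely a toy assumption; real gene finders use far richer models. The function assumes a non-empty observation sequence and strictly positive probabilities, since it works with logarithms.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most probable hidden-state path and its log-probability."""
    # best[s] is the log-probability of the best path so far that ends in s.
    best = {s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
            for s in states}
    back = []  # back[t][s] is the predecessor of s on the best path
    for obs in observations[1:]:
        prev_best, best, pointers = best, {}, {}
        for s in states:
            prev = max(states, key=lambda r: prev_best[r] + math.log(trans_p[r][s]))
            best[s] = (prev_best[prev] + math.log(trans_p[prev][s])
                       + math.log(emit_p[s][obs]))
            pointers[s] = prev
        back.append(pointers)
    # Trace the best path backward from the most probable final state.
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]
```

As a usage example, take states `"N"` (noncoding, AT-rich) and `"C"` (coding, GC-rich), each with start probability 0.5, probability 0.9 of staying in the same state, and emission probability 0.4 for each favored base and 0.1 for each of the others. For the observed string `"ATATATGCGCGC"` the function labels the first six positions `N` and the last six `C`, placing the single state switch where the base composition changes.
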
As another illustration, the 2008 NRC report *The Potential Impact of High-End Capability Computing on Four Illustrative Fields of Science and Engineering*^{20} identified essential challenges for evolutionary biology, a number of which are strongly mathematical. Following are selected quotations:

______________________

^{17} Ibid., p. 35.

^{18} Ibid.

^{19} Ibid., p. 62.

^{20} NRC, 2008, *The Potential Impact of High-End Capability Computing on Four Illustrative Fields of Science and Engineering*. The National Academies Press, Washington, D.C.

• Standard phylogenetic analysis comparing the possible evolutionary relationships between two species can be done using the method of maximum parsimony, which assumes that the simplest answer is the best one, or using a model-based approach. The former entails counting character change on alternative phylogenetic trees in order to find the tree that minimizes the number of character transformations. The latter incorporates specific models of character change and uses a minimization criterion to choose among the sampled trees, which often involves finding the tree with the highest likelihood. Counting, or optimizing, change on a tree, whether in a parsimony or model-based framework, is a computationally efficient problem. But sampling all possible trees to find the optimal solution scales precipitously with the number of taxa (or sequences) being analyzed. . . . Thus, it has long been appreciated that finding an exact solution to a phylogenetic problem of even moderate size is NP complete. . . . Accordingly, numerous algorithms have been introduced to search heuristically across tree space and are widely employed by biological investigators using platforms that range from desktop workstations to supercomputers. These algorithms include methods [such as] simulated annealing, genetic (evolutionary) algorithmic searches, and Bayesian Markov chain Monte Carlo (MCMC) approaches.^{21}
• The accumulation of vast amounts of DNA sequence data and our expanding understanding of molecular evolution have led to the development of increasingly complex models of molecular evolutionary change. As a consequence, the enlarged parameter space required by these molecular models has increased the computational challenges confronting phylogeneticists, particularly in the case of data sets that combine numerous genes, each with their own molecular dynamics. . . . Phylogeneticists are more and more concerned about having statistically sound measures of estimating branch support. In model-based approaches, in particular, such procedures are computationally intensive, and the model structure scales significantly with the size of the number of taxa and the heterogeneity of the data. In addition, more attention is being paid to statistical models of molecular evolution.^{22}
• A serious major computational challenge is to generate qualitative and quantitative models of development as a necessary prelude to applying sophisticated evolutionary models to understand how developmental processes evolve. Developmental biologists are just beginning to create the algorithms they need for such analyses, based on relatively simple reaction-rate equations, but progress is rapid. . . . Another important breakthrough in the field is the analysis of gene regulatory networks [which] describe the pathways and interactions that guide development.^{23}
• Modern sequencing technologies routinely yield relatively short fragments of a genomic sequence, from 25 to 1,000 [base pairs]. Whole genomes range in size from the typical microbial sequence, which has millions of base pairs, to plant and animal sequences, which often consist of billions of base pairs. Additionally, ‘metagenomic’ sequencing from environmental samples often mixes fragments from dozens to hundreds of different species and/or ecotypes. The challenge is to take these short subsequences and assemble them to reconstruct the genomes of species and/or ecosystems. . . . While the fragment assembly problem is NP-complete, heuristic algorithms have produced high-quality reconstructions of hundreds of genomes. The recent trend is toward methods of sequencing that can inexpensively generate large numbers (hundreds of millions) of ultrashort sequences (25-50 bp). Technical and algorithmic challenges include the following:
— Parallelization of all-against-all fragment alignment computations.
— Development of methods to traverse the resulting graphs of fragment alignments to maximize some feature of the assembly path.
— Heuristic pruning of the fragment alignment graph to eliminate experimentally inconsistent subpaths.
— Signal processing of raw sequencing data to produce higher quality fragment sequences and better characterization of their error profiles.
— Development of new representations of the sequence-assembly problem—for example, string graphs that represent data and assembly in terms of words within the dataset.
— Alignment of error-prone resequencing data from a population of individuals against a reference genome to identify and characterize individual variations in the face of noisy data.
— Demonstration that the new methodologies are feasible, by producing and analyzing suites of simulated data sets.^{24}

______________________

^{21} Ibid., p. 65.

^{22} Ibid.

^{23} Ibid., p. 72.

• O
nce we have a reconstructed genomic or metagenomic sequence, a
further challenge is to identify and characterize its functional elements:
protein-coding genes; noncoding genes, including a variety of small
RNAs; and regulatory elements that control gene expression, splicing,
and chromatin structure. Algorithms to identify these functional regions
use both statistical signals intrinsic to the sequence that are character-
istic of a particular type of functional region and comparative analyses
of closely and/or distantly related sequences. Signal-detection methods
have focused on hidden Markov models and variations on them. Sec-
ondary structure calculations take advantage of stochastic, context-free
grammars to represent long-range structural correlations.^{25}
• Comparative methods require the development of efficient alignment
methods and sophisticated statistical models for sequence evolution
that are often intended to quantitatively model the likelihood of a
detected alignment given a specific model of evolution. While earlier
models treated each position independently, as large data sets became
available the trend is now to incorporate correlations between sites. To
compare dozens of related sequences, phylogenetic methods must be
integrated with signal detection.^{26}
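
The first two algorithmic challenges listed above (all-against-all fragment alignment and traversal of the resulting graph) can be illustrated with a toy sketch. The Python below is only a minimal illustration under strong assumptions: error-free reads, exact suffix-prefix matching in place of true alignment, and a greedy merge heuristic. The function names (`overlap`, `overlap_graph`, `greedy_assemble`), the `min_len` parameter, and the sample reads are hypothetical and do not come from the report quoted above; production assemblers are far more elaborate.

```python
from itertools import permutations


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` equal to a prefix of `b`.

    Returns 0 when the longest such match is shorter than `min_len`.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def overlap_graph(reads, min_len: int = 3):
    """All-against-all comparison: map each ordered pair to its overlap length."""
    graph = {}
    for a, b in permutations(reads, 2):
        k = overlap(a, b, min_len)
        if k > 0:
            graph[(a, b)] = k
    return graph


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair with the largest overlap.

    This is a heuristic: it is not guaranteed to find the shortest
    common superstring, which is the NP-complete part of the problem.
    """
    reads = list(reads)
    while len(reads) > 1:
        graph = overlap_graph(reads, min_len)
        if not graph:
            break  # no remaining overlaps; return the separate contigs
        (a, b), k = max(graph.items(), key=lambda item: item[1])
        reads.remove(a)
        reads.remove(b)
        reads.append(a + b[k:])  # glue b onto a, skipping the shared part
    return reads


if __name__ == "__main__":
    # Hypothetical error-free reads drawn from one short sequence.
    print(greedy_assemble(["ATGGCGT", "GCGTGCA", "TGCAATC"]))
```

The quadratic cost of `overlap_graph` is what motivates the call for parallelization: with hundreds of millions of reads, a naive all-against-all comparison is infeasible on a single machine.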

**ILLUSTRATIVE DEMANDS FROM MEDICINE**

There is increasing awareness in medicine of the benefits from collabo-
ration with mathematical scientists, including in areas where past focus has
^{24} Ibid., pp. 77-78.
^{25} Ibid.
^{26} Ibid.

tended to be clinical and to have little reference to mathematics. In areas
ranging from medical imaging, drug discovery, discovery of genes linked
to hereditary diseases, and personalized medicine to treatment validation,
cost-benefit analysis, and robotic surgery, the mathematical sciences are
ever-present. The role of the mathematical sciences in medicine has become
so pervasive that only a few examples can be given here.
One example is a program of the Office of Physical Sciences Oncology,^{27}
funded by the National Cancer Institute, to support innovative approaches
to understanding and controlling cancer, which is explicitly seeking teams
that may include mathematical scientists. Some of the associated mathemat-
ical science questions arise in creation of sophisticated models of cancer,
optimization of cancer chemotherapy regimens, and rapid incorporation of
data into models and treatment. (As shown in Appendix C, the National
Institutes of Health overall allocates some $90 million per year for math-
ematical sciences research.) As another example, looking just at the Febru-
ary 2011 issue of Physical Biology, one sees articles whose titles contain
the following mathematics-related phrases: “stochastic dynamics,” “micro-
mechanics,” “an evolutionary game theoretical view,” and “an Ising model
of cancer and beyond.” In the same vein, a research highlight published in
Cancer Discovery in July 2011 describes development of “a statistical ap-
proach that maps out the order in which . . . [cancer-related] abnormalities
arise.” It is increasingly common to find such cross-fertilization.

**RESTORING REPRODUCIBILITY IN SCIENCE**

A recent article in the Wall Street Journal opens with the following
story:
Two years ago, a group of Boston researchers published a study describ-
ing how they had destroyed cancer tumors by targeting a protein called
STK33. Scientists at biotechnology firm Amgen Inc. quickly pounced on
the idea and assigned two dozen researchers to try to repeat the experiment
with a goal of turning the findings into a drug. It proved to be a waste of
time and money. After six months of intensive lab work, Amgen found it
couldn’t replicate the results and scrapped the project. ‘I was disappointed
but not surprised,’ says Glenn Begley, vice president of research at Amgen
of Thousand Oaks, Calif. ‘More often than not, we are unable to reproduce
findings’ published by researchers in journals.^{28}
^{27} The Office of Physical Sciences in Oncology; http://physics.cancer.gov/about/summary.asp.
^{28} Gautam Naik, Scientists’ elusive goal: Reproducing study results. Wall Street Journal,
December 2, 2011.

A related paper^{29} reports on a study done by Bayer Healthcare that
examined 67 published research articles for which Bayer had attempted
to replicate the findings in-house. Fewer than one-quarter were viewed as
having been essentially replicated, while more than two-thirds exhibited
major inconsistencies, in most cases leading Bayer to terminate the projects.
One reason for this lack of reproducibility is undoubtedly the pres-
sure to publish and find positive results. Under this pressure, contradictory
evidence can be swept under the carpet with rationalizations about why
such evidence may safely be ignored. And, of course, some experiments
will simply have been done wrong, having undetected sources of bias (e.g.,
contaminated materials).
Another reason for the lack of reproducibility is unsound statistical
analysis. Causes range from scientifically improper selection of data (throw-
ing out data that do not support the claim) to a weak understanding of even
the simplest statistical methods. As an example of the latter, Nieuwenhuis et
al.^{30} reviewed 513 behavioral, systems, and cognitive neuroscience articles
in five top-ranking journals, looking for articles that specifically compared
two experimental effects to see if they differed significantly. Out of the 157
articles that focused on such a comparison, they found that 78 used the
correct statistical procedure of testing whether the difference of the two
effects is significantly different from 0; 79 used a blatantly incorrect pro-
cedure. The incorrect procedure was to separately test each experimental
effect against a null hypothesis of no effect and, if one effect was signifi-
cantly different from zero but the other was not, to declare that there was
a significant difference between the effects. Of course, this can happen even
when the two effects are essentially identical (e.g., if the p-value for one is
0.06 and the p-value for the other is 0.05).
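
The contrast between the two procedures can be made concrete with a small numerical sketch. The Python below assumes, purely for illustration, that each effect is summarized by an estimate and a standard error, that the two estimates are independent, and that a normal (z) approximation is adequate. The function names and the numbers in the example are hypothetical and are not taken from Nieuwenhuis et al.

```python
import math


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def compare_effects(est_a, se_a, est_b, se_b, alpha=0.05):
    """Contrast the flawed and the correct way to compare two effects."""
    # Flawed procedure: test each effect against zero separately.
    p_a = two_sided_p(est_a / se_a)
    p_b = two_sided_p(est_b / se_b)
    flawed = (p_a < alpha) != (p_b < alpha)  # "one significant, one not"

    # Correct procedure: test whether the difference itself differs from 0.
    # Assumes independent estimates, so the variances add.
    se_diff = math.hypot(se_a, se_b)
    p_diff = two_sided_p((est_a - est_b) / se_diff)

    return {
        "p_a": p_a,
        "p_b": p_b,
        "p_diff": p_diff,
        "flawed_claims_difference": flawed,
        "correct_claims_difference": p_diff < alpha,
    }


if __name__ == "__main__":
    # Two nearly identical hypothetical effects with equal standard errors.
    for key, value in compare_effects(2.0, 1.0, 1.8, 1.0).items():
        print(key, value)
```

With these illustrative inputs the separate tests give p-values of about 0.046 and 0.072, so the flawed procedure declares a difference, whereas the direct test of the difference gives a p-value of about 0.89 and offers no evidence that the effects differ.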
Ensuring sound statistical analysis has long been a problem for science,
and this problem is only getting worse owing to the increasingly massive
data sets available today, which allow increasingly thorough exploration of
the data for “artifacts.” These artifacts can be exciting new science or they
can simply arise from noise in the data. The worry these days is that most
often it is the latter.
In a recent talk about the drug discovery process (which the committee
chooses not to cite in order to avoid embarrassing the presenter), the
following numbers were given in illustration:
^{29} Florian Prinz, Thomas Schlange, and Khusru Asadullah, 2011, Believe it or not: How
much can we rely on published data on potential drug targets? Nature Reviews Drug
Discovery 10:712.
^{30} Sander Nieuwenhuis, Birte U. Forstmann, and Eric-Jan Wagenmakers, 2011, Erroneous
analyses of interactions in neuroscience: a problem of significance. Nature Neuroscience
14:1105-1107.

• Ten thousand relevant compounds were screened for biological
activity.
• Five hundred passed the initial screen and were subjected to in
vitro experiments.
• Twenty-five passed this screening and were studied in Phase I ani-
mal trials.
• One passed this screening and was studied in an expensive Phase II
human trial.
These numbers are completely compatible with the presence of nothing
but noise, assuming the screening was done based on statistical significance
at the 0.05 level, as is common. (Even if none of the 10,000 compounds
has any effect, roughly 5 percent of the compounds would appear to have
an effect at the first screening; roughly 5 percent of the 500 screened com-
pounds would appear to have an effect at the second screening, etc.)
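
The arithmetic in the parenthetical above can be checked directly. The sketch below is a minimal illustration that assumes every compound is truly inactive, that each screening stage is an independent test at the 0.05 level, and that a null p-value is uniform on [0, 1]. The function names and the random seed are hypothetical choices made for this example.

```python
import random


def expected_survivors(n_start: int, alpha: float, n_stages: int):
    """Expected number of pure-noise 'hits' remaining after each stage."""
    counts = [float(n_start)]
    for _ in range(n_stages):
        counts.append(counts[-1] * alpha)
    return counts


def simulate_null_funnel(n_start: int, alpha: float, n_stages: int, seed: int = 0):
    """Simulate one screening funnel in which no compound has any effect."""
    rng = random.Random(seed)
    counts = [n_start]
    for _ in range(n_stages):
        # Under the null hypothesis a compound passes with probability alpha.
        survivors = sum(rng.random() < alpha for _ in range(counts[-1]))
        counts.append(survivors)
    return counts


if __name__ == "__main__":
    print(expected_survivors(10_000, 0.05, 3))
    print(simulate_null_funnel(10_000, 0.05, 3))
```

Starting from 10,000 compounds, the expected counts after three stages are 500, 25, and 1.25, which matches the reported sequence of 500, 25, and 1. Simulated funnels fluctuate around these values from run to run.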
This problem, often called the “multiplicity” or “multiple testing”
problem, is of central importance not only in drug development, but also in
microarray and other bioinformatic analyses, syndromic surveillance, high-
energy physics, high-throughput screening, subgroup analysis, and indeed
any area of science facing an inundation of data (which most of them are).
The section of Chapter 2 on high-dimensional data indicates the major
advances happening in statistics and mathematics to meet the challenge of
multiplicity and to help restore the reproducibility of science.
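
To give a sense of scale for the multiplicity problem, the following sketch computes two standard quantities for m tests of true null hypotheses at level alpha: the expected number of false positives, and the probability of at least one false positive. It assumes the tests are independent, which real analyses often are not, and the function names are hypothetical.

```python
def expected_false_positives(m: int, alpha: float = 0.05) -> float:
    """Expected number of false positives among m true-null tests."""
    return m * alpha


def prob_any_false_positive(m: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive among m independent
    true-null tests, i.e. 1 - (1 - alpha)^m."""
    return 1.0 - (1.0 - alpha) ** m


if __name__ == "__main__":
    for m in (1, 20, 100, 10_000):
        print(m, expected_false_positives(m), prob_any_false_positive(m))
```

Under these assumptions, 20 tests already give a probability of about 0.64 of at least one spurious finding, and the probability approaches 1 rapidly as the number of tests grows.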