2
Historical Successes
To set the stage for recommendations dealing with tomorrow’s productive interaction between the mathematical sciences and biology, this chapter briefly describes some successful interactions of the past. While it is common to hear that biology is only now becoming mathematical, in fact there has been overlap between the two fields for a long time (Cohen, 2004). What is different now is that biology routinely relies on methods from the mathematical sciences to assay and manipulate data and that computational methods have now become powerful enough to model the complexity of biological entities and systems.
THE BEGINNINGS OF POPULATION BIOLOGY
R.A. Fisher, a mathematician trained at Cambridge University, became interested in biology at a crucial time. At the beginning of the 20th century, biologists had rediscovered Mendel’s work, and one of their main challenges was to reconcile it with Darwin’s theory of evolution. Fisher is one of the scientists credited with ushering in the new era that merged genetics and evolution, sometimes referred to as neo-Darwinian theory, through his work that helped establish the field of population biology. He published several papers on the topic, and his 1930 book The Genetical Theory of Natural Selection stands as a landmark of that era. His work demonstrated that statistics is a natural tool for modeling populations.
It is equally interesting to consider how biological data led Fisher to revolutionize the field of statistics. He joined the Rothamsted Experimental Station to apply statistical methods to the mass of data that had been accumulated over many years on field trials of crops. He found that the tools available were inadequate to the task. One review of his work (Yates and Mather, 1963) describes his contribution:
“While at Rothamsted not only did he recast the whole theoretical basis of mathematical statistics, he also developed the modern techniques of the design and analysis of experiments, and was prolific in devising methods to deal with the many and varied problems with which he was confronted by research workers at Rothamsted and elsewhere.”
His 1925 book Statistical Methods for Research Workers, which introduced analysis of variance (ANOVA) methods to statistics, was a revolutionary advance. “Fisher had by that time also established a rigorous framework for maximum-likelihood methods, which continue to play a central role in statistical inference,” according to Aldrich (1997). In 1935 Fisher published The Design of Experiments, which was the first book devoted to that subject. Fisher had a significant impact on both biology and mathematical statistics, and his contributions affected the theory and practice of both.
INFERENCE OF GENE FUNCTION BY HOMOLOGY
In the modern world of biology, where sequences of entire genomes are available and the number of such sequences is growing rapidly, one sees the enormous importance of mathematical and computer science methods in advancing biological knowledge. Algorithms are essential at many stages, from finding overlaps of short, noisy sequence strings, to assembling them into complete chromosomes, and to identifying regions that are likely to code for proteins or carry out other genetic functions.
One of the most important tasks is the inference of a protein’s function. There are close to 1 million different known and predicted proteins in living organisms. Two proteins are said to be homologous if their similarity is due to common ancestry; that is, if they were generated from the same gene in the genome of an ancestral species at one time in the evolutionary past and their sequences have been sufficiently conserved since that time so that they are still recognizably similar. The number of proteins that have had their functions determined experimentally is, at most, in the tens of thousands, meaning that the functions of over 90 percent of all the proteins in our databases are inferred from homology. In some cases this is easy to do. For example, if one protein has its function determined experimentally and another protein is discovered with a nearly identical sequence, then it is an easy, and quite reliable, extrapolation to assign the same function to the new protein. But if the sequences of two proteins differ substantially, it is less clear whether they are really homologous to each other. If they are homologous, they are likely to have the same, or a closely related, function, although there are exceptions. Inferring that two proteins are homologous when they are far from identical in amino acid sequence, and locating their related sequences in a complete genomic sequence, require the application of mathematical and computational methods that have been developed over the last 40 years.
In the 1960s, Emile Zuckerkandl and Linus Pauling (1965) first realized that DNA and protein sequences, molecules they called “semantides” (for information-carrying polymers), contain the history of their divergence from their ancestors. From the information in genetic sequences, one could, they argued, do “paleogenetics” to find the relationships between genes and therefore also between species. This became the field of molecular evolution, which has flourished and become ever more mathematical as the sophistication of the models for evolutionary change has increased and more complex algorithms have become common for inferring evolutionary events and phylogenetic trees. Also in the 1960s, Motoo Kimura (1968) introduced the theory of neutral evolution and its large contribution to sequence divergence. One effect of his work was simply to emphasize the enormous amount of change that can be observed in biological sequences, which makes paleogenetics that much more challenging, because it means large differences in sequence can accumulate without changes in function. Margaret Dayhoff and colleagues (Dayhoff et al., 1965) produced the first Atlas of Protein Sequence and Structure. Among other things, the atlas allowed her to analyze the substitutions observed in closely related proteins and obtain an empirically derived estimate for the rate of substitution of one amino acid for any other. The resulting percentage accepted mutations (PAM) matrices were a much improved measure of the similarity between protein sequences.
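
As a rough illustration of how an empirically derived similarity matrix can be built from observed substitutions, the sketch below converts counts of aligned residue pairs into log-odds scores. The three-letter alphabet, the counts, and the function name `log_odds_matrix` are invented for illustration. The actual PAM construction involves further steps (for example, extrapolating a mutation probability matrix to larger evolutionary distances) that are not reproduced here.

```python
import math

def log_odds_matrix(pair_counts, alphabet):
    """Turn counts of aligned residue pairs into log-odds similarity scores.

    score(a, b) = log2(q_ab / (p_a * p_b)), where q_ab is the observed
    frequency of the aligned pair (a, b) and p_a is the background
    frequency of residue a. A positive score means the pair is seen more
    often than chance alone would predict.
    """
    total = sum(pair_counts.values())
    q = {}
    for a in alphabet:
        for b in alphabet:
            # Counts are stored once per unordered pair.
            key = (a, b) if (a, b) in pair_counts else (b, a)
            count = pair_counts.get(key, 0)
            # Split off-diagonal counts evenly so the matrix is symmetric.
            q[(a, b)] = (count if a == b else count / 2) / total
    # Background frequency of each residue (row sums of q).
    p = {a: sum(q[(a, b)] for b in alphabet) for a in alphabet}
    # Note: a pair with zero count would need a pseudocount; the toy data
    # below has none.
    return {
        (a, b): math.log2(q[(a, b)] / (p[a] * p[b]))
        for a in alphabet
        for b in alphabet
    }

# Hypothetical counts over a toy alphabet (not real protein data).
toy_counts = {
    ("A", "A"): 60, ("B", "B"): 50, ("C", "C"): 40,
    ("A", "B"): 30, ("A", "C"): 10, ("B", "C"): 10,
}
toy_scores = log_odds_matrix(toy_counts, "ABC")
```

In this toy example, identical pairs score positive and the rarely observed A/C substitution scores negative, which is the qualitative behavior one wants from a similarity measure.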
To identify the changes between two proteins, one has to find the correct, or at least an optimal, alignment between them. If they are very different, it is not easy to obtain the optimal alignment. Methods referred to generally as dynamic programming, developed by Richard Bellman in 1953, can obtain optimal alignments in such cases very efficiently. Needleman and Wunsch (1970) first published a dynamic programming algorithm to find the optimum alignment between two biological sequences, and over the next several years several variations of that method were developed, differing in how the alignments were scored and how they treated gaps. Most of the efforts were directed at global alignments, where both sequences are aligned along their entire lengths. The more challenging problem was to find local alignments, where only a portion of the two sequences has significant similarity. Local alignment is needed to compare genomes with each other, or even to ask whether a homologue of a particular gene occurs within a genome sequence. Smith and Waterman (1981) solved the problem of identifying the local alignment using dynamic programming in a way that allows for full use of similarity matrices such as PAM, treats gaps in an intelligent way, and is guaranteed to find the optimal solution efficiently.
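
The local alignment idea can be sketched in a few lines. The version below is a simplified illustration, not the published algorithm in full: it assumes a fixed match/mismatch score in place of a similarity matrix such as PAM, and a simple per-residue (linear) gap penalty. The function name and default scores are hypothetical choices.

```python
def smith_waterman(seq_a, seq_b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    # H[i][j] = best score of any local alignment ending at seq_a[i-1], seq_b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a fresh alignment here
                H[i - 1][j - 1] + s,  # align the two residues
                H[i - 1][j] + gap,    # gap in seq_b
                H[i][j - 1] + gap,    # gap in seq_a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(seq_a[i - 1])
            out_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

# Two unrelated flanks around a shared core.
score, top, bottom = smith_waterman("XXXACGTYYY", "ZZACGTWW")
```

The `max(0, ...)` term is what distinguishes local from global alignment: a poorly matching prefix is simply discarded rather than penalized. The table has one cell per pair of positions, so time and memory both grow with the product of the two sequence lengths.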
While the Smith-Waterman algorithm solves the problem of optimal local alignments, it is not efficient enough for the very large database searches that were becoming necessary by the mid- to late 1980s owing to an exponential increase in database size. The BLAST program, published in 1990, was a major breakthrough (Altschul et al., 1990). This was an important collaboration between two computer scientists (Myers and Miller), a mathematician (Altschul), a medical doctor (Lipman), and a biologist (Gish). It employed a fast heuristic search algorithm for the local alignment problem, and its sensitivity is not much lower than that of the Smith-Waterman algorithm. At the same time, Altschul and Karlin, another mathematician, developed statistical methods to allow computing the significance of the matches found by BLAST (Karlin and Altschul, 1990). When the National Institutes of Health (NIH) made the program available over the newly arrived Internet, biologists around the world suddenly had access to sophisticated database searches to compare their own sequences with the known sequences. Just as the large-scale genome sequencing projects were being contemplated, but before they had truly begun in earnest, this critical piece of software had been developed that would greatly expedite the projects.
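
To give a sense of what computing the significance of a match involves, the sketch below uses the widely quoted Karlin-Altschul form for the expected number of chance local alignments with score at least S: E = K * m * n * exp(-lambda * S), where m and n are the query and database lengths. The parameters `lam` and `K` depend on the scoring scheme and residue frequencies; the numeric values used here are placeholders chosen for illustration, not values taken from the text.

```python
import math

def expected_hits(raw_score, query_len, db_len, lam, K):
    """Expected number of chance alignments scoring at least raw_score."""
    return K * query_len * db_len * math.exp(-lam * raw_score)

def bit_score(raw_score, lam, K):
    """Normalized score; in these units the expected count is m * n * 2**(-bits)."""
    return (lam * raw_score - math.log(K)) / math.log(2)

# Placeholder parameters (assumptions for illustration only).
LAM, K_PARAM = 0.3, 0.1
QUERY_LEN, DB_LEN = 300, 10_000_000

e_weak = expected_hits(40, QUERY_LEN, DB_LEN, LAM, K_PARAM)
e_strong = expected_hits(80, QUERY_LEN, DB_LEN, LAM, K_PARAM)
```

Because the expected count falls exponentially with score, a modest increase in alignment score turns a match that chance would produce many times per search into one that chance would almost never produce.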
EVOLUTIONARY PROCESSES IN POPULATIONS
In the early 1980s population genetics theory took a dramatic turn. Before that time, most theoretical work was focused on the analysis of allele frequencies for two, or perhaps a few, variants in just one or two genes. Interest focused on the frequency spectrum or heterozygosity that one observed at enzyme loci, using protein gel electrophoresis to assay variation. The theory was based primarily on diffusion approximations of two-allele systems (higher-dimensional systems being intractable). When it became clear that surveys of DNA polymorphism would become available via resequencing, it was obvious that some different quantities would need to be studied. With this new kind of data, one would know not only the number of alleles and their frequencies in a sample but also, from long stretches of linked sites, the number of mutational steps by which all the sequences differed from each other. This opened a new window on the evolutionary processes occurring in populations.
To understand the variation revealed by sequencing of alleles, several investigators at that time began focusing on the distributional properties of gene trees. Gene trees, under the standard finite-population-size models, are random structures. The early work, spearheaded by Kingman (1982) and Tajima (1983), showed that many properties of sequence variation could be simply understood in terms of the properties of the genealogical tree relating the sampled sequences. Many other players quickly entered the arena. This area of population genetics became known as coalescent theory. The early theory dealt with the simplest models, which had constant population size, no spatial structure, no recombination, and no selection. Over the next 15 years the theory was generalized to cover models in which all of these limitations had been removed. It is now routine to think about genetic variation in terms of the size and shape of gene trees that relate sampled sequences. It is also routine to simulate samples under many models using efficient algorithms based on the coalescent approach. The models of the coalescent process are often very simple to describe, consisting of relatively straightforward Markov chains, but the genealogical structures that arise are in some cases surprisingly challenging to analyze and rich in their connections with other areas of stochastic processes. For example, coalescent methods for models with selection have connections with the theory of interacting particles and dual processes (Krone and Neuhauser, 1997).
The coalescent approach has led to new insights about the models, to new analytic results (Krone and Neuhauser, 1997), to numerical methods for obtaining likelihoods (Kuhner et al., 2000; Griffiths and Tavaré, 1995), and to very efficient simulation algorithms (Hudson, 1983).
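
A minimal sketch of why coalescent simulation is efficient, assuming the simplest model described above (constant population size, no structure, no recombination, no selection): only the ancestors of the sample are tracked, and while k lineages remain, the waiting time to the next merger is exponential with rate k(k-1)/2 when time is scaled so that each pair of lineages coalesces at rate 1. The function names are hypothetical.

```python
import random

def simulate_coalescent_times(n, rng):
    """Waiting times spent with n, n-1, ..., 2 lineages (coalescent time units)."""
    times = []
    for k in range(n, 1, -1):
        rate = k * (k - 1) / 2  # number of lineage pairs, each merging at rate 1
        times.append(rng.expovariate(rate))
    return times

def expected_tmrca(n):
    """Mean time to the most recent common ancestor: sum of 2/(k(k-1)) for k = 2..n."""
    return 2 * (1 - 1 / n)

rng = random.Random(1)
replicates = 20000
mean_tmrca = sum(
    sum(simulate_coalescent_times(10, rng)) for _ in range(replicates)
) / replicates
```

Each replicate for a sample of 10 sequences needs only nine random draws, regardless of how large the underlying population is, and the simulated mean can be checked against the closed-form expectation.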
MODELING
As a further illustration of successful interactions between mathematics and biology, consider two separate historical examples of mathematical modeling. The development of the Hodgkin-Huxley equations (Hodgkin and Huxley, 1952) to describe the time course of action potentials was of profound importance. Their description of ionic currents through ion-selective channels provided a paradigm that is still used extensively today in models of cellular electrophysiology. The understanding of excitability that came from their model is also of remarkably general applicability. Perhaps more significant, however, was the recognition that spatially extended systems of excitable components could support waves of invariant form and allow for robust signaling over great distances.
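
The excitability described above can be reproduced with a short simulation of a single patch of membrane. This is a sketch under stated assumptions: it uses the commonly quoted squid-axon parameter set in the modern sign convention (resting potential near -65 mV), a forward-Euler integrator, and no spatial extent, so it shows spiking but not wave propagation. All names and the step size are illustrative choices.

```python
import math

C_M = 1.0                              # membrane capacitance, uF/cm^2
G_NA, G_K, G_L = 120.0, 36.0, 0.3      # maximal conductances, mS/cm^2
E_NA, E_K, E_L = 50.0, -77.0, -54.387  # reversal potentials, mV

def _ratio(x, y):
    """x / (1 - exp(-x / y)), using the limit y + x/2 near x = 0."""
    if abs(x / y) < 1e-6:
        return y + x / 2
    return x / (1 - math.exp(-x / y))

def rates(v):
    """Voltage-dependent opening (a_*) and closing (b_*) rates, in 1/ms."""
    a_m = 0.1 * _ratio(v + 40, 10)
    b_m = 4.0 * math.exp(-(v + 65) / 18)
    a_h = 0.07 * math.exp(-(v + 65) / 20)
    b_h = 1.0 / (1 + math.exp(-(v + 35) / 10))
    a_n = 0.01 * _ratio(v + 55, 10)
    b_n = 0.125 * math.exp(-(v + 65) / 80)
    return a_m, b_m, a_h, b_h, a_n, b_n

def simulate(i_ext=10.0, t_max=50.0, dt=0.01):
    """Return the membrane potential trace (mV) for a constant current (uA/cm^2)."""
    v = -65.0
    a_m, b_m, a_h, b_h, a_n, b_n = rates(v)
    # Start the gating variables at their steady-state values for v.
    m, h, n = a_m / (a_m + b_m), a_h / (a_h + b_h), a_n / (a_n + b_n)
    trace = []
    for _ in range(int(t_max / dt)):
        a_m, b_m, a_h, b_h, a_n, b_n = rates(v)
        i_na = G_NA * m ** 3 * h * (v - E_NA)  # sodium current
        i_k = G_K * n ** 4 * (v - E_K)         # potassium current
        i_l = G_L * (v - E_L)                  # leak current
        v += dt * (i_ext - i_na - i_k - i_l) / C_M
        m += dt * (a_m * (1 - m) - b_m * m)
        h += dt * (a_h * (1 - h) - b_h * h)
        n += dt * (a_n * (1 - n) - b_n * n)
        trace.append(v)
    return trace

spiking = simulate(i_ext=10.0)
resting = simulate(i_ext=0.0)
```

With no injected current the membrane stays near rest, while a sufficiently large constant current produces action potentials that overshoot 0 mV.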
It is now understood that excitability across spatially extended networks provides the basis for communication and control of many fundamental biological processes. Communication along one-dimensional excitable pathways is remarkably robust and reliable. Yet, in two- and three-dimensional spatially extended networks, other robust patterns (e.g., re-entrant spirals) can arise that overrun the normal function and lead to serious pathology. As a consequence, there is currently a significant effort to understand how to prevent these other naturally occurring, but pathological, patterns and how to get rid of them when they do occur.
A second illustrative example of modeling success is the suggestion of the existence of dendro-dendritic synapses in the olfactory bulb by Rall and Shepherd (1968). It was widely believed, before that time, that axons were excitable, dendrites were passive, and synaptic connections were made between axons and dendrites only. In order to match certain extracellular potentials that were measured experimentally in the olfactory bulb, Rall and Shepherd found that a compartment model with active dendrites was required. As a result, they suggested that dendro-dendritic pathways were likely to exist, and that they provided a novel mechanism for recurrent inhibition. It was only years later that experimental technique improved to the point where such synapses were indeed found to exist. This is an example of the healthy interplay between modeling and experiment, wherein each drives the other in the pursuit of more complete understanding.
MEDICAL AND BIOLOGICAL IMAGING
The last few decades have seen dramatic advances in imaging technology. In medicine, magnetic resonance imaging (MRI) and computed x-ray tomography (CT) are playing increasingly important roles in both diagnosis and treatment, with new applications emerging every year. The importance of the mathematical sciences to biomedical imaging was emphasized in the 1996 National Research Council report Mathematics and Physics of Emerging Biomedical Imaging:
“While exponential improvements in computing power have contributed to the development of today’s biomedical imaging capabilities, computing power alone does not account for the dramatic expansion of the field, nor will future improvements in computer hardware be a sufficient springboard to enable the development of the biomedical imaging tools described in this report. That development will require continued research in physics and the mathematical sciences, fields that have contributed greatly to biomedical imaging and will continue to do so.” (p. 9)
For example, the mathematical foundations for image reconstruction in x-ray CT date back to the work of Johann Radon in the early 1900s. It was around 1970, however, that machines first provided images of value in medical diagnosis, mainly owing to the efforts of A.M. Cormack and G.N. Hounsfield. They observed that by measuring the net attenuation along large numbers of individual x-ray pencil beams, one could reconstruct the attenuation coefficient point by point across a complete cross section of the human body. Nevertheless, several hours of computation time were required to obtain a single image, and the quality was relatively poor. The original numerical methods for reconstruction were based on iterative relaxation of a system of equations, with each equation representing the discretization of an integral measuring the net attenuation along a single line. When Shepp and Logan (1974) and others introduced filtered back projection, it was possible to substantially improve both image quality and reconstruction time, and CT scans became much more practical.
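
The idea of iterative relaxation can be illustrated on a toy problem. In the sketch below, each "ray" is one equation stating that the attenuation values of the pixels it crosses sum to the measured value, and the estimate is repeatedly corrected one equation at a time (a Kaczmarz-style update, which is one common form of such relaxation; the text does not say which exact scheme the first scanners used). The 2-by-2 image, the ray layout, and the function name are invented for illustration.

```python
def relax_reconstruct(rays, measurements, n_pixels, sweeps=200):
    """Estimate pixel values from ray sums by cycling through the equations.

    rays[i] lists the pixel indices crossed by ray i (unit path length in
    each pixel); measurements[i] is the measured sum along that ray.
    """
    x = [0.0] * n_pixels
    for _ in range(sweeps):
        for ray, measured in zip(rays, measurements):
            residual = measured - sum(x[p] for p in ray)
            # Spread the mismatch evenly over the pixels on this ray, which
            # projects the estimate onto the solutions of this one equation.
            correction = residual / len(ray)
            for p in ray:
                x[p] += correction
    return x

# Toy 2x2 image, pixels numbered row by row: [[0, 1], [2, 3]].
true_image = [1.0, 2.0, 3.0, 4.0]
rays = [[0, 1], [2, 3],   # two horizontal rays
        [0, 2], [1, 3],   # two vertical rays
        [0, 3], [1, 2]]   # two diagonal rays
measurements = [sum(true_image[p] for p in ray) for ray in rays]
estimate = relax_reconstruct(rays, measurements, n_pixels=4)
```

Even in this tiny case the cost structure is visible: every sweep touches every ray, so a realistic image with very many rays and pixels requires a great deal of arithmetic, which is consistent with the long computation times mentioned above.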
For the purposes of this chapter, it is worth highlighting a very specific mathematical contribution and discussing its ramifications in the context of computational biology more broadly. That contribution was the introduction of the mathematical “phantom” by Shepp and Logan (1974) and Shepp and Kruskal (1978). Consider the situation where one is seeking to compare the performance of a variety of reconstruction methods. The standard approach before Shepp and Logan’s work had been to create actual physical models with known characteristics, from which data were measured and reconstruction performed. This seems natural, but errors may stem from inaccuracies in performing the physical experiment as well as in the reconstruction. The Shepp-Logan phantom is a mathematically defined function from which exact (artificial) data can be created, including any desired noise model. Its importance was made clear when Shepp and Logan turned their attention to a ring of high density slightly inside the skull that was observed when the first CT machine was introduced and that was believed to be a previously unrecognized anatomic feature. The use of mathematical phantoms was instrumental in showing that this ring was in fact an artifact of the reconstruction algorithm.
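
The value of a mathematically defined phantom is that its projections are known in closed form. The sketch below is a hypothetical stand-in built from circular discs, not the actual Shepp-Logan phantom (whose specific shapes and parameters are not given in the text); it relies only on the fact that a line passing at distance d from the center of a uniform disc of radius r and density rho accumulates 2 * rho * sqrt(r^2 - d^2).

```python
import math

# Hypothetical phantom: (center_x, center_y, radius, added_density) per disc.
PHANTOM = [
    (0.0, 0.0, 1.0, 1.0),  # large outer disc
    (0.3, 0.2, 0.2, 0.5),  # small denser inclusion
]

def phantom_value(x, y, discs=PHANTOM):
    """Density at a point: the sum of the densities of all discs containing it."""
    return sum(
        rho for cx, cy, r, rho in discs
        if (x - cx) ** 2 + (y - cy) ** 2 <= r * r
    )

def exact_projection(theta, s, discs=PHANTOM):
    """Exact line integral along the line x*cos(theta) + y*sin(theta) = s."""
    total = 0.0
    for cx, cy, r, rho in discs:
        # Signed distance from the disc center to the line.
        d = s - (cx * math.cos(theta) + cy * math.sin(theta))
        if abs(d) < r:
            total += 2 * rho * math.sqrt(r * r - d * d)
    return total

through_inclusion = exact_projection(0.0, 0.3)
outside = exact_projection(0.0, 2.0)
```

Because both the image and its projections are exact, any feature that appears in a reconstruction but not in `phantom_value` must come from the reconstruction method itself, and noise can be added to the exact projections in a controlled way.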
SUMMARY
This chapter gives some indication, though certainly not an exhaustive account, of the long history of interaction between the mathematical and biological sciences. It also demonstrates, by example, the breadth of that interaction: Many areas of biology have been affected by many areas of mathematical science, and the challenges of biology have also prompted advances of importance to the mathematical sciences themselves. Sometimes the benefits of mathematical sciences research have been direct, and sometimes they have arisen in ways that were not predicted. As these examples show, the right mathematical approach can have a dramatic impact on whether or not a particular biological construct is feasible (for example, in the case of finding protein homologues or reconstructing CT scans), and developing the right mathematical representation of a phenomenon can enable very productive research (for example, in the study of populations or exploring signaling mechanisms).
REFERENCES
Aldrich, J. 1997. R.A. Fisher and the making of maximum likelihood 1912-1922. Statist. Sci. 12: 162-176.
Altschul, S.F., W. Gish, W. Miller, E.W. Myers, and D.J. Lipman. 1990. Basic local alignment search tool. J. Mol. Biol. 215(3): 403-410.
Cohen, J.E. 2004. Mathematics is biology’s next microscope, only better; biology is mathematics’ next physics, only better. PLoS Biol. 2(12): e439.
Dayhoff, M.O., R.V. Eck, M.A. Chang, and M.R. Sochard. 1965. Atlas of Protein Sequence and Structure. Silver Spring, Md.: National Biomedical Research Foundation.
Fisher, R.A. 1925. Statistical Methods for Research Workers. Edinburgh: Oliver and Boyd.
Fisher, R.A. 1930. The Genetical Theory of Natural Selection. Oxford: Clarendon Press.
Fisher, R.A. 1935. The Design of Experiments. Edinburgh: Oliver and Boyd.
Griffiths, R.C., and S. Tavaré. 1995. Unrooted genealogical tree probabilities in the infinitely-many-sites model. Math. Biosci. 127(1): 77-98.
Hodgkin, A.L., and A.F. Huxley. 1952. A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. 117: 500-544.
Hudson, R.R. 1983. Properties of a neutral allele model with intragenic recombination. Theor. Popul. Biol. 23(2): 183-201.
Karlin, S., and S.F. Altschul. 1990. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. U.S.A. 87(6): 2264-2268.
Kimura, M. 1968. Evolutionary rate at the molecular level. Nature 217(129): 624-626.
Kingman, J.F.C. 1982. On the genealogy of large populations. J. Appl. Prob. 19A: 27-43.
Krone, S.M., and C. Neuhauser. 1997. Ancestral processes with selection. Theor. Popul. Biol. 51(3): 210-237.
Kuhner, M.K., J. Yamato, and J. Felsenstein. 2000. Maximum likelihood estimation of recombination rates from population data. Genetics 156(3): 1393-1401.
National Research Council. 1996. Mathematics and Physics of Emerging Biomedical Imaging. Washington, D.C.: National Academy Press.
Needleman, S.B., and C.D. Wunsch. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48(3): 443-453.
Rall, W., and G.M. Shepherd. 1968. Theoretical reconstruction of field potentials and dendrodendritic synapse interactions in olfactory bulb. J. Neurophysiol. 31: 884-915.
Shepp, L.A., and J.B. Kruskal. 1978. Computerized tomography: The new medical x-ray technology. Am. Math. Mon. XX: 420-439.
Shepp, L.A., and B.F. Logan. 1974. The Fourier reconstruction of a head section. IEEE Trans. Nucl. Sci. 21: 21-43.
Smith, T.F., and M.S. Waterman. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147(1): 195-197.
Tajima, F. 1983. Evolutionary relationships of DNA sequences in finite populations. Genetics 105: 437-460.
Yates, F., and K. Mather. 1963. Ronald Aylmer Fisher 1890-1962. Pp. 91-120 in Biographical Memoirs of Fellows of the Royal Society of London, Vol. 9. London, England: The Royal Society.
Zuckerkandl, E., and L. Pauling. 1965. Molecules as documents of evolutionary history. J. Theor. Biol. 8: 357-366.