To set the stage for recommendations dealing with tomorrow’s productive interaction between the mathematical sciences and biology, this chapter briefly describes some successful interactions of the past. While it is common to hear that biology is only now becoming mathematical, in fact there has been overlap between the two fields for a long time (Cohen, 2004). What is different now is that biology routinely relies on methods from the mathematical sciences to assay and manipulate data, and that computational methods have become powerful enough to model the complexity of biological entities and systems.
THE BEGINNINGS OF POPULATION BIOLOGY
R.A. Fisher, a mathematician trained at Cambridge University, became interested in biology at a crucial time. At the beginning of the 20th century, biologists had rediscovered Mendel’s work, and one of their main challenges was to reconcile it with Darwin’s theory of evolution. Fisher is one of the scientists credited with ushering in the new era that merged genetics and evolution, sometimes referred to as neo-Darwinian theory, through his work that helped establish the field of population biology. He published several papers on the topic, and his 1930 book The Genetical Theory of Natural Selection stands as a landmark of that era. His work demonstrated that statistics is a natural tool for modeling populations.
It is equally interesting to consider how biological data led Fisher to revolutionize the field of statistics. He joined the Rothamsted Experimental Station to apply statistical methods to the mass of data that had been accumulated over many years on field trials of crops. He found that the tools available were inadequate to the task. One review of his work (Yates and Mather, 1963) describes his contribution:
While at Rothamsted not only did he recast the whole theoretical basis of mathematical statistics, he also developed the modern techniques of the design and analysis of experiments, and was prolific in devising methods to deal with the many and varied problems with which he was confronted by research workers at Rothamsted and elsewhere.
His 1925 book Statistical Methods for Research Workers, which introduced analysis of variance (ANOVA) methods to statistics, was a revolutionary advance. “Fisher had by that time also established a rigorous framework for maximum-likelihood methods, which continue to play a central role in statistical inference,” according to Aldrich (1997). In 1935 Fisher published The Design of Experiments, which was the first book devoted to that subject. Fisher had a significant impact on both biology and mathematical statistics, and his contributions affected the theory and practice of both.
INFERENCE OF GENE FUNCTION BY HOMOLOGY
In the modern world of biology, where sequences of entire genomes are available and the number of such sequences is growing rapidly, one sees the enormous importance of mathematical and computer science methods in advancing biological knowledge. Algorithms are essential at many stages, from finding overlaps of short, noisy sequence strings, to assembling them into complete chromosomes, and to identifying regions that are likely to code for proteins or carry out other genetic functions.
One of the most important tasks is the inference of a protein’s function. There are close to 1 million different known and predicted proteins in living organisms. Two proteins are said to be homologous if their similarity is due to common ancestry, that is, if they were generated from the same gene in the genome of an ancestral species at one time in the evolutionary past and their sequences have been sufficiently conserved since that time so that they are still recognizably similar. The number of proteins that have had their functions determined experimentally is, at most, in the tens of thousands, meaning that the functions of over 90 percent of all the proteins in our databases are inferred from homology. In some cases this is easy to do. For example, if one protein has its function determined experimentally and another protein is discovered with a nearly identical sequence, then it is an easy, and quite reliable, extrapolation to assign the same function to the new protein. But, if the sequences of two proteins differ substantially, it is less clear whether they are really homologous to each other. If they are homologous, they are likely to have the same, or a closely related, function, although there are exceptions. Inferring that two proteins are homologous when they are far from identical in amino acid sequence and locating their related sequences in a complete genomic sequence requires the application of mathematical and computational methods that have been developed over the last 40 years.
In the 1960s, Emile Zuckerkandl and Linus Pauling (1965) first realized that DNA and protein sequences, molecules they called “semantides” (for information-carrying polymers), contain the history of their divergence from their ancestors. From the information in genetic sequences, one could, they argued, do “paleogenetics” to find the relationships between genes and therefore also between species. This became the field of molecular evolution, which has flourished and become ever more mathematical as the sophistication of the models for evolutionary change has increased and more complex algorithms have become common for inferring evolutionary events and phylogenetic trees. Also in the 1960s, Motoo Kimura (1968) introduced the theory of neutral evolution and argued that neutral changes make a large contribution to sequence divergence. One effect of his work was simply to emphasize the enormous amount of change that can be observed in biological sequences, which makes paleogenetics that much more challenging, because it means large differences in sequence can accumulate without changes in function. Margaret Dayhoff (1965) produced the first Atlas of Protein Sequence and Structure. Among other things, the atlas allowed her to analyze the substitutions observed in closely related proteins and obtain an empirically derived estimate for the rate of substitution of one amino acid for any other. The resulting percentage accepted mutations (PAM) matrices were a much improved measure of the similarity between protein sequences.
To identify the changes between two proteins, one has to find the correct, or at least an optimal, alignment between them. If they are very different, it is not easy to obtain the optimal alignment. Methods referred to generally as dynamic programming, developed by Richard Bellman in 1953, can obtain optimal alignments in such cases very efficiently. Needleman and Wunsch (1970) first published a dynamic programming algorithm to find the optimal alignment between two biological sequences, and in the years that followed several variations of that method were developed, differing in how the alignments were scored and how they treated gaps. Most of the efforts were directed at global alignments, where both sequences are aligned along their entire lengths. The more challenging problem was to find local alignments, where only a portion of the two sequences has significant similarity. Local alignment is needed to compare genomes with each other, or even to ask whether a homologue of a particular gene occurs within a genome sequence. Smith and Waterman (1981) solved the problem of identifying the local alignment using dynamic programming in a way that allows for full use of similarity matrices such as PAM, treats gaps in an intelligent way, and is guaranteed to find the optimal solution efficiently.
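The dynamic programming idea behind local alignment can be illustrated with a short Python sketch. It is a minimal illustration, not the published algorithm in full: it assumes a simple match/mismatch score in place of a similarity matrix such as PAM and a linear gap penalty, and the function name and default scores are hypothetical choices made for this example.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one best local alignment.

    Scoring is an assumption for illustration: a fixed match/mismatch
    score stands in for a similarity matrix, and each gap position
    costs the same (linear) penalty.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of any local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a fresh local alignment
                H[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,    # gap in b
                H[i][j - 1] + gap,    # gap in a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(smith_waterman("XXXACGTYYY", "ZZACGTWW"))
```

Every cell of the table is filled once, so the work grows with the product of the two sequence lengths. Resetting negative scores to zero is what lets the alignment begin and end anywhere, which distinguishes the local problem from the global one.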
While the Smith-Waterman algorithm solves the problem of optimal local alignments, it is not efficient enough for the very large database searches that were becoming necessary by the mid- to late 1980s owing to an exponential increase in database size. The BLAST program, published in 1990, was a major breakthrough (Altschul et al., 1990). It grew out of a collaboration among two computer scientists (Myers and Miller), a mathematician (Altschul), a medical doctor (Lipman), and a biologist (Gish). BLAST employs a fast heuristic search algorithm for the local alignment problem, and its sensitivity is not much reduced from that of the Smith-Waterman algorithm. At the same time, Altschul and Karlin, another mathematician, developed statistical methods for computing the significance of the matches found by BLAST (Karlin and Altschul, 1990). When the National Institutes of Health (NIH) made the program available over the newly arrived Internet, biologists around the world suddenly had access to sophisticated database searches to compare their own sequences with the known sequences. Just as the large-scale genome sequencing projects were being contemplated, but before they had truly begun in earnest, this critical piece of software had been developed that would greatly expedite the projects.
EVOLUTIONARY PROCESSES IN POPULATIONS
In the early 1980s population genetics theory took a dramatic turn. Before that time, most theoretical work was focused on the analysis of allele frequencies for two, or perhaps a few, variants in just one or two genes. Interest focused on the frequency spectrum or heterozygosity that one observed at enzyme loci, using protein gel electrophoresis to assay variation. The theory was based primarily on diffusion approximations of two-allele systems (higher-dimensional systems being intractable). When it became clear that surveys of DNA polymorphism would become available via resequencing, it was obvious that some different quantities would need to be studied. With this new kind of data, one would know not only the number of alleles and their frequencies in a sample but also, from long stretches of linked sites, the number of mutational steps by which all the sequences differed from each other. This opened a new window on the evolutionary processes occurring in populations.
To understand the variation revealed by sequencing of alleles, several investigators at that time began focusing on the distributional properties of gene trees. Gene trees, under the standard finite-population-size models, are random structures. The early work, spearheaded by Kingman (1982) and Tajima (1983), showed that many properties of sequence variation could be simply understood in terms of the properties of the genealogical tree relating the sampled sequences. Many other players quickly entered the arena. This area of population genetics became known as coalescent theory. The early theory dealt with the simplest models, which had constant population size, no spatial structure, no recombination, and no selection. Over the next 15 years the theory was generalized to cover models in which all of these limitations had been removed. It is now routine to think about genetic variation in terms of the size and shape of gene trees that relate sampled sequences. It is also routine to simulate samples under many models using efficient algorithms based on the coalescent approach. The models of the coalescent process are often very simple to describe, consisting of relatively straightforward Markov chains, but the genealogical structures that arise are in some cases surprisingly challenging to analyze and rich in their connections with other areas of stochastic processes. For example, coalescent methods for models with selection have connections with the theory of interacting particles and dual processes (Krone and Neuhauser, 1997).
The coalescent approach has led to new insights about the models, to new analytic results (Krone and Neuhauser, 1997), to numerical methods for obtaining likelihoods (Kuhner et al., 2000; Griffiths and Tavaré, 1995), and to very efficient simulation algorithms (Hudson, 1983).
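To make the simulation idea concrete, the following Python sketch generates a random gene tree under the simplest model described above (constant population size, no spatial structure, no recombination, no selection). It assumes the standard Kingman formulation, in which k lineages wait an exponentially distributed time with rate k(k-1)/2 before a randomly chosen pair merges, with time measured in population-size-scaled units. The function names are hypothetical, and the sketch is not taken from any of the cited papers.

```python
import random


def simulate_genealogy(n, rng):
    """Simulate one coalescent genealogy for a sample of n sequences.

    Returns a list of (time, merged_labels) events, oldest last. Each
    lineage is represented by the tuple of sample labels descending
    from it. Time is in scaled coalescent units (an assumption of the
    standard model, not a number of generations).
    """
    lineages = [(i,) for i in range(n)]
    t = 0.0
    events = []
    while len(lineages) > 1:
        k = len(lineages)
        # Waiting time until the next coalescence among k lineages.
        t += rng.expovariate(k * (k - 1) / 2)
        # Every pair of lineages is equally likely to coalesce.
        i, j = rng.sample(range(k), 2)
        merged = lineages[i] + lineages[j]
        lineages = [lin for idx, lin in enumerate(lineages)
                    if idx not in (i, j)]
        lineages.append(merged)
        events.append((t, merged))
    return events


def tmrca(events):
    """Time to the most recent common ancestor of the whole sample."""
    return events[-1][0]


if __name__ == "__main__":
    rng = random.Random(0)
    n, reps = 10, 2000
    mean = sum(tmrca(simulate_genealogy(n, rng)) for _ in range(reps)) / reps
    # Under this model the expected value is 2 * (1 - 1/n).
    print(mean, 2 * (1 - 1 / n))
```

Because only the n - 1 coalescence events are simulated, rather than every individual in every generation, a genealogy for a sample costs very little to generate. Mutations could then be scattered along the branches to produce simulated sequence data.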
As a further illustration of successful interactions between mathematics and biology, consider two separate historical examples of mathematical modeling. The development of the Hodgkin-Huxley equations (Hodgkin and Huxley, 1952) to describe the time course of action potentials was of profound importance. Their description of ionic currents through ion-selective channels provided a paradigm that is still used extensively today in models of cellular electrophysiology. The understanding of excitability that came from their model is also of remarkably general applicability. Perhaps more significant, however, was the recognition that spatially extended systems of excitable components could support waves of invariant form and allow for robust signaling over great distances.
It is now understood that excitability in spatially extended networks provides the basis for communication and control of many fundamental biological processes. Communication along one-dimensional excitable pathways is remarkably robust and reliable. Yet, in two- and three-dimensional spatially extended networks, other robust patterns (e.g., re-entrant spirals) can arise that overrun the normal function and lead to serious pathology. As a consequence, there is currently a significant effort to understand how to prevent these other naturally occurring, but pathological, patterns and how to get rid of them when they do occur.
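Excitability in a single patch of membrane can be illustrated with a small numerical sketch of the Hodgkin-Huxley model. The parameter values and rate functions below are the commonly quoted squid-axon values in the modern sign convention (resting potential near -65 mV), which differs from the convention of the 1952 paper. The forward Euler integrator, the step size, and the function names are assumptions made for this example, and the sketch covers only a space-clamped membrane, not wave propagation.

```python
import math


def _vtrap(x, y):
    """Evaluate x / (1 - exp(-x / y)), using its limit near x = 0."""
    if abs(x / y) < 1e-6:
        return y * (1.0 + x / (2.0 * y))
    return x / (1.0 - math.exp(-x / y))


def rates(v):
    """Opening and closing rates (1/ms) of the m, h, n gates at v (mV)."""
    a_m = 0.1 * _vtrap(v + 40.0, 10.0)
    b_m = 4.0 * math.exp(-(v + 65.0) / 18.0)
    a_h = 0.07 * math.exp(-(v + 65.0) / 20.0)
    b_h = 1.0 / (1.0 + math.exp(-(v + 35.0) / 10.0))
    a_n = 0.01 * _vtrap(v + 55.0, 10.0)
    b_n = 0.125 * math.exp(-(v + 65.0) / 80.0)
    return a_m, b_m, a_h, b_h, a_n, b_n


def simulate_hh(i_ext, t_max=50.0, dt=0.01):
    """Integrate the membrane equation with forward Euler.

    i_ext is a constant injected current (uA/cm^2); returns the list of
    membrane potentials (mV) at each time step.
    """
    c_m = 1.0                              # uF/cm^2
    g_na, g_k, g_l = 120.0, 36.0, 0.3      # mS/cm^2
    e_na, e_k, e_l = 50.0, -77.0, -54.387  # mV
    v = -65.0
    a_m, b_m, a_h, b_h, a_n, b_n = rates(v)
    # Start the gates at their steady-state values for the resting potential.
    m = a_m / (a_m + b_m)
    h = a_h / (a_h + b_h)
    n = a_n / (a_n + b_n)
    trace = [v]
    for _ in range(int(round(t_max / dt))):
        a_m, b_m, a_h, b_h, a_n, b_n = rates(v)
        i_ion = (g_na * m ** 3 * h * (v - e_na)
                 + g_k * n ** 4 * (v - e_k)
                 + g_l * (v - e_l))
        v_next = v + dt * (i_ext - i_ion) / c_m
        m += dt * (a_m * (1.0 - m) - b_m * m)
        h += dt * (a_h * (1.0 - h) - b_h * h)
        n += dt * (a_n * (1.0 - n) - b_n * n)
        v = v_next
        trace.append(v)
    return trace


if __name__ == "__main__":
    print("peak with no stimulus:", max(simulate_hh(0.0)))
    print("peak with stimulus:   ", max(simulate_hh(10.0)))
```

Without injected current the membrane stays near rest, whereas a sufficiently large current triggers full action potentials. This all-or-none response is the excitability that the text refers to.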
A second illustrative example of modeling success is the suggestion of the existence of dendro-dendritic synapses in the olfactory bulb by Rall and Shepherd (1968). It was widely believed, before that time, that axons were excitable, dendrites were passive, and synaptic connections were made between axons and dendrites only. In order to match certain extracellular potentials that were measured experimentally in the olfactory bulb, Rall and Shepherd found that a compartment model with active dendrites was required. As a result, they suggested that dendro-dendritic pathways were likely to exist, and that they provided a novel mechanism for recurrent inhibition. It was only years later that experimental technique improved to the point where such synapses were indeed found to exist. This is an example of the healthy interplay between modeling and experiment, wherein each drives the other in the pursuit of more complete understanding.
MEDICAL AND BIOLOGICAL IMAGING
The last few decades have seen dramatic advances in imaging technology. In medicine, magnetic resonance imaging (MRI) and computed x-ray tomography (CT) are playing increasingly important roles in both diagnosis and treatment, with new applications emerging every year. The importance of the mathematical sciences to biomedical imaging was emphasized in the 1996 National Research Council report Mathematics and Physics of Emerging Biomedical Imaging:
While exponential improvements in computing power have contributed to the development of today’s biomedical imaging capabilities, computing power alone does not account for the dramatic expansion of the field, nor will future improvements in computer hardware be a sufficient springboard to enable the development of the biomedical imaging tools described in this report. That development will require continued research in physics and the mathematical sciences, fields that have contributed greatly to biomedical imaging and will continue to do so. (p. 9)
For example, the mathematical foundations for image reconstruction in x-ray CT date back to the work of Johann Radon in the early 1900s. It was around 1970, however, that machines first provided images of value in medical diagnosis, mainly owing to the efforts of A.M. Cormack and G.N. Hounsfield. They observed that by measuring the net attenuation along large numbers of individual x-ray pencil beams, one could reconstruct the attenuation coefficient point by point across a complete cross section of the human body. Nevertheless, several hours of computation time were required to obtain a single image, and the quality was relatively poor. The original numerical methods for reconstruction were based on iterative relaxation of a system of equations, with each equation representing the discretization of an integral measuring the net attenuation along a single line. When Shepp and Logan (1974) and others introduced filtered back projection, it was possible to substantially improve both image quality and reconstruction time, and CT scans became much more practical.
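The iterative relaxation approach can be sketched on a toy problem. The example below assumes a Kaczmarz-style row-action update, which is one common relaxation scheme and not necessarily the exact procedure used in the first scanners. The 2 x 2 image, its six ray sums, and the function name are invented for illustration.

```python
def kaczmarz(rows, measurements, n_unknowns, sweeps=200, relax=1.0):
    """Solve rows * x = measurements by cyclic relaxation.

    Each row holds the weights with which one ray crosses each pixel;
    each measurement is the net attenuation along that ray. One
    equation at a time, the current image is corrected so that it
    satisfies that equation (fully when relax == 1).
    """
    x = [0.0] * n_unknowns
    for _ in range(sweeps):
        for a, b in zip(rows, measurements):
            norm2 = sum(c * c for c in a)
            if norm2 == 0.0:
                continue  # a ray that misses every pixel carries no information
            residual = b - sum(c * xi for c, xi in zip(a, x))
            step = relax * residual / norm2
            x = [xi + step * c for xi, c in zip(x, a)]
    return x


if __name__ == "__main__":
    # Hypothetical 2 x 2 image stored row by row: [[1, 2], [3, 4]].
    true_image = [1.0, 2.0, 3.0, 4.0]
    rays = [
        [1, 1, 0, 0], [0, 0, 1, 1],  # two horizontal rays
        [1, 0, 1, 0], [0, 1, 0, 1],  # two vertical rays
        [1, 0, 0, 1], [0, 1, 1, 0],  # two diagonal rays
    ]
    data = [sum(w * p for w, p in zip(ray, true_image)) for ray in rays]
    print(kaczmarz(rays, data, 4))
```

A realistic cross section has many thousands of unknowns and rays, and sweeping repeatedly through all of the equations is what made the early reconstructions so slow.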
For the purposes of this chapter, it is worth highlighting a very specific mathematical contribution and discussing its ramifications in the context of computational biology more broadly. That contribution was the introduction of the mathematical “phantom” by Shepp and Logan (1974) and Shepp and Kruskal (1978). Consider the situation where one is seeking to compare the performance of a variety of reconstruction methods. The standard approach before Shepp and Logan’s work had been to create actual physical models with known characteristics, from which data were measured and reconstruction performed. This seems natural, but errors may stem from inaccuracies in performing the physical experiment as well as in the reconstruction. The Shepp-Logan phantom is a mathematically defined function from which exact (artificial) data can be created, including any desired noise model. Its importance was made clear when Shepp and Logan turned their attention to a ring of high density slightly inside the skull that was observed when the first CT machine was introduced and that was believed to be a previously unrecognized anatomic feature. The use of mathematical phantoms was instrumental in showing that this ring was in fact an artifact of the reconstruction algorithm.
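The idea of a mathematical phantom can be sketched in a few lines of Python. The phantom below is a sum of constant-density ellipses, for which the line integral along any ray has a closed form, so exact data can be generated without a physical experiment. The ellipse parameters are illustrative values chosen for this example and are not the published Shepp-Logan table; the function names and the optional Gaussian noise model are likewise assumptions.

```python
import math
import random

# Each ellipse: (density, semi-axis a, semi-axis b, center x0, center y0,
# rotation phi in radians). Illustrative values only.
ELLIPSES = [
    (1.0, 0.9, 0.7, 0.0, 0.0, 0.0),          # dense outer region
    (-0.8, 0.8, 0.6, 0.0, 0.0, 0.0),         # interior, leaving a dense ring
    (0.3, 0.2, 0.1, 0.3, 0.1, math.pi / 6),  # small rotated feature
]


def phantom_value(x, y, ellipses=ELLIPSES):
    """Density of the phantom at the point (x, y)."""
    total = 0.0
    for rho, a, b, x0, y0, phi in ellipses:
        dx, dy = x - x0, y - y0
        u = dx * math.cos(phi) + dy * math.sin(phi)
        v = -dx * math.sin(phi) + dy * math.cos(phi)
        if (u / a) ** 2 + (v / b) ** 2 <= 1.0:
            total += rho
    return total


def exact_projection(theta, t, ellipses=ELLIPSES):
    """Exact line integral along the ray x*cos(theta) + y*sin(theta) = t."""
    total = 0.0
    for rho, a, b, x0, y0, phi in ellipses:
        # Squared half-width of the ellipse as seen from direction theta.
        width2 = ((a * math.cos(theta - phi)) ** 2
                  + (b * math.sin(theta - phi)) ** 2)
        tau = t - x0 * math.cos(theta) - y0 * math.sin(theta)
        if tau * tau < width2:
            # Density times the length of the chord cut by the ray.
            total += 2.0 * rho * a * b * math.sqrt(width2 - tau * tau) / width2
    return total


def sinogram(n_angles, n_offsets, noise_sd=0.0, rng=None):
    """Projection data on a grid of angles and offsets, optionally noisy."""
    rng = rng or random.Random()
    data = []
    for i in range(n_angles):
        theta = math.pi * i / n_angles
        row = []
        for j in range(n_offsets):
            t = -1.0 + 2.0 * j / (n_offsets - 1)
            value = exact_projection(theta, t)
            if noise_sd > 0.0:
                value += rng.gauss(0.0, noise_sd)
            row.append(value)
        data.append(row)
    return data


if __name__ == "__main__":
    clean = sinogram(4, 5)
    noisy = sinogram(4, 5, noise_sd=0.01, rng=random.Random(0))
    print(clean[0])
    print(noisy[0])
```

Because both the true image and the data are known exactly, any discrepancy in a reconstruction can be attributed to the algorithm or to the chosen noise model, not to an imperfect physical experiment.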
This chapter gives some indication, though certainly not an exhaustive account, of the long history of interaction between the mathematical and biological sciences. It also demonstrates, by example, the breadth of that interaction: Many areas of biology have been affected by many areas of mathematical science, and the challenges of biology have also prompted advances of importance to the mathematical sciences themselves. Sometimes the benefits of mathematical sciences research have been direct, and sometimes they have arisen in ways that were not predicted. As these examples show, the right mathematical approach can have a dramatic impact on whether or not a particular biological construct is feasible (for example, in the case of finding protein homologues or reconstructing CT scans), and developing the right mathematical representation of a phenomenon can enable very productive research (for example, in the study of populations or exploring signaling mechanisms).
REFERENCES
Aldrich, J. 1997. R.A. Fisher and the making of maximum likelihood 1912-1922. Statist. Sci. 12: 162-176.
Altschul, S.F., W. Gish, W. Miller, E.W. Myers, and D.J. Lipman. 1990. Basic local alignment search tool. J. Mol. Biol. 215(3): 403-410.
Cohen, J.E. 2004. Mathematics is biology’s next microscope, only better; biology is mathematics’ next physics, only better. PLoS Biol. 2(12): e439.
Dayhoff, M.O., R.V. Eck, M.A. Chang, and M.R. Sochard. 1965. Atlas of Protein Sequence and Structure. Silver Spring, Md.: National Biomedical Research Foundation.
Fisher, R.A. 1935. The Design of Experiments. Edinburgh: Oliver and Boyd.
Fisher, R.A. 1930. The Genetical Theory of Natural Selection. Oxford: Clarendon Press.
Fisher, R.A. 1925. Statistical Methods for Research Workers. Edinburgh: Oliver and Boyd.
Griffiths, R.C., and S. Tavaré. 1995. Unrooted genealogical tree probabilities in the infinitely-many-sites model. Math. Biosci. 127(1): 77-98.
Hodgkin, A.L., and A.F. Huxley. 1952. A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. 117: 500-544.
Hudson, R.R. 1983. Properties of a neutral allele model with intragenic recombination. Theor. Popul. Biol. 23(2): 183-201.
Karlin, S., and S.F. Altschul. 1990. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. U.S.A. 87(6): 2264-2268.
Kimura, M. 1968. Evolutionary rate at the molecular level. Nature 217(129): 624-626.
Kingman, J.F.C. 1982. On the genealogy of large populations. J. Appl. Prob. 19A: 27-43.
Krone, S.M., and C. Neuhauser. 1997. Ancestral processes with selection. Theor. Popul. Biol. 51(3): 210-237.
Kuhner, M.K., J. Yamato, and J. Felsenstein. 2000. Maximum likelihood estimation of recombination rates from population data. Genetics 156(3): 1393-1401.
National Research Council. 1996. Mathematics and Physics of Emerging Biomedical Imaging. Washington, D.C.: National Academy Press.
Needleman, S.B., and C.D. Wunsch. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48(3): 443-453.
Rall, W., and G.M. Shepherd. 1968. Theoretical reconstruction of field potentials and dendrodendritic synaptic interactions in olfactory bulb. J. Neurophysiol. 31: 884-915.
Shepp, L.A., and J.B. Kruskal. 1978. Computerized tomography: The new medical x-ray technology. Am. Math. Mon. XX: 420-439.
Shepp, L.A., and B.F. Logan. 1974. The Fourier reconstruction of a head section. IEEE Trans. Nucl. Sci. 21: 21-43.
Smith, T.F., and M.S. Waterman. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147(1): 195-197.
Tajima, F. 1983. Evolutionary relationships of DNA sequences in finite populations. Genetics 105: 437-460.
Yates, F., and K. Mather. 1963. Ronald Aylmer Fisher 1890-1962. Pp. 91-120 in Biographical Memoirs of Fellows of the Royal Society of London, Vol. 9. London, England: The Royal Society.
Zuckerkandl, E., and L. Pauling. 1965. Molecules as documents of evolutionary history. J. Theor. Biol. 8: 357-366.