From the earliest days of population genetics, mathematics has played an important role in the field. Until the 1960s, most population genetics theory focused on deductive analysis, and the models were generally focused on following the evolution of populations that were presumed to have originally been located in just one or two places. Early investigators showed how evolution would proceed under plausible models of genetic inheritance and natural selection. These analyses illuminated the dynamics of allele frequencies in populations, and they showed with what speed evolution could occur and how this speed depended on various parameters. Both deterministic models and models with random genetic drift were examined. Diffusion approximations to Markov chains were particularly important (Kimura, 1983). These diffusion processes could be analyzed by solving simple ordinary differential equations to obtain important quantities such as the probability of fixation of a new variant and the mean time for such fixation. These analyses strongly shaped our current understanding of natural selection in large, but finite, populations and guided experimental work. In recent years the emphasis has shifted from these deductive activities to inductive or retrospective approaches that address the question of what we can infer about evolutionary history and the nature of the evolutionary process from current patterns of genetic variation.
The primary goal of population and evolutionary geneticists today is to understand patterns of genetic variation within populations and pat-
terns of genetic divergence between species. Population geneticists have asked, What are the important forces that determine the amount and nature of genetic variation in populations, the spatial distribution of this variation, the distribution of variation across the genome, and the evolutionary changes that occur over short and long timescales? The process that has shaped this variation within and between species is a complex one involving a complex genome and a complex, spatially and temporally varying environment. It is certain that stochasticity is an important aspect of the process. The rapidly growing database of DNA polymorphism and divergence studies from a variety of organisms, including humans and other primates, provide an exciting opportunity to learn about the evolutionary history of populations and the evolutionary processes that have resulted in the patterns of variation that we observe in extant populations. The difficulty is that even very simple models of this process lead to challenging mathematical problems.
Some examples of current approaches and the mathematical challenges facing us are described here. To be concrete and to avoid an overly vague description of the problems, a very specific population genetic model of sequence evolution will be described. The particular model, the Wright-Fisher model, has a long and rich history, but it is not necessarily the most realistic or tractable for every purpose, and it is only one of many models that might have been considered here.
The Wright-Fisher model assumes discrete generations (as opposed to a model with distinct age classes and overlapping generations, which would be more realistic for some populations, including humans). The focus is on a particular segment of the genome, referred to as a gene, and it is first assumed that no recombination or mutation occurs. To begin, it is assumed as well that population size (N) is constant and that there is no spatial structure. A haploid model is also assumed, which means that each individual carries just one copy of the gene. (Humans are in fact diploid, which means that each individual carries two copies of each gene, a maternal and a paternal copy.)
In the Wright-Fisher model, successive generations are produced as follows. Each of the N individuals of the offspring generation is produced by replicating, without error, the gene sequence of a randomly drawn individual of the parental generation. Each offspring individual is assumed to be generated independently from the parental population in this manner. The number of offspring of any particular individual of the parent generation is thus a random variable, being the result of N independent Bernoulli trials, with probability of success equal to 1/N. In large populations, the number of offspring of any individual would approximately follow a Poisson distribution with mean 1. If it is supposed that the parents do not all have identical gene sequences, then their distinct
gene sequences are known as haplotypes. Given the frequencies of the different haplotypes in the parental generation, the numbers of the different haplotypes in the offspring generation will be multinomially distributed. Regardless of how much variation existed in the founding population, the population under this model will eventually arrive at a state in which every individual carries the same sequence. This process of random change in the frequencies of the different haplotypes is referred to as genetic drift, and it eventually results in the population becoming monomorphic.
Next, mutation is introduced into the model. Let it be supposed that the replication process that generates an offspring copy of the gene from its parent has some error rate, so that each offspring differs from its parent at a Poisson-distributed number of sites in the gene sequence. If this model is run for many generations, the pattern of genetic variation asymptotically approaches a stationary distribution resulting from a stochastic balance between mutation, which generates variation in the population, and genetic drift, which tends to eliminate variation. Many properties of this stationary distribution are known. Also, many properties of samples drawn from this stationary distribution are known. In the model as it has been defined here, all individuals are in some sense equivalent. For example, all individuals have the same distribution of offspring number with expectation equal to 1. All the genetic variation is said to be selectively neutral, and the model is referred to as a neutral model. In generalizations of this model, some sequence variants may have a systematic tendency to produce more offspring than others, and the frequency of such variants will tend to increase. These are models of evolution with natural selection.
The Wright-Fisher neutral model is a particular case of a more general class of neutral models in which all parents are equivalent; these are referred to as “exchangeable models.” In these models, the distribution of offspring number need not be Poissonian. In the limit of large populations and a low mutation rate, the models’ stationary properties depend on a single compound parameter, Nu/v, where u is the mutation rate and v is the variance of offspring number. Despite the simplicity of this model, in which there is no selection, no geographic structure, no variation in population size, and no recombination, the probabilities of sample configurations of sequences under this model are difficult to calculate. Strobeck (1983) first described recursions for these probabilities for the case where only two or three haplotypes are present in a sample. The difficulty of obtaining sample configuration probabilities led to the use of summaries of the data, with an inevitable loss of information. Only in the last 10 years have full likelihood approaches been developed. Griffiths and Tavaré (1994a, 1995) were the first to find a practical method to esti-
mate full likelihoods for this simple neutral model using a method based on importance sampling. Kuhner et al. (1995) described a Markov chain Monte Carlo (MCMC) method for obtaining quantities proportional to the sample probabilities. The main point here is that the sampling properties of sequence data under even this simplest, one-parameter, neutral model lead to recursions that are not analytically tractable. Monte Carlo methods have provided a way forward. Much of the progress in understanding these models is based on analyzing properties of the genealogical relationships of sampled copies rather than analyzing the dynamics of population frequencies of haplotypes. The population genetics theory of sample genealogies has come to be known as calescent theory. The early important work in this area was done by Watterson (1975), Kingman (1982), and Tajima (1983). (See Chapter 2 for more information on this.)
For models without recombination, it is possible to extend these Monte Carlo methods to the case in which population size is not held constant. For some special cases, it is possible to infer past population size changes (Griffiths and Tavaré, 1994b; Kuhner et al., 1995). Additional Monte Carlo methods for demographic inference using other types of genetic data (microsatellite data, for example) have also been developed (Beaumont, 1999). Much remains to be done in this area.
Models with geographic structure are more difficult. Historically, simple “island” models have been employed. These subdivide the population, but with a special structure in which each subdivision is assumed to communicate equally with all other subdivisions. More realistic stepping-stone models are more difficult, but some results are known (Durrett, 2002). Wakeley (2004) recently investigated a class of models with large numbers of subdivisions and obtained elegant results for this model. This work has capitalized on results for coalescent processes that operate on different timescales. Bayesian Monte Carlo methods have again begun to play an important role in analyzing data of several types (Pritchard et al., 2000; Beaumont, 1999).
If recombination is added to the model, the difficulties increase enormously. With recombination, each offspring produced in the model has some small probability, r, of being the product of two parent individuals, one parent contributing a part of the gene on the left and the other parent contributing the rest of the gene, the boundary between the two contributions being random. In models with recombination, complex statistical dependencies between sites arise. Sample configuration probabilities are very difficult to obtain. For a model with just two sites between which recombination can occur, a relatively simple recursion can be written down for sample probabilities (Golding, 1984). These recursions are intractable for all but very small samples, as the state space becomes enormous. For more than two sites the situation quickly gets much worse.
Griffiths and Marjoram (1996) present recursions for sample configuration probabilities under the infinite-sites version of this model with recombination. These are not analytically tractable, but Monte Carlo methods have been described for estimating these sampling probabilities under this model. However, unlike the case without recombination, it appears that these Monte Carlo methods are computationally infeasible for samples of interesting size because convergence, while difficult to assess, appears to take inordinate amounts of computer time. As a consequence, approximate methods, ad hoc methods, and methods based on summary statistics are still the rule when analyzing data from genes with recombination (Stephens, 2001). Much interest has focused on making inferences about recombination rates and gene conversion rates under models in which the rates vary across the genome (McVean et al., 2004). Improved methods could contribute to understanding the genetic mechanisms of recombination and also help in the mapping of disease genes via association studies.
There is also great interest in assessing the importance of natural selection in shaping patterns of variation within populations and the divergence between populations. Many ad hoc tests have been developed over the years (Kreitman, 2000; Bustamante et al., 2003) to explore these questions. Devising methods that make more efficient use of the data and combine information from many loci should be a priority. These inferences about selection must be made in the context of realistic models of population structure and demographic history. More realistic models of mutation and recombination are also needed, building on results shown in recent work such as (Hwang and Green, 2004; Meunier and Duret, 2004). This recent work points out how much there is still to learn about molecular evolution and how rich the mathematical models will need to be to capture it.
The focus thus far in this chapter has been on variation within species, but comparisons of sequences from different species can also be very informative, both about evolutionary relationships of species (the phylogenetic inference problem) and about evolutionary processes. Again, recent years have seen important progress on likelihood methods (Felsenstein, 1981) and, most recently, on the use of Monte Carlo approaches (Huelsenbeck and Ronquist, 2001; Wong et al., 2004). Combining between-and within-species data can be very useful, as is well exemplified by the analysis of Poisson random field models (Bustamante et al., 2003). A remaining challenge in phylogenetic inference includes the problem of performing multiple alignments and phylogenetic reconstruction simultaneously. Currently, alignment is carried out while ignoring phylogenetic relationships, and phylogenetic reconstruction is carried out only with
fully aligned sequences produced prior to the reconstruction. This clearly is not an optimal solution.
Genetic data are becoming available at an ever-increasing rate. More loci, more species, and more individuals within species will be surveyed. More and more frequently, essentially complete genomes will be compared. These advances result in new opportunities and new mathematical and computational challenges. Different biological questions, different organisms, and different types of genetic markers will require somewhat different models, different methods, and different approximations. With more and richer data sets, researchers will be able to fruitfully consider somewhat more complex models, with increased demographic complexity (bottlenecks and expansions, more complex spatial structure) and increased genetic complexity (heterogeneous recombination and mutation rates), and with more complex types of natural selection (interaction between sites and spatial and temporal variation in selection coefficients). These complexities present significant mathematical challenges. Stochastic models, possibly with many parameters, and complex, nonindependent data make the computational difficulties substantial. Insight about the models and mathematical skill will be needed to make progress. While a wide range of approaches will clearly contribute to advances, the important roles that Monte Carlo methods and Bayesian approaches have recently played seem likely to continue.
ECOLOGICAL ASPECTS OF POPULATIONS
Population growth with density dependence was formulated mathematically by Verhulst (1838), who developed a number of models to investigate the consequences of deviations from unrestricted growth. One of the models, known as the logistic equation, remains a standard model for population growth. It is described by the following differential equation:
where N(t) is the population size at time t, r is the intrinsic rate of growth, and K is the carrying capacity. This model continues to serve as an important illustration of the effects of negative density dependence, as encoded by the term in parentheses. It is an example of a phenomenological model, one that is intended to embody, in a concise form, some of the observed behaviors of populations, but it has also been used as a predictive model: Based on data on the U.S. population from 1790 to 1910, Pearl and Reed
(1920) fitted the logistic equation to the observed growth of the U.S. population, estimating that it would level off at 197 million.
A discrete-time version of the logistic growth equation exhibits surprising properties, from periodic behavior to chaos. This latter behavior was introduced to ecology by May (1974, 1976). Although a known phenomenon in mathematics, the idea that deterministic models can exhibit unpredictable behavior was new to ecologists at that time and still spawns new research in ecology (Cushing et al., 2002). Although difficult to verify in nature, experimental systems have been developed to test for chaos (Costantino et al., 1997). The theoretical study of single populations in a purely ecological context remains challenging. Much research is being devoted to anthropogenic impacts on natural and managed systems. Mathematical challenges include modeling and analysis of spatial aspects, temporal and spatial variation, demographic stochasticity (in particular when dealing with small populations), and nonequilibrium dynamics. The committee highlights two areas of interest: species extinction and food supply.
The threat of species extinction arising from either habitat fragmentation or species invasion has resulted in much theoretical work. The need here is for both conceptual models, which will give us a better understanding of the underlying mechanisms, and predictive models, which can be used for management. Much of the theoretical work focuses on single-species models. On the empirical side, observational studies dominate, and data are not always unequivocal (Debinski and Holt, 2000). There are few controlled experiments of habitat fragmentation and species invasions because of the difficulties associated with these experiments.
Reliable food supply depends on the ability to manage this renewable resource. Fisheries management is an example that has enjoyed sophisticated modeling by both economists and biologists. Conceptual and highly species-specific models are widespread, but many focus only on single populations. Large, spatially explicit data sets exist, although there are troublesome uncertainties associated with some of the data. More realistic models also must deal with the uncertainties associated with management strategies and inherently stochastic processes, such as birth and death. They also need to take food web structure into account. A comprehensive model that includes visualization tools was developed by James Kitchell and co-workers for the Central North Pacific ecosystem to assess the effects of fishing on productivity (Hinke et al., 2004). Visualizations are useful to convey the impact of different management scenarios to managers. The development of such complex models requires a thorough understanding of the ecological interactions in addition to long-term data to parameterize the models and test scenarios.
These models are typically so complex that simulation is the only tool currently available for investigation.
In the two cases discussed here, the need is for models that incorporate the complex interactions of the target species with its surroundings, often in spatially heterogeneous habitats and under nonequilibrium conditions. Models that are used for management purposes often include social and economic aspects. The importance of developing methods to study nonequilibrium dynamics cannot be overemphasized (see, for example, Hastings, 2004). Some of the mathematical theory has been developed, in particular when different timescales are involved. Many ecological interactions occur far away from equilibrium, and large-scale anthropogenic perturbations, such as land use change, species invasions, or alterations of nutrient or carbon cycles, exacerbate this problem. Land use change and the invasion of exotic species often result in rapid changes and thus have the potential to move a system far from equilibrium, with consequences for both ecological and evolutionary processes (see below). Even experimental systems are probably not in equilibrium: Most field experiments are studied over only short timescales, even if the dynamics are slow.
A SYNTHESIS OF ECOLOGY AND EVOLUTION
During ecology’s early years, evolutionary thinking was prevalent. However, as ecology focused more and more on abiotic and biotic causes of diversity and species abundance, evolutionary thinking became less prominent (Collins, 1986). Much of ecology now operates under the premise that ecological and evolutionary processes act on different timescales. Evolutionary processes are often thought to take hundreds of generations before their effects can be measured, whereas ecological processes often show effects after a few generations. This has led to the intellectual separation of ecology and evolution. For systems that are under strong selection, however, this may not be the case. There are classic examples, such as melanism among moths as a response to air pollution (Kettlewell, 1955) or the heavy-metal tolerance of plants (Bradshaw, 1952). As a consequence, purely ecological or purely genetic models are often inadequate when strong selection is acting (Neuhauser et al., 2003). An increasing number of studies are combining ecological and evolutionary models to meet this challenge of understanding the consequences of ecological and evolutionary forces acting on similar timescales (Antonovics, 1992; Thompson, 1999; Whitman et al., 2003).
The mathematical challenges when both ecological and evolutionary processes are considered simultaneously are numerous. First, the dimen-
sionality increases because additional parameters must be introduced to model both ecological and evolutionary processes. Second, these processes are frequently both spatial and stochastic. The study of spatial stochastic systems is an active area of mathematical research, but at this point, only the simplest models seem to be tractable in a rigorous way. Third, the interplay between ecological and evolutionary processes is most pronounced during transient dynamics. Since there are no readily available analytical methods for nonequilibrium processes that are spatial and stochastic, most studies resort to simulations as the primary way to gain insights. The following example illustrates a biological problem and the mathematical challenges it brings.
As an example of why one might wish to couple ecological and evolutionary processes, and of the mathematical challenges that result, consider the evolution of resistance. This is of importance, for instance, in understanding the ramifications of the use of transgenic Bt crops, which have been engineered to express a toxin from the soil bacterium Bacillus thuringiensis (Bt). Engineered versions are available for a number of crop plants, such as maize, potatoes, cotton, and soybeans. In maize, the toxin is expressed at high levels and is toxic to the European corn borer, Ostrinia nubilalis (Hübner), the key herbivore insect pest. An important concern is the pest’s development of resistance to the toxin (Tabashnik, 1994; Gould, 1998). Current practice is to plant a “refuge” of non-Bt maize alongside Bt maize to allow sufficient numbers of susceptible European corn borers to be available as mates if resistant types emerge from the Bt field. To model the evolution of resistance, two things are needed: nonequilibrium models for at least two types of patches (Bt field and non-Bt field) and the ability to study the time-varying genetic composition of European corn borer populations throughout these modeled patches. One of the first models that incorporated these aspects was developed by Comins (1977). Since then, other models have been developed, and each seems to reveal additional complexities. A consistent theory is still lacking, and it will need to also take into account the community dynamics of associated enemies of the insect pest (see, for example, Neuhauser et al., 2003).
The field of population biology focuses largely on single populations. Except under controlled experimental situations, populations rarely live in isolation. Populations are typically embedded in communities, and their dynamics are strongly influenced by other members of the community. These feedbacks greatly complicate our understanding of the dynamics and present great challenges. The next chapter will discuss communities.
Antonovics, J. 1992. Toward community genetics. Pp. 426-449 in Plant Resistance to Herbivores and Pathogens: Ecology, Evolution, and Genetics. R.S. Frite and E.L. Simms, eds. Chicago, Ill.: University of Chicago Press.
Beaumont, M. 1999. Detecting population expansion and decline using microsatellites. Genetics 153: 2013-2029.
Bradshaw, A.D. 1952. Populations of Agrostis tenuis resistant to lead and zinc poisoning. Nature 169: 1098-1099.
Bustamante, C.D., R. Nielsen, and D.L. Hartl. 2003. Maximum likelihood and Bayesian methods for estimating the distribution of selective effects among classes of mutations using DNA polymorphism data. Theor. Popul. Biol. 63(2): 91-103.
Collins, J.P. 1986. Evolutionary ecology and the use of natural selection in ecological theory. J. Hist. Biol. 19: 257-288.
Comins, H.N. 1977. The development of insecticide resistance in the presence of migration. J. Theor. Biol. 64: 177-179.
Costantino, R.F., R.A. Desharnais, J.M. Cushing, and B. Dennis. 1997. Chaotic dynamics in an insect population. Science 275: 389-391.
Cushing, J.M., R.F. Constantino, B. Dennis, R.A. Desharnais, and S.M. Henson. 2002. Chaos in Ecology: Experimental Nonlinear Dynamics. San Diego, Calif.: Academic Press.
Debinski, D.M., and R.D. Holt. 2000. A survey and overview of habitat fragmentation experiments. Conserv. Biol. 14: 342-355.
Durrett, R. 2002. Probability Models for DNA Sequence Evolution. New York, N.Y.: Springer-Verlag.
Felsenstein, J. 1981. Evolutionary trees from DNA sequences: A maximum likelihood approach. J. Mol. Evol. 17(6): 368-376.
Golding, G.B. 1984. The sampling distribution of linkage disequilibrium. Genetics 108: 257-274.
Gould, F. 1998. Sustainability of transgenic insecticidal cultivars: Integrating pest genetic and ecology. Annu. Rev. Entomol. 43: 701-726.
Griffiths, R.C., and S. Tavaré. 1994a. Simulating probability distributions in the coalescent. Theor. Popul. Biol. 46: 131-159.
Griffiths, R.C., and S. Tavaré. 1994b. Sampling theory for neutral alleles in a varying environment. Phil. Trans. Roy. Soc. Lond. B Biol. Sci. 344(1310): 403-410.
Griffiths, R.C., and S. Tavaré. 1995. Unrooted genealogical tree probabilities in the infinitely-many-sites model. Math. Biosci. 127(1): 77-98.
Griffiths, R.C., and P. Marjoram. 1996. Ancestral inference from samples of DNA sequences with recombination. J. Comput. Biol. 3(4): 479-502.
Hastings, A. 2004. Transients: The key to long-term ecological understanding? Trends Ecol. Evol. 19: 39-45.
Hinke, J.T., I.C. Kaplan, K. Aydin, G.M. Watters, R.J. Olson, and J.F Kitchell. 2004. Visualizing the food-web effects of fishing for tunas in the Pacific Ocean. Ecol. Soc. 9: 10.
Huelsenbeck, J.P., and F. Ronquist. 2001. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics 17(8): 754-755.
Hwang, D.G., and P. Green. 2004. Bayesian Markov chain Monte Carlo sequence analysis reveals varying neutral substitution patterns in mammalian evolution. Proc. Natl. Acad. Sci. U.S.A. 1011(39): 13994-14001.
Kettlewell, H.B.D. 1955. Selection experiments on industrial melanism in the Lepidoptera. Heredity 9: 323-342.
Kimura, M. 1983. The Neutral Theory of Molecular Evolution. New York, N.Y.: Cambridge University Press.
Kingman, J.F.C. 1982. On the genealogy of large populations. J. App. Prob. 19A: 27-43.
Kreitman, M. 2000. Methods to detect selection in populations with applications to the human. Annu. Rev. Genomics Hum. Genet. 1: 539-559.
Kuhner, M.K., J. Yamato, and J. Felsenstein. 1995. Estimating effective population size and mutation rate from sequence data using Metropolis-Hastings sampling. Genetics 140(4): 1421-1430.
May, R.M. 1974. Biological populations with non-overlapping generations: Stable points, stable cycles and chaos. Science 186: 645-647.
May, R.M. 1976. Simple mathematical models with very complicated dynamics. Nature 262: 495-467.
McVean, G.A., S.R. Myers, S. Hunt, P. Deloukas, D.R. Bentley, and P. Donnelly. 2004. The fine-scale structure of recombination rate variation in the human genome. Science 304: 58104.
Meunier, J., and L. Duret. 2004. Recombination drives the evolution of GC-content in the human genome. Mol. Biol. Evol. 21: 984-990.
Neuhauser, C., D.A. Andow, G.E. Heimpel, G. May, R.G. Shaw, and S. Wagenius. 2003. Community genetics: Expanding the synthesis of ecology and genetics. Ecology 84: 545-558.
Pearl, R., and L. J. Reed. 1920. On the rate of growth of the population of the United States since 1870 and its mathematical representation. Proc. Natl. Acad. Sci. U.S.A. 6: 275-288.
Pritchard, J.K., M. Stephens, and P. Donnelly. 2000. Inference of population structure using multilocus genotype data. Genetics 155(2): 945-959.
Stephens, M. 2001. Inference under the coalescent. Pp. 213-238 in Handbook of Statistical Genetics. D.J. Balding, M. Bishop, and C. Cannings, eds. New York, N.Y.: John Wiley & Sons Ltd.
Strobeck, C. 1983. Estimation of the neutral mutation rate in a finite population from DNA sequence data. Theor. Popul. Biol. 24(2): 160-172.
Tabashnik, B.E. 1994. Evolution of resistance to Bacillus thuringiensis. Annu. Rev. Entomol. 39: 47-79.
Tajima, F. 1983. Evolutionary relationships of DNA sequences in finite populations. Genetics 105: 437-460.
Thompson, J.N. 1999. Specific hypotheses on the geographic mosaic of coevolution. Am. Nat. 153S: 1-14.
Verhulst, P.F. 1838. Notice sur la loi que la population suit dans son accroissement. Corresp. Math. Phys. 10: 113-121.
Wakeley, J. 2004. Recent trends in population genetics: More data! More math! Simple models? J. Heredity 95: 397-405.
Watterson, G.A. 1975. On the number of segregating sites in genetic models without recombination. Theor. Popul. Biol. 7: 256-276.
Whitham, T.G., W. Young, G.D. Martinsen, C.A. Gehring, J.A. Schweitzer, S.M. Shuster, G.M. Wimp, D.G. Fischer, J.K. Bailey, R.L. Lindroth, S. Woolbright, and C.R. Kuske. 2003. Community genetics: A consequence of the extended phenotype. Ecology 84: 559-573.
Wong, W.S., Z. Yang, N. Goldman, and R. Nielsen. 2004. Accuracy and power of statistical methods for detecting adaptive evolution in protein coding sequences and for identifying positively selected sites. Genetics 168(2): 1041-1051.