Page 114

Calibrating the Clock:

Using Stochastic Processes to Measure the Rate of Evolution

Deoxyribonucleic acid (DNA) sequences record the history of life. Although DNA replication is remarkably accurate, mutations do occur at a small but nonnegligible rate, with the result that an individual's descendants begin to diverge in DNA sequence over time. By examining DNA sequences among different species or among different individuals within a single species, it is possible to reconstruct aspects of their evolutionary history. Such studies have been pursued with special interest in the human, where an unusual DNA sequence called the mitochondrial genome has been used to trace human migrations and human evolution. The author shows how mathematical tools from the theory of stochastic processes assist in calibrating the molecular clock inherent in DNA sequences. |

While DNA sequences are transmitted from parent to child with remarkable fidelity, mutations do occur at a small but nonnegligible rate, with the result that an individual's descendants begin to diverge in DNA sequence over time. Some mutations are deleterious and are eliminated by natural selection, but many are thought to be selectively neutral and thus accumulate at a roughly steady rate—providing a molecular clock for measuring the time since two species or two individuals within a species shared a common ancestor. In this manner, it is possible to reconstruct an evolutionary tree and even estimate the times of key separation events.

Below are the first 10 and last 10 pages of uncorrected machine-read text (when available) of this chapter, followed by the top 30 algorithmically extracted key phrases from the chapter as a whole.

Intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text on the opening pages of each chapter.
Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.

Do not use for reproduction, copying, pasting, or reading; exclusively for search engines.

OCR for page 114

Page 114
Chapter 5— Calibrating the Clock: Using Stochastic Processes to Measure the Rate of Evolution Simon Tavaré University of Southern California
Deoxyribonucleic acid (DNA) sequences record the history of life. Although DNA replication is remarkably accurate, mutations do occur at a small but nonnegligible rate, with the result that an individual's descendants begin to diverge in DNA sequence over time. By examining DNA sequences among different species or among different individuals within a single species, it is possible to reconstruct aspects of their evolutionary history. Such studies have been pursued with special interest in the human, where an unusual DNA sequence called the mitochondrial genome has been used to trace human migrations and human evolution. The author shows how mathematical tools from the theory of stochastic processes assist in calibrating the molecular clock inherent in DNA sequences.
While DNA sequences are transmitted from parent to child with remarkable fidelity, mutations do occur at a small but nonnegligible rate, with the result that an individual's descendants begin to diverge in DNA sequence over time. Some mutations are deleterious and are eliminated by natural selection, but many are thought to be selectively neutral and thus accumulate at a roughly steady rate—providing a molecular clock for measuring the time since two species or two individuals within a species shared a common ancestor. In this manner, it is possible to reconstruct an evolutionary tree and even estimate the times of key separation events.

OCR for page 114

Page 115
Different biological sequences within an organism may obey different clocks. The amino acid sequence of a protein encoded by a gene changes more slowly than the DNA sequence of the underlying gene because many amino acid changes may be selectively disadvantageous (because they disrupt function). On the other hand, a significant proportion of DNA changes may be selectively neutral because they create a synonymous codon (that is, one that specifies the same amino acid). Similarly, DNA regions within genes change at a slower rate than the DNA sequences located between genes. Accordingly, evolutionary studies of distant species are often carried out by examining amino acid sequences of proteins, while evolutionary comparisons among more closely related species are better done by examining DNA sequences within or between genes.
To study evolution within a single species such as the human, it is often useful to study DNA sequences that change at especially rapid rates. The mitochondrial genome provides an ideal substrate for such studies. The mitochondrion is an organelle found in the cytoplasm of eukaryotic cells, whose primary role is to generate high-energy compounds that the cell uses to drive chemical reactions. Although the mitochondria use many proteins that are encoded by genes in the cell nucleus, each mitochondrion has its own small circular chromosome that encodes a few dozen genes essential for its function.
In the human, the mitochondrial genome consists of 16,569 base pairs whose DNA sequence has been completely determined (Anderson et al., 1981). Human mitochondria are inherited only from the mother, and so their genealogy is considerably simpler to follow than for genes encoded in the nucleus (which are inherited from both parents and are subject to recombination between the two homologous copies in the cell). Conveniently for evolutionary studies, mitochondrial DNA (mtDNA) has an increased rate of nucleotide substitution compared to nuclear genes, owing to the presumed absence of certain DNA repair mechanisms. Moreover, the mitochondrial genome contains certain regions that are particularly tolerant of mutation, that is, appear to be subject to little selective pressure (Avise, 1986) and thus show a great deal of variation. In all, the mitochondrial genome may be evolving 10 times faster than the nuclear genome.
For these reasons, molecular population geneticists have carried out many studies of the DNA sequences of mitochondrial variable regions in many human populations (Di Rienzo and Wilson, 1991; Horai and

OCR for page 114

Page 116
Hayasaka, 1990; Vigilant et al., 1989, 1991; Ward et al., 1991). Studies of mitochondrial sequences of different Native American tribes strongly suggest that there were multiple waves of colonization of North America by migrant groups from Asia, and even allow one to estimate the dates of these events (Schurr et al., 1990; Ward et al., 1991). Assuming a constant evolutionary rate, the pattern of mutations between diverse human groups has been used to argue (Cann et al., 1987) that the mitochondria of all living humans descended from a mother that lived in Africa some 200,000 years ago—the so-called Eve hypothesis. Although the precise details of the hypothesis are disputed (Maddison, 1991; Nei, 1992; Templeton, 1992), the general power of the methodology is well accepted. (As an aside, the reader should note that the existence of a common ancestor—Eve, so to speak—is a mathematical necessity in any branching process that satisfies very weak conditions. The biological controversies pertain to when and where Eve lived.)
Each of these applications requires a knowledge of the rate at which mutations occur in an mtDNA sequence. Estimates of this rate have been obtained by comparing a single DNA sequence from each of several species whose times of divergence are presumed known. Divergence is calculated from the number of nucleotide differences between species (using methods that correct for the possibility of multiple mutations at a site), and rate estimates are obtained by dividing the amount of sequence divergence by the divergence time. For data taken from multiple individuals in a single population, one requires a model that takes account of the population genetic aspect of the sampling: individuals in the sample are correlated by their common ancestry. In this chapter, we describe the underlying stochastic structure of this ancestry and use the results to estimate substitution rates.
We have chosen to focus on rate estimation to give the chapter a single theme. We are not interested per se in statistical aspects of tests for selective neutrality of DNA differences; rather, we assume neutrality for the data sets discussed as examples. The techniques described here should be regarded as illustrative of the theoretical and practical problems that arise in sequence analysis of samples from closely related individuals. The emphasis is on exploratory methods that might be used to summarize the structure of such samples.

OCR for page 114

Page 117
Overview To illustrate the methods, we use a set of North American Indian mitochondrial sequences described in Ward et al. (1991). These authors sequenced the first 360 base pairs of the mitochondrial control region for a sample of 63 Nuu-Chah-Nulth (Nootka) Indians from Vancouver Island. The sample comprises individuals who were maternally unrelated for four generations, chosen from 13 of the 14 tribal bands. As a consequence the sample deviates from a truly random sample, although it will be treated as such for the purposes of this chapter. An important parameter in the analysis is the effective population size of the group. This is approximated by the number of reproducing females, giving a value of about 600 for the long-term effective population size N.
The most common DNA changes seen in mitochondria are transitions (changes from one pyrimidine base to the other or one purine base to the other, that is, C « T or A « G) rather than transversions (changes from a pyrimidine to a purine or vice versa). Indeed, the sequenced region shows no transversions, so that each site in the sequences has one of just two possible nucleotides. We focus on the pyrimidine (C or T) sites in the region. There are 201 such sites, in which 21 variable (or segregating) sites define 24 distinct sequences (called alleles or lineages). The details of the data, including the allele frequencies, are given in Table 5.1.
The parameter of particular interest here is q, the population geneticist's stock in trade. The variable q is a measure of the mutation rate in the region, and it figures in many important theoretical formulas in population genetics. For mitochondrial data, it is defined by
q = 2Nu,
where N is the effective population size referred to earlier, and u is the mutation rate per gene per generation. Once q is estimated, we can estimate u if N is known or N if u is known. In what follows, we estimate the compound parameter q rather than its components.
In the section immediately following, we begin by outlining the structure of the coalescent, a robust description of the genealogy of samples taken from large populations. The effects of mutation are superimposed on this genealogy in several ways. The classical case, which

OCR for page 114

Page 118
Table 5.1 Nucleotide Position in Control Region
1
1
1
1
1
2
2
2
2
2
2
2
2
3
3
3
3
3
Allele Frequencies
Position
6
8
9
2
4
6
6
9
0
1
3
4
5
6
7
7
0
0
0
1
3
9
8
1
4
9
2
6
4
0
9
3
7
5
7
1
5
1
2
4
9
9
Site
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
ID ref
T
C
C
C
T
C
T
T
C
C
C
C
C
C
C
T
T
T
C
T
T
1
·
·
·
·
·
·
C
·
T
·
·
·
T
·
·
·
·
·
·
·
·
2&3
·
·
·
·
·
·
·
·
T
·
·
·
T
·
·
·
·
·
·
·
·
3
4
·
·
·
·
·
·
·
·
T
·
·
·
T
·
·
·
·
·
·
·
C
5
·
T
·
·
·
T
·
·
T
·
·
·
·
T
·
·
·
·
·
·
C
3
6
·
T
·
·
·
·
·
·
T
·
·
·
·
·
·
·
·
·
·
·
C
2
7
C
T
·
·
·
·
·
·
T
·
·
T
·
·
·
·
·
·
·
·
C
1
8,10&11
·
T
·
·
·
·
·
·
T
·
·
·
·
T
·
·
·
·
·
·
C
8
9
C
T
·
·
·
·
·
·
T
·
·
·
·
T
·
·
·
·
·
·
C
2
12&13
·
T
·
·
·
·
·
·
·
·
·
·
·
T
·
·
·
·
·
·
C
10
14
·
T
·
·
·
·
·
·
T
·
·
·
T
T
·
·
·
·
·
·
C
1
15
·
T
·
·
·
·
·
·
T
·
·
·
T
T
·
·
C
·
·
·
C
2
16
·
·
·
·
·
·
·
·
T
T
·
·
·
·
·
·
·
·
T
·
C
1
17
·
·
·
T
·
·
·
·
T
·
·
·
·
·
·
·
C
·
·
·
C
1
18
·
·
·
T
·
·
·
·
T
·
·
·
·
·
·
·
·
C
·
·
C
2
19
·
·
T
·
C
·
·
·
T
·
·
·
·
·
T
·
·
C
·
·
C
1
20
·
·
·
·
·
·
·
·
T
·
·
·
·
·
·
·
·
C
·
·
C
3
21
·
·
·
·
·
·
·
·
T
·
·
·
·
·
·
·
·
C
·
·
C
3
22
C
·
·
·
·
·
·
·
T
·
·
·
·
·
·
·
·
C
·
·
·
3
23
·
·
·
·
·
·
·
·
T
T
·
·
·
·
·
C
·
C
·
·
·
1
24
·
·
·
·
·
·
·
·
T
·
·
·
·
·
·
C
·
C
T
·
·
7
25
·
·
·
·
·
·
·
·
T
T
·
·
·
·
·
C
·
C
T
C
·
3
26
·
·
·
·
·
·
·
·
·
T
·
·
·
·
·
C
·
C
T
C
·
1
27
·
·
·
·
·
·
C
C
·
·
·
·
·
·
·
·
·
·
·
·
·
1
28
·
·
·
·
·
·
C
C
·
·
T
·
·
·
·
·
·
·
·
·
·
1
NOTE: These mitochondrial data from Ward et al. (1991, Figure 1) are the variable pyrimidine positions in the control region. Position 69 corresponds to position 16,092 in the human reference sequence published by Anderson et al. (1981). The ID numbers correspond to those given in Ward et al. (1991, Figure 1).

OCR for page 114

Page 119
records the allelic partition of the sample, leads to the sampling theory of the infinitely-many-alleles model initiated by Ewens (1972). The Ewens sampling formula is then described, followed by a brief digression into the simulation structure of mutations in the coalescent, both in top-down and bottom-up form. Next, the infinitely-many-sites model is introduced as a simple description of the detailed structure of the segregating sites in the sample. Finally, we return to classical population genetics theory, albeit from a coalescent point of view, to discuss the structure of K-allele models. This in turn develops into the study of the finitely-many-sites models, which play a crucial role in the study of sequence variability when back substitutions are prevalent.
In the next section we digress to present a mathematical vignette in the area of random combinatorial structures. The Ewens sampling formula was derived as a means to analyze allozyme frequency data that became prevalent in the late 1960s. Current population genetic data is more sequence oriented and requires more detailed models for its analysis. Nonetheless, the combinatorial structure of the Ewens sampling formula has recently emerged as a useful approximation to the component counting process of a wide range of combinatorial objects, among them random permutations, random mapping functions, and factorization of polynomials over a finite field. We show how a result of central importance in the development of statistical inference for molecular data has a new lease on life in an area of discrete mathematics.
The final section briefly discusses some of the outstanding problems in the area, with particular emphasis on likelihood methods for coalescent processes. Some aspects of the mathematical theory, for example, measure-valued diffusions, are also mentioned, together with applications to other, more complicated, genetic mechanisms.
The Coalescent and Mutation The genealogy of a sample of n genes (that is, stretches of DNA sequence) drawn at random from a large population of approximately constant size may be described in terms of independent exponential random variables Tn,Tn-1,. . .,T2 as follows. The time Tn during which the sample has n distinct ancestors has an exponential distribution with parameter n(n - 1)/2, at which time two of the lines are chosen at random to coalesce,

OCR for page 114

Page 120
giving the sample n - 1 distinct ancestors. The time Tn-1 during which the sample has n - 1 such ancestors is exponentially distributed with parameter (n - l)(n - 2) / 2, at which point two more ancestors are chosen at random to coalesce. This process of coalescing continues until the sample has two distinct ancestors. From that point, it takes an exponential amount of time T2 with parameter 1, to trace back to the sample's common ancestor. For our purposes, the time scale is measured in units of N generations, where N is the (effective) size of the population from which the sample was drawn. This structure, made explicit by Kingman (1982a,b), arises as an approximation for large N to many models of reproduction, among them the Wright-Fisher and Moran models. A sample path of a coalescent with n = 5 is shown in Figure 5.1.
Figure 5.1 Sample path of the coalescent for n = 5. Tj denotes the time during which the sample has j distinct ancestors. Tj has an exponential distribution with mean 2/j(j - 1).
From the description of the genealogy, it is clear that the time tn back to the common ancestor has mean

OCR for page 114

Page 121
or approximately 2N generations for large sample sizes. Further aspects of the structure of the ancestral process may be found in Tavaré (1984). Rather than focus further on such issues, we describe how the genealogy may be used to study the genetic composition of the sample.
To this end, assume that in the population from which the sample was drawn there is a probability u that any gene mutates in a given generation, mutation acting independently for different individuals. In looking back r generations through the ancestry of a randomly chosen gene, the number of mutations along that line is a binomial random variable with parameters r and u. If we measure time in units of N generations, so that r = [Nt] (that is, r is Nt rounded down to the next lower integer), and assume that 2Nu® qas N ® ¥, then the Poisson approximation to the binomial distribution shows that the number of mutations in time t has in the limit a Poisson distribution with mean q t / 2. This argument can be extended to show that the mutations that arise on different branches of the coalescent tree follow independent Poisson processes, each of rate q / 2. For example, the total number of mutations µn that occur in the history of our sample back to its common ancestor has a mixed Poisson distribution—given Tn, Tn-1,. . .,T2, µn has a Poisson distribution with mean . The mean and variance of the number of mutations are given by Watterson (1975):
(5.1)
and
(5.2)
We are now in a position to describe the effect that mutation has on the individuals in the sample.

OCR for page 114

Page 122
The Ewens Sampling Formula Motivated by the realization that mutations in DNA sequences could lead to an essentially infinite number of alleles at the given locus, Kimura and Crow (1964) advocated modeling the effects of mutation as an infinitely-many-alleles model. In this process, a gene inherits the type of its ancestor if no mutation occurs and inherits a type not currently (or previously) existing in the population if a mutation does occur. In such a process the genes in the sample are thought of as unlabeled, so that the experimenter knows whether two genes are different, but records nothing further about the identity of alleles. In this case the natural statistic to record about the sample is its configuration Cn º (C1, C2,. . ., Cn), where
Cj = number of alleles represented j times.
Of course, C1 + 2C2+ . . . + nCn = n, and the number of alleles in the sample is
Kn º C1 + C2 + . . . +Cn. (5.3)
The sampling distribution of Cn was found by Ewens (1972):
(5.4)
for a = (a1,a2,. . .,an) satisfying aj ³ 0 for j = 1,2,. . .,n and and where
q (n)º q (q + 1)···(q+ n- 1).
From (5.4) it follows that

OCR for page 114

Page 123
(5.5)
and
(5.6)
being the Stirling number of the first kind. From (5.5) and (5.4) it follows that Kn is sufficient for q, so that the information in the sample relevant for estimating q is contained just in Kn. This allows us (Ewens, 1972, 1979) to calculate the maximum likelihood (and moment) estimator of q as the solution of the equation
(5.7)
where k is the number of alleles observed in the sample. In large samples, the estimator has variance given approximately by
(5.8)
For the pyrimidine sequence data described above in the ''Overview" section, there are k = 24 alleles. Solving equation (5.7) for gives = 10.62, with a variance of 9.89. An approximate 95 percent confidence interval for q is therefore 10.62 ± 6.29. This example serves to underline the variability inherent in estimating q from this model. The pyrimidine region comprises 201 sites, so that the per site substitution rate is estimated to be 0.053 ± 0.031.
The goodness of fit of the model to the data may be assessed by using the sufficiency of Kn for q: given Kn, the conditional distribution of the allele frequencies is independent of q. Ewens (1972, 1979) gives further details on this point. To describe alternative goodness-of-fit methods, we return briefly to the probabilistic structure of mutation in the coalescent.

OCR for page 114

Page 124
Forwards and Backwards in the Tree Hudson (1991) describes many situations in which simulation of genealogical trees is useful. In its simplest form, the idea is to construct (a simulation of) a coalescent tree, with times and branching order, and then superimpose the effects of mutation on this tree using the Poisson nature of the mutation process. In this section we make use of two equivalent descriptions of the effects of mutation in the coalescent tree in which the mutation and coalescence events evolve simultaneously.
Top-down The first of these methods is a very useful "top-down" scheme exploited by Ethier and Griffiths (1987) in the context of the infinitely-many-sites model. We start at the common ancestor of the sample and think of the genetic process running down to the sample. Just after the first split, we have a sample of two individuals, each of identical genetic type. Attach to each individual a pair of independent exponential alarm clocks—one of rate q / 2, the second of rate 1/2—and suppose the clocks are independent for different individuals. The q clocks will determine mutations, the other clocks split times. Now watch the clocks until the first one rings: if a q-clock rings, a mutation occurs in that gene, whereas if one of the other clocks rings, a split occurs in which that gene is copied, now making a sample of three individuals. Using the standard "competing exponentials" argument, the probability that a mutation occurs first is
whereas a split occurs first with probability 1/ (q + 1). Furthermore, given that a mutation occurs first, the gene in which it occurs is chosen uniformly and at random, and given that a split occurs first, the gene that is copied is chosen uniformly and at random.
Once an event occurs, the process repeats itself in a similar way. Suppose, then, that there are currently m genes in the sample. Attach independent mutation clocks of rate q / 2 and independent split clocks of

OCR for page 114

Page 142
There are many uses to which such total variation estimates can be put. In essence, functionals of the dependent process that depend mainly on the small component counts (that is, on components of size o(n)) are well approximated by the corresponding functionals of the independent process, which are often much easier to analyze. A typical example shows that the total number of components in such a structure asymptotically has a normal distribution, with mean and variance qk log n . A corresponding functional central limit theorem follows by precisely the same methods. In addition, these estimates lead to bounds on the distances between the laws of such functionals. Some examples that illustrate the power of this approach can be found in Arratia and Tavaré (1992) and Arratia et al. (1993).
Other Combinatorial Structures The strategy employed for assemblies also works for other combinatorial structures, including multisets and selections. We focus just on the multiset case. To build such structures, which are now unlabeled, imagine a supply of mj types of irreducible component of weight j, and build an object of total weight n by choosing components with replacement. The number N(a) of structures of weight n having aj components of size j = 1,2,. . .,n is
(5.34)
and the total number of structures of weight n is p(n) = SaN(a). A random multiset of size n has aj components of size j with probability
(5.35)
The ingredient common to assemblies and multisets is the fact that

OCR for page 114

Page 143
but the approximating independent random variables {Zi} are no longer Poisson, but rather negative binomial with parameters mi and xi:
(5.36)
valid for 0 < x < 1. In the q-biased case, the Zi are negative binomial with parameters mi and qxi, for any q < x-1.
The most studied example in this setting concerns the factorization of a random monic polynomial over the finite field GF(q) with q elements. The components of size i are precisely the irreducible factors of degree i, there being
of them. The function m(·) is the Möbius function: m(k) = -l or 1 according to whether k is the product of an odd or even number of distinct prime factors, and m(k) = 0 if k is divisible by the square of a prime. The logarithmic condition
(5.37)
is satisfied by random polynomials with k = 1 and y = q. For this logarithmic class the total variation estimates (5.32) and (5.33) apply once more (with appropriate modification for the q-biased case), and the techniques described at the end of the previous section can then be used to study the behavior of many interesting functionals. In particular, examples describing the functional central limit theorem, with error estimates, for the random polynomial case, can be found in Arratia et al. (1993).

OCR for page 114

Page 144
The Large Components Thus far we have described how we might approximate a complicated dependent process (the counts of small components) by a simpler, independent process, with an estimate of the error involved. It is natural to ask what can be said about the large component counts. To describe this, we return once more to the ESF.
Let L1 º L1 (n) ³ L2 ³···³ LK denote the sizes of the largest cycle, the second largest cycle, . . . , the smallest of the random number of cycles in a q-biased random permutation. We define Lj = Lj(n) = 0, j > K. It is known from the work of Kingman (1974, 1977) that the random vector n-1 (L1,L2,. . .,LK,0,0,. . .) converges in distribution to a random vector (X1, X2,. . .). The vector X has the Poisson-Dirichlet distribution with parameter q, which we denote by PD(q). There are a number of characterizations of PD(q), among them Kingman's original definition: Let s1 ³ s2 ³ ···³ 0 denote the points of a Poisson process on (0,¥) having mean measure with density 0 e-x / x, x > 0, and set s = Si³1 si . Then
We know that the large components, those that are of a size about n, of a q-biased random permutation are described asymptotically by the PD(q) law. What can be said about the large components of the other combinatorial structures we have seen? We focus once more on the logarithmic structures that satisfy either condition (5.31) or (5.37), where population genetics has a crucial role to play once more.
In approximating the behavior of counts of large components Cr = (Cr+l, Cr+2,. . .,Cn) we should not expect to be able to compare to an independent process because, for example, there can be at most components of size j or greater, and this condition forces very strong correlations on the counts of large components. However, we should be able to compare the component counting process Cr of the combinatorial

OCR for page 114

Page 145
structure to the ESF process , say. The approximating process is still discrete and, although not independent, it has a simpler structure than the original process. For random polynomials, it is shown in Arratia et al. (1993) that
(5.38)
so that the counts of factors of large degree can indeed be compared successfully to the corresponding counts for the ESF. The estimate in (5.38) has as a consequence the fact that the (renormalized) factors of largest degree have asymptotically the PD(1) law, a result that also follows from work of Hansen (1994). In addition, a rate of convergence is also available. In fact, (5.38) essentially holds for any of the logarithmic class (cf. Arratia et al., 1994a).
In conclusion, we have seen that a variety of interesting functionals of the component structure of certain combinatorial processes can be approximated in total variation norm by either those functionals of an independent process or those functionals of the ESF itself. The important aspect of this is the focus on discrete approximating processes, rather than those found by renormalizing to obtain a continuous limit. In a very real sense, our knowledge of ''the biology of random permutations," as described by the ESF, has provided a crucial ingredient in one area of probabilistic combinatorics.
Where to Next? In the preceding sections, we have illustrated how coalescent techniques can be used to model the evolution of samples of selectively neutral DNA sequence data. Simple techniques for estimating substitution rates, some based on likelihood methods and some on more ad hoc moment methods, were reviewed. We also illustrated how the probabilistic structure of the coalescent might be used to simulate observations in order to assess the variability of such estimators.

OCR for page 114

Page 146
Likelihood Methods Notwithstanding the lack of recombination and selection, inference about substitution rates poses some difficult statistical and computational problems. Most of these are due to the apparently heterogeneous nature of the substitution process in different regions of the sequence. One of the outstanding open problems in this area is the development of practical likelihood methods for sequence data. Inference techniques for sequence data from a fixed (but typically unknown) tree are reviewed in Felsenstein (1988). The added ingredient in the population genetics setting is the random nature of the coalescent itself—in principle, we have to average likelihoods on trees over the underlying coalescent sample paths. The computational problems involved in this are enormous. The likelihood can be thought of as a sum (over tree topologies) of terms, in each of which the probability of the configuration of alleles, given the branching order and coalescence times Tn,Tn-1,. . .,T2, is averaged over the law of Tn,Tn-1,. . .,T2. Monte Carlo techniques might be employed in its evaluation. One approach, using a bootstrap technique, is described by Felsenstein (1992).
An alternative approach is to compute likelihoods numerically using the recursion in equation (5.17). The probabilistic structure of the coalescent takes care of the integration, and the problem is, in principle at least, simpler. For small sample sizes and simple mutation schemes this is possible (see Lundstrom (1990), for example), but it is computationally prohibitive even for samples of the size discussed earlier. An alternative is the Markov chain Monte Carlo approach in Griffiths and Tavaré (1994a), in which equation (5.17) is used to construct an absorbing Markov process in such a way that the probability q(n) in (5.17) is the expected value of a functional of the process up to the absorption time. That is, represent q(n) as
(5.39)
where {N(j), = 0,1,. . .} is a stochastic process determined by (5.17), and t is the time it takes this process to reach a particular set of states. Classical simulation methodology can now be used to simulate independent

OCR for page 114

Page 147
observations with mean q(n). The scheme in (5.39) can be modified to estimate the entire likelihood surface from a single run, providing a computationally feasible method for approximating likelihood surfaces.
As an illustration, we return to the mitochondrial data described in the subsection on the infinitely-many-sites model above. We saw that of the 21 segregating sites in the sample, 14 were consistent with an infinitely-man-ysites model. The remaining 7 sites are described in Table 5.2. These data comprise a sample of 63 individuals from a K = 27 = 128 allele model. The allele frequencies are given in Table 5.2.
Table 5.2 Incompatible Sites and Frequencies
Sequence
Site 0
1 T
9 T
10 C
13 C
17 T
18 T
19 C
Frequency
1
0
0
0
1
0
0
0
8
2
0
0
0
0
0
0
0
12
3
1
0
0
0
0
0
0
3
4
0
1
0
0
0
0
0
12
5
0
0
0
1
1
0
0
2
6
0
0
1
0
0
0
1
1
7
0
0
0
0
1
1
0
1
8
0
0
0
0
0
1
0
9
9
1
0
0
0
0
1
0
3
10
0
0
1
0
0
1
0
1
11
0
0
0
0
0
1
1
7
12
0
0
1
0
0
1
1
3
13
0
1
1
0
0
1
1
1
NOTE: Data are from Table 5.1. The row labeled 0 gives the nucleotide corresponding to 0 at that site. The last column gives the frequencies of the alleles in the sample.
The observed fraction of T nucleotides is pT = 207 / 441 = 0.469, and so p C= 0.531. We use these to determine the per-site mutation rate matrix Q in (5.15):

OCR for page 114

Page 148
where s = 7. Assuming that pC and pT are given by their observed frequencies, there is just the single parameter q to be estimated. Preliminary simulation results give the maximum likelihood estimate of q at about = 17 . This corresponds to a per site C ® T rate of a = 1.14, and a per site T ® C rate of b = 1.28. These rates are about 50 times higher than those based on the analysis in the section on the K-allele models above using all 201 sites. Of course, this set of sites was chosen essentially because of the high mutation rates in the region and so should represent an extreme estimate of the rates in the whole molecule. Nonetheless, the results do point to the lack of homogeneity in substitution rates in this molecule. For other approaches to the modeling of hypervariable sites, see Lundstrom et al. (1992b).
Discussion The emphasis in this chapter has been the discussion of inference techniques for the coalescent, a natural model for the analysis of samples taken from large populations.
An interesting development in the mathematical theory has been the study of measure-valued diffusions initiated by Fleming and Viot (1979). This is a generalization of the "usual" diffusions so prevalent in the classical theory of population genetics, described for example in Ewens (1979, 1990) and Tavaré (1984). A comprehensive discussion of the Fleming-Viot process appears in Ethier and Kurtz (1993), where the probabilistic structure of a broad range of examples, such as multiple loci with recombination, infinitely many alleles with selection, multigene families, and migration models, are discussed in some detail.
Perhaps the most important aspect of the theory that has seen rather little theoretical treatment thus far is the area that might loosely be called variable population size processes, and their inference. These issues are becoming more important in the analysis and interpretation of human mitochondrial sequence data. Two recent articles in this area are Slatkin

OCR for page 114

Page 149
and Hudson (1991) and Rogers and Harpending (1992). Lundstrom et al. (1992b) note that the effects of variable population size on gene frequency distributions can readily be confounded with the effects of hypervariable regions in the sequences. A careful assessment of the interaction of these two effects seems important, as does a detailed treatment of the effects of spatial structure and population subdivision on the analysis of sequence diversity. The Monte Carlo likelihood methods developed for sequence data in Griffiths and Tavaré (1994a) adapt readily to situations like this. See, for example, Griffiths and Tavaré (1994b.) They offer a practical approach to inference from very complicated stochastic processes. These techniques are based on genealogical arguments that provide the cornerstone of a firm quantitative basis for the analysis of DNA sequence data and our understanding of genomic diversity.
References General-Purpose References Arratia, R., and S. Tavaré, 1994, "Independent process approximations for random combinatorial structures," Adv. Math. 104, 90-154.
Avise, J.C., 1986, "Mitochondrial DNA and the evolutionary genetics of higher animals," Philos. Trans. R. Soc. London, Ser. B 312, 325-342.
Ethier, S.N., and T.G. Kurtz, 1993, "Fleming-Viot processes in population genetics," SIAM J. Control Optim. 31, 345-386.
Ewens, W.J., 1979, Mathematical Population Genetics, New York: Springer-Verlag. Ewens, W.J., 1990, "Population genetics theory-the past and the future," pp. 177-227 in Mathematical and Statistical Developments of Evolutionary, Theory, S. Lessard (ed.), Holland: Kluwer Dordrecht.
Felsenstein, J., 1982, "Numerical methods for inferring evolutionary trees," Quarterly Review of Biology 57, 379-404.
Felsenstein, J., 1988, "Phylogenies from molecular sequences: inference and reliability," Annu. Rev. Genet. 22, 521-565.
Hudson, R.R., 1991, "Gene genealogies and the coalescent process," pp. 1-44 in Oxford Surveys in Evolutionary Biology 7, D. Futuyma and J. Antonovics (eds.).
Tavaré, S., 1984, "Line-of-descent and genealogical processes, and their applications in population genetics models," Theor. Popul. Biol. 26, 119-164.

OCR for page 114

Page 150
Detailed References Anderson, S., A. Bankier, B. Barrell, M. deBruijn, A. Coulson, J. Drouin, I. Eperon, D. Nierlich, B. Roe, F. Sanger, P. Schreier, A. Smith, R. Staden, and I. Young, 1981, "Sequence and organization of the human mitochondrial genome," Nature 290, 457-465.
Arratia, R., and S. Tavaré, 1992, "Limit theorems for combinatorial structures via discrete process approximations," Random Struct. Algebra 3, 321-345.
Arratia, R., A.D. Barbour, and S. Tavaré, 1992, "Poisson process approximations for the Ewens sampling formula," Ann. Appl. Probab. 2, 519-535.
Arratia, R., A.D. Barbour, and S. Tavaré, 1993, "On random polynomials over finite fields," Math. Proc. Cambridge Philos. Soc. 114, 347-368.
Arratia, R., A.D. Barbour, and S. Tavaré, 1994a, "Logarithmic combinatorial structures," Ann. Probab., in preparation.
Arratia, R., D. Stark, and S. Tavaré, 1994b, "Total variation asymptotics for Poisson process approximations of logarithmic combinatorial assemblies," Ann. Probab., in press.
Cann, R., M. Stoneking, and A.C. Wilson, 1987, "Mitochondrial DNA and human evolution," Nature 325, 31-36.
Di Rienzo, A., and A.C. Wilson, 1991, "Branching pattern in the evolutionary tree for human mitochondrial DNA," Proceedings of the National Academy of Sciences USA 88, 1597-1601.
Estabrook, G.F., C.S. Johnson, Jr., and F.R. McMorris, 1976, "An algebraic analysis of cladistic characters," Discrete Math. 16, 141-147.
Ethier, S.N., and R.C. Griffiths, 1987, "The infinitely-many-sites model as a measure valued diffusion," Ann. Probab. 15, 515-545.
Ewens, W.J., 1972, "The sampling theory of selectively neutral alleles," Theor. Popul. Biol. 3,87-112.
Felsenstein, J., 1992, "Estimating effective population size from samples of sequences: A bootstrap Monte Carlo integration approach," Genet. Res. Camb. 60, 209-220.
Flajolet, P., and A.M. Odlyzko, 1990, "Random mapping statistics," pp. 329-354 in Proc. Eurocrypt '89, J.-J. Quisquater (ed.), Lecture Notes in Computer Science 434, Springer-Verlag.
Fleming, W.H., and M. Viot, 1979, "Some measure-valued Markov processes in population genetics theory," Indiana Univ. Math. J. 28, 817-843.
Goncharov, V.L., 1944, "Some facts from combinatorics," Izv. Akad. Nauk. SSSR, Ser. Mat. 8, 3-48 [in Russian]; "On the field of combinatory analysis," Trans. Am. Math. Soc. 19, 1-46.
Griffiths, R.C., 1987, "An algorithm for constructing genealogical trees," Statistics Research Report 163, Department of Mathematics, Monash University.
Griffiths, R.C., 1989, "Genealogical-tree probabilities in the infinitely-many-sites model," J. Math. Biol. 27, 667-680.
Griffiths, R.C., and S. Tavaré, 1994a, "Simulating probability distributions in the coalescent process," Theor. Popul. Biol. 46, 131-159.
Griffiths, R.C., and S. Tavaré, 1994b, "Sampling theory for neutral alleles in a varying environment," Philos. Trans. R. Soc. London, Ser. B 344, 403-410.

OCR for page 114

Page 151
Griffiths, R.C., and S. Tavaré, 1994c, "Unrooted genealogical tree probabilities in the infinitely-many-sites model," Math. Biosci., in press.
Hansen, J.C., 1994, "Order statistics for decomposable combinatorial structures," Random Struct. Alg. 5, 517-533.
Horai, S., and K. Hayasaka, 1990, "Intraspecific nucleotide sequence differences in the major noncoding region of human mitochondrial DNA," Am. J. Hum. Genet. 46, 828-842.
Kimura, M., 1969, "The number of heterozygous nucleotide sites maintained in a finite population due to steady influx of mutations," Genetics 61, 893-903.
Kimura, M., and J.F. Crow, 1964, "The number of alleles that can be maintained in a finite population," Genetics 49, 725-738.
Kingman, J.F.C., 1974, "Random discrete distributions, "J. R. Stat. Soc. 37, 1-22.
Kingman, J.F.C., 1977, "The population structure associated with the Ewens sampling formula," Theor. Popul. Biol. 11, 274-283.
Kingman, J.F.C., 1982a, "On the genealogy of large populations," J. Appl. Probab. 19A, 27-43.
Kingman, J.F.C., 1982b, "The coalescent," Stoch. Processes Appl. 13, 235-248.
Kolchin, V.F., 1976, "A problem of the allocation of particles in cells and random mappings," Theory Probab. Its Applic. (Engl. Transl.) 21, 48-63.
Kolchin, V.F., 1986, Random Mappings, New York: Optimization Software, Inc.
Lundstrom, R., 1990, Stochastic Models and Statistical Methods for DNA Sequence Data, PhD thesis, Department of Mathematics, University of Utah.
Lundstrom, R., S. Tavaré, and R.H. Ward, 1992a, "Estimating mutation rates from molecular data using the coalescent," Proceedings of the National Academy of Sciences USA 89, 5961-5965.
Lundstrom, R., S. Tavaré, and R.H. Ward, 1992b, "Modelling the evolution of the human mitochondrial genome," Math. Biosci. 122, 319-336.
Maddison, D.R., 1991, "African origin of human mitochondrial DNA reexamined," Systematic Zoology 40, 355-363.
McMorris, F.R., 1977, "On the compatibility of binary qualitative taxonomic characters," Bull. Math. Biol. 39, 133-138.
Mutafciev, L., 1984, "On some stochastic problems of discrete mathematics," pp. 57-80 in Mathematics and Education in Mathematics (Sunny Beach), Sophia, Bulgaria: Bulgarian Academy of Sciences.
Nei, M., 1992, "Age of the common ancestor of human mitochondrial DNA," Mol. Biol. Evol. 9, 1176-1178.
Rogers, A., and H. Harpending, 1992, "Population growth makes waves in the distribution of pairwise genetic differences," Mol. Biol. Evol. 9, 552-569.
Schurr, T., S. Ballinger, Y. Gan, J. Hodge, D.A. Merriwether, D. Lawrence, W. Knowler, K. Weiss, and D. Wallace, 1990, "Amerindian mitochondrial DNAs have rare Asian mutations at high frequencies, suggesting they derived from four primary maternal lineages," Am. J. Hum. Genet. 47, 613-623.
Slatkin, M., and R.R. Hudson, 1991, "Pairwise comparisons of mitochondrial DNA sequences in stable and exponentially growing populations," Genetics 129, 555-562.

OCR for page 114

Page 152
Templeton, A.R., 1992, "Human origins and analysis of mitochondrial DNA sequences," Science 255, 737-754.
Vigilant, L., R. Pennington, H. Harpending, T. Kocher, and A.C. Wilson, 1989, "Mitochondrial DNA sequences in single hairs from a South African population," Proceedings of the National Academy of Sciences USA 86, 9350-9354.
Vigilant, L., M. Stoneking, H. Harpending, K. Hawkes, and A.C. Wilson, 1991, "African populations and the evolution of human mitochondrial DNA," Science 253, 1503-1507.
Ward, R.H., B.L. Frazier, K. Dew, and S. Piabo, 1991, "Extensive mitochondrial diversity within a single Amerindian tribe," Proceedings of the National Academy of Sciences USA 88, 8720-8724.
Watterson, G.A., 1975, "On the number of segregating sites in genetical models without recombination," Theoret. Popul. Biol. 7, 256-276.
Wright, S., 1968, Evolution and the Genetics of Populations, Vol. 2, Chicago: University of Chicago Press.