Below is the uncorrected machine-read text of this chapter, intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text of each book. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.
CALIBRATING THE CLOCK: USING STOCHASTIC PROCESSES TO MEASURE THE RATE OF EVOLUTION 146 Likelihood Methods Notwithstanding the lack of recombination and selection, inference about substitution rates poses some difficult statistical and computational problems. Most of these are due to the apparently heterogeneous nature of the substitution process in different regions of the sequence. One of the outstanding open problems in this area is the development of practical likelihood methods for sequence data. Inference techniques for sequence data from a fixed (but typically unknown) tree are reviewed in Felsenstein (1988). The added ingredient in the population genetics setting is the random nature of the coalescent itselfâin principle, we have to average likelihoods on trees over the underlying coalescent sample paths. The computational problems involved in this are enormous. The likelihood can be thought of as a sum (over tree topologies) of terms, in each of which the probability of the configuration of alleles, given the branching order and coalescence times Tn,Tnâ1,. . .,T2, is averaged over the law of Tn,Tnâ1,. . .,T2. Monte Carlo techniques might be employed in its evaluation. One approach, using a bootstrap technique, is described by Felsenstein (1992). An alternative approach is to compute likelihoods numerically using the recursion in equation (5.17). The probabilistic structure of the coalescent takes care of the integration, and the problem is, in principle at least, simpler. For small sample sizes and simple mutation schemes this is possible (see Lundstrom (1990), for example), but it is computationally prohibitive even for samples of the size discussed earlier. An alternative is the Markov chain Monte Carlo approach in Griffiths and Tavaré (1994a), in which equation (5.17) is used to construct an absorbing Markov process in such a way that the probability q(n) in (5.17) is the expected value of a functional of the process up to the absorption time. That is, represent q(n) as (5.39) where {N(j), = 0,1,. . .} is a stochastic process determined by (5.17), and Ï is the time it takes this process to reach a particular set of states. Classical simulation methodology can now be used to simulate independent
CALIBRATING THE CLOCK: USING STOCHASTIC PROCESSES TO MEASURE THE RATE OF EVOLUTION 147 observations with mean q(n). The scheme in (5.39) can be modified to estimate the entire likelihood surface from a single run, providing a computationally feasible method for approximating likelihood surfaces. As an illustration, we return to the mitochondrial data described in the subsection on the infinitely-many-sites model above. We saw that of the 21 segregating sites in the sample, 14 were consistent with an infinitely-many- sites model. The remaining 7 sites are described in Table 5.2. These data comprise a sample of 63 individuals from a K = 27 = 128 allele model. The allele frequencies are given in Table 5.2. Table 5.2 Incompatible Sites and Frequencies Sequence Site 1 9 10 13 17 18 19 Frequency 0 T T C C T T C 1 0 0 0 1 0 0 0 8 2 0 0 0 0 0 0 0 12 3 1 0 0 0 0 0 0 3 4 0 1 0 0 0 0 0 12 5 0 0 0 1 1 0 0 2 6 0 0 1 0 0 0 1 1 7 0 0 0 0 1 1 0 1 8 0 0 0 0 0 1 0 9 9 1 0 0 0 0 1 0 3 10 0 0 1 0 0 1 0 1 11 0 0 0 0 0 1 1 7 12 0 0 1 0 0 1 1 3 13 0 1 1 0 0 1 1 1 NOTE: Data are from Table 5.1. The row labeled 0 gives the nucleotide corresponding to 0 at that site. The last column gives the frequencies of the alleles in the sample. The observed fraction of T nucleotides is ÏT = 207 / 441 = 0.469, and so Ï C= 0.531. We use these to determine the per-site mutation rate matrix Q in (5.15):