Below is the uncorrected machine-read text of this chapter, intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text of each book. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.
CALIBRATING THE CLOCK: USING STOCHASTIC PROCESSES TO MEASURE THE RATE OF EVOLUTION 132 Because the sampling formula for general Q is not known explicitly, it is useful to have a way to compute it. Perhaps the simplest is an application of the bottom-up method described above. Define q(n) = P(N = n), and set ej = (0,. . .,0,1,0,. . .,0), the 1 occurring in position j. Look up the tree to the first event that occurred. This is either a mutation (with probability θ / (θ + n â 1) ) or a coalescence (with probability (n â 1) / (θ + n â 1)). By considering the configuration of the sample at this event, we see that q(n) satisfies the recursion (5.17) where q(n) = 0 if any ni < 0, and q(ei) = . The process is stationary if = Ïi for all i. We exploit this recursion more fully in the section below on likelihood methods. The Finitely-Many-Sites Models We now have the machinery necessary to describe the finitely-many-sites model for molecular sequence data involvingn sequences, each of s sites. The sites are thought of as completely linked, and each site is typically one of either 2 or 4 possibilities. At its grossest level, the finitely-many-sites model is "just" a K-allele model in which K = 2s or 4s. From an inference point of view, however, there are far too many parameters in such a model, and some simplification is required. The simplest null model of sequence evolution is the case in which mutations occur at a rate of θ / 2 per gene, but when a mutation occurs, a site is chosen at random to mutate and the base at that site changes according to a mutation matrix M. A slightly more general model might allow site j to mutate with probability pj, once more according to M. For a two-type classification of each site, the first model has two parameters to be estimated, and the second has s + 1. These schemes can be modified to allow for other correlation
CALIBRATING THE CLOCK: USING STOCHASTIC PROCESSES TO MEASURE THE RATE OF EVOLUTION 133 structures between the sites at the expense of more complicated methods of analysis. Motivated by our sequence data, we concentrate on the two-state case and discuss methods for estimating the parameters of the simplest null model. At a single site, the model behaves exactly like a 2-allele process with because the per site substitution rate is θ / s. This has the structure of (5.15), with α = m12θ / s and β = m21θ / s. The distribution of sites is exchangeable (since, conditional on the coalescent tree, mutations are laid down independently at each site; this is a simple example of a marked Poisson process argument), and in particular the sites have identical distributions. They are not of course independent because of correlations induced by the common ancestry in the coalescent. However, some simple properties of the sequences are easy to calculate. In particular, the number Sn of segregating sites has mean E[Sn] = sP(site is segregating) = s(l â g(0) â g(n)), (5.18) where g(·) is given by (5.16). The equation (5.18) provides a simple heuristic method for estimating the parameters of the process. First, the equilibrium base frequencies Ï1 = β / (α + β) and Ï2 = α / (α + β) are estimated from the sequence data. This done, the expected fraction of sites that are not segregating is, from (5.16) and (5.18), (5.19) where θs = 0 / s is the per-site substitution rate. For the pyrimidine mtDNA data, the observed fraction of nonsegregating sites is 180/201 = 0.896 and the observed fractions of C (labeled 1) and T (labeled 2) bases are Ï1 = 0.604 and Ï2= 0.396, respectively. Substituting these into (5.19)
CALIBRATING THE CLOCK: USING STOCHASTIC PROCESSES TO MEASURE THE RATE OF EVOLUTION 134 and solving for θs give the moment estimator s = 0.050. This translates into an estimate of α = 2q12 = 0.050 à 0.40 = 0.02, and an estimate of β = 2q21 = 0.03. The variance of the moment estimator is hard to compute explicitly, although the top-down simulation method for the coalescent could be used to simulate the process and therefore to construct empirical estimates of the variance. A more detailed approach to rate estimation in the finite sites model is described by Lundstrom et al. (1992a). The method is based once more on the exchangeability of the distribution of base frequencies between sites with the same mutation structure. Returning to the case in which there are K possible labelings at each site, define to be the fraction of sites in which xj individuals in the sample have nucleotide j at that site, for 1 ⤠j ⤠K. The mean of is given by (5.20) the right-hand side being given by (5.14) for the independent mutation model, or by the solution of the recursion (5.17) in the general case. A least squares method obtains estimates by minimizing the squared error function This moment estimator makes fuller use of the data than the estimate based on the number of segregating sites. An alternative estimator, also described in Lundstrom et al. (1992a), is based on the assumption that the sites are evolving independently. This approximation, which is reasonable for large substitution rates (where the between sites correlations are effectively washed out), produces a likelihood function proportional to that can then be maximized to obtain parameter estimates.
CALIBRATING THE CLOCK: USING STOCHASTIC PROCESSES TO MEASURE THE RATE OF EVOLUTION 135 For the mtDNA pyrimidine data, the moment method and the (independent sites) maximum likelihood method gave estimates of the C to T rate as α = 2q12 = 0.02 , and the T to C rate as β = 2q21 = 0.03. These are in close agreement with the segregating sites estimator described above. To assess the variability in the estimates of α and Ã, we used the top-down simulation described above, arriving at empirical bootstrap confidence intervals of (0.01,0.04) for a and (0.02,0.06) for β. These rates correspond to substitution probabilities of between 17 à 10â6 and 33 à 10â6 per site per generation for transitions from C to T, and between 25 à 10â6 and 50 à 10â6 per site per generation for transitions from T to C. The adequacy of these estimates depends, of course, on how well the model fits the data. To assess this, we investigated how well key features of the data are reflected in simulations of the coalescent process with the given estimated rates. As might be expected, the overall base frequencies and the number of segregating sites observed in the data are accurately reflected in the simulations. One poor aspect of the fit concerned the number of distinct sequences observed in the simulations (9 to 17 per sample) compared with the 24 observed in the data. There are several reasons why such a poor fit might be observed, among them (a) site-specific variability in mutation rates, (b) admixture between genetically distinct tribes, and (c) fluctuations in population size that are not captured in the model. Further discussion of these points can be found in Lundstrom et al. (1992a) and in the final section of the present chapter. At this point, we have come to our mathematical vignette, where population genetics theory intersects with an interesting area in combinatorics. The mathematical level of the vignette is somewhat higher than our discussion of the coalescent; readers primarily interested in aspects of the coalescent might feel justified in skipping to the final section.