National Academies Press: OpenBook
« Previous: K-Allele Models
Suggested Citation:"The Finitely-Many-Sites Models." National Research Council. 1995. Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology. Washington, DC: The National Academies Press. doi: 10.17226/2121.
×
Page 132
Suggested Citation:"The Finitely-Many-Sites Models." National Research Council. 1995. Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology. Washington, DC: The National Academies Press. doi: 10.17226/2121.
×
Page 133
Suggested Citation:"The Finitely-Many-Sites Models." National Research Council. 1995. Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology. Washington, DC: The National Academies Press. doi: 10.17226/2121.
×
Page 134
Suggested Citation:"The Finitely-Many-Sites Models." National Research Council. 1995. Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology. Washington, DC: The National Academies Press. doi: 10.17226/2121.
×
Page 135

Below is the uncorrected machine-read text of this chapter, intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text of each book. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.

CALIBRATING THE CLOCK: USING STOCHASTIC PROCESSES TO MEASURE THE RATE OF EVOLUTION 132 Because the sampling formula for general Q is not known explicitly, it is useful to have a way to compute it. Perhaps the simplest is an application of the bottom-up method described above. Define q(n) = P(N = n), and set ej = (0,. . .,0,1,0,. . .,0), the 1 occurring in position j. Look up the tree to the first event that occurred. This is either a mutation (with probability θ / (θ + n − 1) ) or a coalescence (with probability (n − 1) / (θ + n − 1)). By considering the configuration of the sample at this event, we see that q(n) satisfies the recursion (5.17) where q(n) = 0 if any ni < 0, and q(ei) = . The process is stationary if = πi for all i. We exploit this recursion more fully in the section below on likelihood methods. The Finitely-Many-Sites Models We now have the machinery necessary to describe the finitely-many-sites model for molecular sequence data involvingn sequences, each of s sites. The sites are thought of as completely linked, and each site is typically one of either 2 or 4 possibilities. At its grossest level, the finitely-many-sites model is "just" a K-allele model in which K = 2s or 4s. From an inference point of view, however, there are far too many parameters in such a model, and some simplification is required. The simplest null model of sequence evolution is the case in which mutations occur at a rate of θ / 2 per gene, but when a mutation occurs, a site is chosen at random to mutate and the base at that site changes according to a mutation matrix M. A slightly more general model might allow site j to mutate with probability pj, once more according to M. For a two-type classification of each site, the first model has two parameters to be estimated, and the second has s + 1. These schemes can be modified to allow for other correlation

CALIBRATING THE CLOCK: USING STOCHASTIC PROCESSES TO MEASURE THE RATE OF EVOLUTION 133 structures between the sites at the expense of more complicated methods of analysis. Motivated by our sequence data, we concentrate on the two-state case and discuss methods for estimating the parameters of the simplest null model. At a single site, the model behaves exactly like a 2-allele process with because the per site substitution rate is θ / s. This has the structure of (5.15), with α = m12θ / s and β = m21θ / s. The distribution of sites is exchangeable (since, conditional on the coalescent tree, mutations are laid down independently at each site; this is a simple example of a marked Poisson process argument), and in particular the sites have identical distributions. They are not of course independent because of correlations induced by the common ancestry in the coalescent. However, some simple properties of the sequences are easy to calculate. In particular, the number Sn of segregating sites has mean E[Sn] = sP(site is segregating) = s(l − g(0) − g(n)), (5.18) where g(·) is given by (5.16). The equation (5.18) provides a simple heuristic method for estimating the parameters of the process. First, the equilibrium base frequencies π1 = β / (α + β) and π2 = α / (α + β) are estimated from the sequence data. This done, the expected fraction of sites that are not segregating is, from (5.16) and (5.18), (5.19) where θs = 0 / s is the per-site substitution rate. For the pyrimidine mtDNA data, the observed fraction of nonsegregating sites is 180/201 = 0.896 and the observed fractions of C (labeled 1) and T (labeled 2) bases are π1 = 0.604 and π2= 0.396, respectively. Substituting these into (5.19)

CALIBRATING THE CLOCK: USING STOCHASTIC PROCESSES TO MEASURE THE RATE OF EVOLUTION 134 and solving for θs give the moment estimator s = 0.050. This translates into an estimate of α = 2q12 = 0.050 × 0.40 = 0.02, and an estimate of β = 2q21 = 0.03. The variance of the moment estimator is hard to compute explicitly, although the top-down simulation method for the coalescent could be used to simulate the process and therefore to construct empirical estimates of the variance. A more detailed approach to rate estimation in the finite sites model is described by Lundstrom et al. (1992a). The method is based once more on the exchangeability of the distribution of base frequencies between sites with the same mutation structure. Returning to the case in which there are K possible labelings at each site, define to be the fraction of sites in which xj individuals in the sample have nucleotide j at that site, for 1 ≤ j ≤ K. The mean of is given by (5.20) the right-hand side being given by (5.14) for the independent mutation model, or by the solution of the recursion (5.17) in the general case. A least squares method obtains estimates by minimizing the squared error function This moment estimator makes fuller use of the data than the estimate based on the number of segregating sites. An alternative estimator, also described in Lundstrom et al. (1992a), is based on the assumption that the sites are evolving independently. This approximation, which is reasonable for large substitution rates (where the between sites correlations are effectively washed out), produces a likelihood function proportional to that can then be maximized to obtain parameter estimates.

CALIBRATING THE CLOCK: USING STOCHASTIC PROCESSES TO MEASURE THE RATE OF EVOLUTION 135 For the mtDNA pyrimidine data, the moment method and the (independent sites) maximum likelihood method gave estimates of the C to T rate as α = 2q12 = 0.02 , and the T to C rate as β = 2q21 = 0.03. These are in close agreement with the segregating sites estimator described above. To assess the variability in the estimates of α and ß, we used the top-down simulation described above, arriving at empirical bootstrap confidence intervals of (0.01,0.04) for a and (0.02,0.06) for β. These rates correspond to substitution probabilities of between 17 × 10−6 and 33 × 10−6 per site per generation for transitions from C to T, and between 25 × 10−6 and 50 × 10−6 per site per generation for transitions from T to C. The adequacy of these estimates depends, of course, on how well the model fits the data. To assess this, we investigated how well key features of the data are reflected in simulations of the coalescent process with the given estimated rates. As might be expected, the overall base frequencies and the number of segregating sites observed in the data are accurately reflected in the simulations. One poor aspect of the fit concerned the number of distinct sequences observed in the simulations (9 to 17 per sample) compared with the 24 observed in the data. There are several reasons why such a poor fit might be observed, among them (a) site-specific variability in mutation rates, (b) admixture between genetically distinct tribes, and (c) fluctuations in population size that are not captured in the model. Further discussion of these points can be found in Lundstrom et al. (1992a) and in the final section of the present chapter. At this point, we have come to our mathematical vignette, where population genetics theory intersects with an interesting area in combinatorics. The mathematical level of the vignette is somewhat higher than our discussion of the coalescent; readers primarily interested in aspects of the coalescent might feel justified in skipping to the final section.

Next: Approximations for the Ewens Sampling Formula »
Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology Get This Book
×
 Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology
Buy Paperback | $80.00
MyNAP members save 10% online.
Login or Register to save!
Download Free PDF

As researchers have pursued biology's secrets to the molecular level, mathematical and computer sciences have played an increasingly important role—in genome mapping, population genetics, and even the controversial search for "Eve," hypothetical mother of the human race.

In this first-ever survey of the partnership between the two fields, leading experts look at how mathematical research and methods have made possible important discoveries in biology.

The volume explores how differential geometry, topology, and differential mechanics have allowed researchers to "wind" and "unwind" DNA's double helix to understand the phenomenon of supercoiling. It explains how mathematical tools are revealing the workings of enzymes and proteins. And it describes how mathematicians are detecting echoes from the origin of life by applying stochastic and statistical theory to the study of DNA sequences.

This informative and motivational book will be of interest to researchers, research administrators, and educators and students in mathematics, computer sciences, and biology.

READ FREE ONLINE

  1. ×

    Welcome to OpenBook!

    You're looking at OpenBook, NAP.edu's online reading room since 1999. Based on feedback from you, our users, we've made some improvements that make it easier than ever to read thousands of publications on our website.

    Do you want to take a quick tour of the OpenBook's features?

    No Thanks Take a Tour »
  2. ×

    Show this book's table of contents, where you can jump to any chapter by name.

    « Back Next »
  3. ×

    ...or use these buttons to go back to the previous chapter or skip to the next one.

    « Back Next »
  4. ×

    Jump up to the previous page or down to the next one. Also, you can type in a page number and press Enter to go directly to that page in the book.

    « Back Next »
  5. ×

    To search the entire text of this book, type in your search term here and press Enter.

    « Back Next »
  6. ×

    Share a link to this book page on your preferred social network or via email.

    « Back Next »
  7. ×

    View our suggested citation for this chapter.

    « Back Next »
  8. ×

    Ready to take your reading offline? Click here to buy this book in print or download it as a free PDF, if available.

    « Back Next »
Stay Connected!