Read "Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology" at NAP.edu

Suggested Citation:"The Infinitely-Many-Sites Model." National Research Council. 1995. Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology. Washington, DC: The National Academies Press. doi: 10.17226/2121.

CALIBRATING THE CLOCK: USING STOCHASTIC PROCESSES TO MEASURE THE RATE OF EVOLUTION

(5.10)

The only case not covered by equation (5.10) is the one in which a = (0, ..., 0, 1). In this case the previous event had to be a coalescence, and so

(5.11)

The persistent reader will be able to verify that $P_n(a)$ given by the Ewens sampling formula (5.4) does indeed satisfy equations (5.10) and (5.11).

The Infinitely-Many-Sites Model

The infinitely-many-sites model of Kimura (1969) and Watterson (1975) is the simplest description of the evolution of a population of DNA sequences. The sites in the sequences are completely linked, and each mutation that occurs in the ancestral tree of the sample introduces a new segregating site into the sample. In this process, each new mutation occurs at a site not previously segregating; new mutations arise just once. It follows that at each segregating site, each sequence in the sample may be classified as type 0 (ancestral) or type 1 (mutant). Of course, in practice we do not know which is which. The sequences in the sample may now be described by strings of 0s and 1s. If distinct sequences are treated as alleles, then the sampling theory reduces to that covered by the Ewens sampling formula.

The number $S_n$ of segregating sites is an important summary statistic for the sample. Since each new mutation produces a segregating site, it follows that $S_n = \mu_n$, the number of mutations in the ancestral tree. The mean and variance of $S_n$ are therefore given by (5.1) and (5.2), respectively. The number of segregating sites has been studied extensively for many variants of the infinitely-many-sites process, including those with selection and recombination. Hudson (1991) gives an accessible summary of this work. When there is no recombination, the fundamental results have been established by Watterson (1975), Ethier and Griffiths (1987), and Griffiths (1989).

Watterson (1975) parlayed the moments of $S_n$ into an unbiased estimator of $\theta$, namely,

$$\hat{\theta}_W = \frac{S_n}{\sum_{j=1}^{n-1} 1/j}, \tag{5.12}$$

with variance

$$\operatorname{Var}(\hat{\theta}_W) = \frac{\theta}{g_1} + \frac{\theta^2 g_2}{g_1^2},$$

where $g_1 = \sum_{j=1}^{n-1} 1/j$ and $g_2 = \sum_{j=1}^{n-1} 1/j^2$. Note that $\hat{\theta}_W$ does not depend on knowing which type at a site is ancestral and does not make full use of the data. For the pyrimidine data, there are 21 segregating sites, giving an approximate 95 percent confidence interval for $\theta$ of 4.46 ± 3.10. This should be compared to the estimate of 10.62 ± 6.29 obtained from the Ewens sampling formula.

Now think of the data as an $n \times s$ matrix of 0s and 1s, $s$ being the number of segregating sites in the sample. When 0 is known to be ancestral in each site, Griffiths (1987) established that the data are consistent with the infinitely-many-sites model as long as in any set of three rows of the matrix, at most one of the column patterns (1,1,0), (1,0,1), (0,1,1) occurs. This is equivalent to the pairwise compatibility condition for binary characters established by Estabrook et al. (1976) and McMorris (1977): two sites are compatible if two or fewer of the patterns 01, 10, 11 occur. When the ancestral state is unknown, an analogous result holds: two sites are compatible if at most three of the patterns 00, 01, 10, 11 occur. This translates into a simple test of whether a given set of binary site data is consistent with the infinitely-many-sites model. If in all pairs of columns at most three of the patterns 00, 01, 10, 11 occur, then there is at least one labeling of the sites that is consistent. McMorris (1977) proved that consistent data remain consistent when the most frequent type is taken as ancestral.

In practice, back mutations and recombination make most molecular data inconsistent with this model. However, it is worthwhile to look for maximal subsets of sites that are consistent, as this provides a way to identify regions of the sequence with simple structure. For the pyrimidine data described in Table 5.1, the maximal consistent set has 14 sites, those in positions 2-8, 11-12, 14-16, and 20-21. The remaining 7 sites have some inconsistencies, attributable to back substitutions, for example. Of the $2^{14}$ = 16,384 possible relabelings of the consistent set, just 16 are consistent. Each of these labelings is associated with a genealogical tree that describes the relationships between the mutations in the coalescent. The precise definition of the (equivalence class of) trees is given in Ethier and Griffiths (1987) and Griffiths (1989). The tree is equivalent to those built using compatibility methods for binary characters; see Felsenstein (1982, pp. 389-393) for a detailed discussion and references.
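The pairwise test described above is simple enough to sketch in code. The following Python fragment is only an illustration, not Griffiths' algorithm or any published program: the function names `pair_compatible` and `is_consistent` and the toy matrices are hypothetical, and the data are assumed to be a list of rows of 0s and 1s, one row per sequence and one column per segregating site.

```python
from itertools import combinations


def pair_compatible(col_a, col_b, ancestral_known=False):
    """Pairwise compatibility of two binary sites (columns).

    If 0 is known to be ancestral, the pair is compatible when at most
    two of the patterns 01, 10, 11 occur; otherwise it is compatible
    when at most three of 00, 01, 10, 11 occur.
    """
    patterns = set(zip(col_a, col_b))  # distinct row patterns at the two sites
    if ancestral_known:
        return len(patterns - {(0, 0)}) <= 2
    return len(patterns) <= 3


def is_consistent(matrix, ancestral_known=False):
    """True if every pair of columns passes the pairwise test."""
    columns = list(zip(*matrix))  # transpose rows into site columns
    return all(
        pair_compatible(a, b, ancestral_known)
        for a, b in combinations(columns, 2)
    )


# Hypothetical toy data: 4 sequences by 3 sites, and 4 sequences by 2 sites.
toy_consistent = [
    [0, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
]
toy_inconsistent = [
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1],
]

print(is_consistent(toy_consistent))    # every pair shows at most three patterns
print(is_consistent(toy_inconsistent))  # all four patterns occur in the one pair
```

The sketch checks all pairs of columns, so its cost grows with the square of the number of sites. It decides only whether a data set passes the pairwise test; finding maximal consistent subsets of sites, or the trees themselves, needs more than this.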
The nodes in the tree represent the mutations that have generated the segregating sites, and the tips represent the sequences. A convenient algorithm for finding these trees is provided by Griffiths (1987), who also shows (Griffiths, 1989) how the probability of a tree with a given ancestral labeling can be computed under the infinitely-many-sites model. Griffiths' program PTREE can then be used
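Returning to the estimate of θ based on the number of segregating sites, the arithmetic can be sketched as follows. This is a minimal illustration under stated assumptions, not the chapter's own computation: it uses the standard form of Watterson's estimator (the number of segregating sites divided by the harmonic sum 1 + 1/2 + ... + 1/(n-1)), it assumes that an approximate interval is formed as the estimate plus or minus two estimated standard deviations with θ replaced by its estimate in the variance formula, and the names `harmonic_sums` and `watterson_estimate` are hypothetical. The sample size for the pyrimidine data is not stated in this section, so the usage line uses a made-up toy sample.

```python
from math import sqrt


def harmonic_sums(n):
    """Return (sum of 1/j, sum of 1/j^2) for j = 1, ..., n - 1."""
    g1 = sum(1.0 / j for j in range(1, n))
    g2 = sum(1.0 / j**2 for j in range(1, n))
    return g1, g2


def watterson_estimate(num_segregating, n):
    """Watterson-type estimate of theta and an approximate half-width.

    num_segregating: observed number of segregating sites
    n: number of sequences in the sample (must be at least 2)
    The half-width is two estimated standard deviations (an assumption
    about how the interval is formed), using the plug-in variance
    theta/g1 + theta^2 * g2 / g1^2.
    """
    if n < 2:
        raise ValueError("need at least two sequences")
    g1, g2 = harmonic_sums(n)
    theta_hat = num_segregating / g1
    var_hat = theta_hat / g1 + theta_hat**2 * g2 / g1**2
    return theta_hat, 2.0 * sqrt(var_hat)


# Toy usage (made-up numbers): 11 segregating sites in a sample of 4 sequences.
# Here 1 + 1/2 + 1/3 = 11/6, so the estimate is 11 / (11/6) = 6.
theta_hat, half_width = watterson_estimate(11, 4)
print(f"theta estimate: {theta_hat:.2f} +/- {half_width:.2f}")
```

The estimate uses only the count of segregating sites and the sample size, which is why it does not require knowing which type is ancestral at each site.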
