otide i. The preceding relation implies that the off-diagonal elements of R can be expressed as

Therefore, the SR model is a nine-parameter model and includes many models as special cases, e.g., the models of Jukes and Cantor (33), Kimura (34), Tajima and Nei (35), Hasegawa et al. (21), and Tamura and Nei (22). The SR model has been studied by many authors (10–14, 23, 36).

Consider two sequences (designated by 1 and 2) that have evolved from O, their common ancestor, t time units ago (Fig. 1). Under stationarity, time-reversibility means that the substitution process from the common ancestor O to sequences 1 and 2 is equivalent to the substitution process from 1 through O to 2 (or from 2 through O to 1), whose transition probability matrix for 2t time units is given by

$$P=e^{2tR}.$$ [1]

Let $\lambda_k$ ($k=1, 2, 3, 4$) be the $k$-th eigenvalue of the rate matrix $R$; one of them is zero, say $\lambda_4=0$. Let $z_k$ be the $k$-th eigenvalue of $P$. Eq. 1 implies $z_k=e^{2t\lambda_k}$. Gu and Li (11) showed that the evolutionary distance defined by the average number of substitutions per site (i.e., $d=-2t\sum_{i}\pi_i r_{ii}$) is given by

$$d=-\sum_{k=1}^{3}c_k\ln z_k,$$ [2]

where constants ck are determined by the eigenmatrix of P. Eq. 2 is generally valid since all eigenvalues zk are real under the SR model (11, 37). For example, under the Jukes-Cantor model (33), z1=z2=z3=1–4p/3 and c1=c2=c3=1/4 so that Eq. 2 is reduced to d=–(3/4)ln(1–4p/3), where p is the proportion of nucleotide differences between the two sequences.
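
As a quick numerical illustration, the following minimal Python sketch evaluates a distance of the form $d=-\sum_k c_k \ln z_k$ and specializes it to the Jukes-Cantor case described above. It assumes that Eq. 2 has this form, which is consistent with the stated reduction to $d=-(3/4)\ln(1-4p/3)$. The function names `additive_distance` and `jukes_cantor_distance` are hypothetical and introduced only for this example.

```python
import math


def additive_distance(c, z):
    """Evaluate d = -sum_k c_k * ln(z_k) over the non-unit eigenvalues z_k."""
    if any(zk <= 0.0 for zk in z):
        raise ValueError("every eigenvalue z_k must be positive for ln(z_k) to exist")
    return -sum(ck * math.log(zk) for ck, zk in zip(c, z))


def jukes_cantor_distance(p):
    """Jukes-Cantor case: z_1 = z_2 = z_3 = 1 - 4p/3 and c_1 = c_2 = c_3 = 1/4."""
    z = 1.0 - 4.0 * p / 3.0
    return additive_distance([0.25] * 3, [z] * 3)


if __name__ == "__main__":
    p = 0.3  # illustrative proportion of nucleotide differences
    # Both lines should print the same value, since the sum collapses
    # to -(3/4) * ln(1 - 4p/3) under the Jukes-Cantor constants.
    print(jukes_cantor_distance(p))
    print(-0.75 * math.log(1.0 - 4.0 * p / 3.0))
```

The general function takes the constants and eigenvalues as plain lists, so other choices of $c_k$ can be substituted without changing the code.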

FIG. 1. Two DNA sequences diverged t time units ago.

The SR distance can be estimated from the data matrix $J$, whose $ij$-th element ($J_{ij}$) is the frequency of sites at which the nucleotides in the two sequences are $i$ and $j$, respectively. By time-reversibility, we have $J_{ij}=\pi_i P_{ij}$. Therefore, the $ij$-th element of $P$ (for $2t$ time units) can be estimated by $\hat{P}_{ij}=J_{ij}/\pi_i$ ($i,j=1,\ldots,4$), where $\pi_i$ and $J_{ij}$ are easily obtained from the sequence data. Let the matrix $\hat{P}$ consist of $\hat{P}_{ij}$. Its eigenvalues $\hat{z}_k$ ($k=1, 2, 3$) can be computed by a standard algorithm, and the constants are given by $c_k=\sum_{i=1}^{4}\pi_i u_{ik}v_{ki}$ ($k=1, 2, 3$), where $u_{ik}$ and $v_{kj}$ are the elements of the corresponding eigenmatrix $U$ and its inverse matrix $V$, respectively. For details, see Saccone et al. (38), Gu and Li (11), and Li and Gu (39). The sampling variance of $d$ and the variance-covariance matrix for more than two DNA sequences can be found in Gu and Li (11).
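
The estimation procedure described above can be sketched in a few lines of Python with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the constants for the total number of substitutions take the form $c_k=\sum_i \pi_i u_{ik} v_{ki}$, it symmetrizes $J$ so that $J_{ij}=J_{ji}$ holds exactly, and instead of decomposing $\hat{P}$ directly it decomposes the similar symmetric matrix with elements $J_{ij}/\sqrt{\pi_i\pi_j}$, which has the same eigenvalues and guarantees that they are real. The function name `sr_distance` is hypothetical.

```python
import numpy as np


def sr_distance(J):
    """Estimate d = -sum_k c_k ln z_k from a 4x4 matrix of site-pattern counts."""
    J = np.asarray(J, dtype=float)
    J = J / J.sum()            # convert counts to frequencies
    J = (J + J.T) / 2.0        # assumption: impose J_ij = J_ji (time-reversibility)
    pi = J.sum(axis=1)         # nucleotide frequencies pi_i (assumed all positive)
    root = np.sqrt(pi)

    # P_hat_ij = J_ij / pi_i is similar to the symmetric matrix
    # S_ij = J_ij / sqrt(pi_i * pi_j), so both share the eigenvalues z_k.
    S = J / np.outer(root, root)
    z, W = np.linalg.eigh(S)   # ascending eigenvalues, orthonormal eigenvectors

    # With U = Pi^{-1/2} W and V = W^T Pi^{1/2}, u_ik * v_ki = W_ik ** 2,
    # hence c_k = sum_i pi_i * W_ik ** 2.
    c = (pi[:, None] * W ** 2).sum(axis=0)

    # The largest eigenvalue equals 1 (ln 1 = 0), so it is dropped.
    z, c = z[:-1], c[:-1]
    if np.any(z <= 0.0):
        raise ValueError("distance is inapplicable: a non-positive eigenvalue was found")
    return float(-np.sum(c * np.log(z)))


if __name__ == "__main__":
    # Illustrative Jukes-Cantor-like data matrix with p = 0.3.
    p = 0.3
    J = np.full((4, 4), p / 12.0)
    np.fill_diagonal(J, (1.0 - p) / 4.0)
    print(sr_distance(J))
    print(-0.75 * np.log(1.0 - 4.0 * p / 3.0))
```

Working with the symmetric matrix is a design choice made for numerical robustness; sampling error in real data can otherwise produce complex or non-positive eigenvalues, in which case the logarithm in the distance formula is undefined.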

Eq. 2 can be used to define many additive distances by choosing appropriate constants ck (Table 1), e.g., the number of nucleotide substitutions per site (K), the number of transitional substitutions per site (A), the number of transversional substitutions per site (B), and the number of substitutions from nucleotides i to j (Dij). These distance measures are useful for phylogenetic analysis and molecular clock testing.

The SRV Model. Rate variation among sites can be incorporated into the SR model by assuming $r_{ij}=a_{ij}u$, where $a_{ij}$ is a constant and $u$ varies according to a gamma distribution

$$\phi(u)=\frac{\beta^{\alpha}}{\Gamma(\alpha)}u^{\alpha-1}e^{-\beta u},$$ [3]

with mean $\bar{u}=\alpha/\beta$; $\alpha$ is the shape parameter and determines the degree of rate variation. Under this model, the (mean) transition probability matrix $P$ for $2t$ time units is given by

$$P=\left[I-\frac{2t}{\alpha}\bar{R}\right]^{-\alpha},$$ [4]

where $I$ is the identity matrix and the mean rate matrix is $\bar{R}=\bar{u}A$, where the matrix $A$ consists of $a_{ij}$ (11). From Eq. 4, one can show that the $k$-th eigenvalue of $P$ is given by

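$$z_k=\left[1-\frac{2t\lambda_k}{\alpha}\right]^{-\alpha},$$ [5]

where $\lambda_k$ is the $k$-th eigenvalue of the mean rate matrix $\bar{R}$. It follows that the evolutionary distance under the SRV model is given by

$$d=\alpha\sum_{k=1}^{3}c_k\left(z_k^{-1/\alpha}-1\right).$$ [6]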

The constants ck are determined in the same manner as above (Table 1). Note that Eq. 4 reduces to Eq. 1 and Eq. 6 to Eq. 2 as α→∞, i.e., the substitution rate is uniform among sites.
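
The gamma-corrected distance and its uniform-rate limit can be checked numerically. The sketch below assumes that Eq. 5 reads $z_k=(1-2t\lambda_k/\alpha)^{-\alpha}$ and that Eq. 6 reads $d=\alpha\sum_k c_k(z_k^{-1/\alpha}-1)$, forms chosen here because they reduce to $z_k=e^{2t\lambda_k}$ and $d=-\sum_k c_k\ln z_k$ as $\alpha\to\infty$. The function names `srv_eigenvalue` and `srv_distance` are hypothetical.

```python
import math


def srv_eigenvalue(lam, t, alpha):
    """Assumed form of Eq. 5: z_k = (1 - 2*t*lam/alpha) ** (-alpha), with lam <= 0."""
    return (1.0 - 2.0 * t * lam / alpha) ** (-alpha)


def srv_distance(c, z, alpha):
    """Assumed form of Eq. 6: d = alpha * sum_k c_k * (z_k ** (-1/alpha) - 1)."""
    if alpha <= 0.0:
        raise ValueError("the gamma shape parameter alpha must be positive")
    if any(zk <= 0.0 for zk in z):
        raise ValueError("every eigenvalue z_k must be positive")
    # z ** (-1/alpha) - 1 equals expm1(-ln(z)/alpha); expm1 avoids the loss of
    # precision that direct subtraction suffers when alpha is large.
    return alpha * sum(ck * math.expm1(-math.log(zk) / alpha) for ck, zk in zip(c, z))


if __name__ == "__main__":
    # Jukes-Cantor constants with p = 0.3, so z_k = 0.6 (illustrative values).
    c, z = [0.25] * 3, [0.6] * 3
    for alpha in (0.5, 1.0, 2.0, 10.0, 1e6):
        print(alpha, srv_distance(c, z, alpha))
    # Uniform-rate value that the distances above approach as alpha grows.
    print("uniform", -sum(ck * math.log(zk) for ck, zk in zip(c, z)))
```

For the same observed eigenvalues, a smaller $\alpha$ (stronger rate variation) yields a larger estimated distance, which is why ignoring rate variation tends to underestimate $d$.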

Furthermore, Eq. 6 can be generalized to any distribution $f(u)$ for the rate variation among sites. Let $G(s)=\int_0^{\infty}e^{su}f(u)\,du$ be the moment-generating function of $f(u)$. Gu and Li (11) showed that $z_k=G(2\lambda_k t)$, $k=1, 2, 3, 4$. Thus, the general additive distance is given by

$$d=-\sum_{k=1}^{3}c_k\,G^{-1}(z_k),$$ [7]

where $G^{-1}$ is the inverse function of the moment-generating function $G$. For example, consider the invariant+gamma model (26, 40, 41): (i) for a given site, the probability of being invariable (i.e., $u=0$) is θ, whereas the probability of being variable is 1−θ; and (ii) among the sites that are variable, the substitution rate follows a gamma distribution. By applying Eq.

Table 1. The constants ck in the general SR or SRV distance

 

K is the number of substitutions per site; A is the number of transitional substitutions per site; B is the number of transversional substitutions per site; and Dij is the number of substitutions from nucleotides i to j per site. The subscripts j≠i,Ts and j≠i,Tv mean that the differences between nucleotides i and j are transitional and transversional, respectively.



Copyright © National Academy of Sciences. All rights reserved.