HEARING DISTANT ECHOES: USING EXTREMAL STATISTICS TO PROBE EVOLUTIONARY ORIGINS

Our conclusion is that λ = lim_{n→∞} s(A,B)/n exists with probability 1 and in the mean. Therefore the optimal alignment score grows linearly with the sequence length. Obviously, λ ≥ E(s(A,B))/n. In the simplest case of interest, the alphabet has two uniformly distributed letters and s(A,B) = 0 if A ≠ B, while s(A,A) = s(B,B) = 1. The alignment score is then known as the longest common subsequence, and Chvátal and Sankoff (1975) wrote a seminal paper on this problem in the 1970s. In spite of much effort since then, λ remains undetermined. Deken (1979) gives bounds for λ: 0.7615 ≤ λ ≤ 0.8602. Without alignment, the fraction of matching letters is 0.5 by the strong law of large numbers. Not too much is known about the variance either, although Steele (1986) proves it is O(n).

LOCAL SEQUENCE COMPARISONS

Alignment Given

Consider many independent throws of a coin with probability p of heads, where 0 < p < 1. For any such p there will be stretches where the coin comes up heads every time. What is the distribution of the length of the longest of these head runs? This maximum length is known as a "local" score; while it is a global maximum, it is a function only of the nearby tosses. A related problem is to consider a sequence A_1 A_2 ... A_n of letters chosen independently from a common alphabet, for example {A,C,G,T} for DNA. The letters A and G are known as purines (R), and C and T are known as pyrimidines (Y). A two-letter alphabet is natural when grouping nucleotides by chemical similarity. In fact, there is a hypothesis that the first nucleic acid sequences were made up of just two elements, R and Y. It is natural to ask how random the distribution of R and Y is for a given sequence. We will study how large R_n, the length of the longest run of purines R, is in a sequence of length n. Here an occurrence of R is a "head" in the coin-tossing analogy.
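As a concrete illustration of the quantity R_n (the code and the toy sequence below are mine, not the chapter's), the longest purine run of a DNA string can be found in a single pass:

```python
def longest_purine_run(dna):
    """Length of the longest run of purines (A or G) in a DNA string."""
    best = cur = 0
    for ch in dna.upper():
        cur = cur + 1 if ch in "AG" else 0   # extend the run or reset it
        best = max(best, cur)
    return best

print(longest_purine_run("ACGGAAGTCC"))  # the run GGAAG gives 5
```

The same loop, with `ch in "AG"` replaced by a coin flip, is exactly the head-run problem of the coin-tossing analogy.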
The coin-tossing question was considered by Erdős and Rényi (1970), who found the strong law

    R_n / log_{1/p}(n) → 1 with probability 1 as n → ∞.     (4.1)

Of course, one may desire more detailed information about R_n. For an example, we look at "16S" rRNA from E. coli, which is 1,541 letters in length and is known by its sedimentation rate S of 16. (The sedimentation rate is an indication of mass: the greater the mass, the higher the rate of sedimentation and the larger the value of S.) Equation (4.1) tells us that we would typically see about log_{1/p}(1,541) = 10.6 R's in a row, where p = 1/2. What if we have a head run of length 14 in our 16S sequence? Is this score extreme enough to be of note?

For the statistical question of significance, we need a way to compute such probabilities. This is supplied by Poisson approximation. For an appropriately chosen test length t, an R run of length t begins at a given position with small probability. Since the number of positions where such a run could occur is large, the number of long head runs should be approximately Poisson. Our discussion of the mathematics behind this intuition follows Goldstein (1990).

This intuition is almost correct. One must first, however, adjust for the fact that runs of heads occur in "clumps"; that is, if there is a run of heads of length t beginning at position α, then with probability p there will also be a run of heads of length t beginning at position α + 1, with probability p^2 a run of heads of length t beginning at position α + 2, and so forth. Hence the total number of runs of length t or more is seen to have a compound Poisson distribution. By counting only the first such run in every clump, the occurrences now counted are no longer clumped and their number is approximately Poisson. This is an example, with average clump size 1 + p + p^2 + ... = 1/(1 − p), of the "Poisson clumping heuristic," as described by Aldous (1989).
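The strong law (4.1) is easy to check by simulation. The sketch below (my own code, using the chapter's parameters n = 1,541 and p = 1/2) averages the longest head run over repeated trials and compares it with log_{1/p}(n) ≈ 10.6; the simulated average typically lands within about half a head of that value:

```python
import math
import random

def longest_run(seq):
    """Length of the longest run of ones (heads) in a 0/1 sequence."""
    best = cur = 0
    for x in seq:
        cur = cur + 1 if x else 0
        best = max(best, cur)
    return best

random.seed(1)
n, p = 1541, 0.5
runs = [longest_run([random.random() < p for _ in range(n)])
        for _ in range(200)]
print(sum(runs) / len(runs))            # average longest head run
print(math.log(n) / math.log(1 / p))   # log_{1/p}(n) = 10.6
```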
Using the fact that having no runs of length t or more is equivalent to having the longest head run shorter than t, we can approximate the distribution of the length of the longest run of heads. In the remainder of this section we explore the approximation of this distribution by the Poisson distribution.
Let I be an index set, and for each α ∈ I, let X_α be an indicator random variable; that is, X_α = 1 if an event occurs and X_α = 0 if it does not. The total number of occurrences of events can be expressed as

    W = Σ_{α∈I} X_α.

It seems intuitive that if p_α = P(X_α = 1) is small, and |I|, the size of the index set, is large, then W should be approximately Poisson distributed. Certainly this is true when all the X_α, α ∈ I, are independent. In the case of dependence, it seems plausible that the same approximation should hold when the dependence is somewhat confined. For each α, we let B_α be the set of dependence for α; that is, for each α ∈ I, assume we are given a set B_α ⊂ I such that

    X_α is independent of X_β for β ∉ B_α.     (4.2)

Define

    b_1 = Σ_{α∈I} Σ_{β∈B_α} p_α p_β   and   b_2 = Σ_{α∈I} Σ_{β∈B_α, β≠α} p_{αβ},   where p_{αβ} = E(X_α X_β).

Let Z denote a Poisson random variable with mean λ, so that for k = 0, 1, 2, ...,

    P(Z = k) = e^{−λ} λ^k / k!.

Classically, the Poisson distribution is the probability law of rare events. It is remarkable that a few probability distributions arise with great frequency. The three principal distributions are the binomial, the normal, and the Poisson. (See Feller (1968) for an extensive discussion of these matters.)
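The "law of rare events" can be checked numerically in the independent case. The sketch below (function name mine, not the chapter's) computes the exact total variation distance between a Binomial(n, p) count — W when all X_α are independent with common p_α = p — and the Poisson distribution with the same mean np, using stable recurrences for both probability mass functions:

```python
import math

def tv_binom_poisson(n, p):
    """Total variation distance between Binomial(n, p) and Poisson(np)."""
    lam = n * p
    pois = math.exp(-lam)       # P(Z = 0)
    binom = (1 - p) ** n        # P(W = 0)
    total = abs(binom - pois)
    for k in range(1, n + 1):
        pois *= lam / k                           # Poisson pmf recurrence
        binom *= (n - k + 1) * p / (k * (1 - p))  # binomial pmf recurrence
        total += abs(binom - pois)
    return total / 2   # Poisson mass above n is negligible for small p

# Rare events (small p), many trials (large n): the distance is tiny.
print(tv_binom_poisson(1000, 0.005))
```

For n = 1000 and p = 0.005 the distance is well under 0.01, illustrating why small p_α and large |I| make the Poisson approximation work.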
Let h: Z⁺ → R, where Z⁺ = {0,1,2,...}, and ‖h‖ ≡ sup_{k≥0} |h(k)|. We denote the total variation distance between the distributions of W and Z by

    ‖W − Z‖ ≡ sup_{‖h‖=1} |E h(W) − E h(Z)|.

More general versions of the following theorem appear in Arratia et al. (1989, 1990). They refer to this approach as the Chen-Stein method.

Theorem 4.2  Let W be the number of occurrences of dependent events, and let Z be a Poisson random variable with E(Z) = E(W) = λ. Then

    ‖W − Z‖ ≤ 2(b_1 + b_2),

and in particular

    |P(W = 0) − e^{−λ}| ≤ b_1 + b_2.

We first apply Poisson approximation to the distribution of the length of long success runs in Bernoulli trials. This has application to molecular sequences and provides a good illustration of the methods needed for sequence comparisons in the case when the alignment is unknown. Let C_1, C_2, ... be independent Bernoulli random variables with success probability p, and let R_n be the length of the longest run of heads contained in the first n tosses. Fix a test length t and let the index set be I = {1, 2, ..., n − t + 1}; the elements of the index set will denote locations where head runs of length t or greater may begin. A head run of length t or more begins at position 1 if and only if the indicator random variable
    X_1 = C_1 C_2 ⋯ C_t

takes on the value 1. Now, to unclump the remaining runs, define

    X_α = (1 − C_{α−1}) C_α C_{α+1} ⋯ C_{α+t−1}   for α = 2, 3, ..., n − t + 1.

For α = 2, 3, ..., n − t + 1, X_α will be 1 if and only if a run of t or more heads begins at position α, preceded by a tail. Below we calculate b_1, show that b_2 = 0, and find a bound for the approximation. Write now the total number of clumps of runs of length t or more as the sum of dependent indicator random variables

    W = Σ_{α=1}^{n−t+1} X_α.

The Poisson approximation heuristic says we should be able to approximate the distribution of W by a Poisson random variable with mean

    λ_n(t) = E(W) = p^t((n − t)(1 − p) + 1).     (4.3)

In particular then, since we have as events {R_n < t} = {W = 0}, the distribution function of R_n can be approximated as

    P(R_n < t) = P(W = 0) ≈ e^{−λ_n(t)}.

The test length t is dictated by requiring λ to be bounded away from 0 and ∞; this is equivalent to the condition that t − log_{1/p}(n) is bounded. In fact, for integer t, with c defined by

    t = log_{1/p}((n − t)(1 − p) + 1) + c,

the above approximation predicts that
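A quick simulation can check the approximation P(R_n < t) ≈ exp(−λ_n(t)) with λ_n(t) from (4.3). The parameters below are arbitrary illustrations of my own, not the chapter's 16S example:

```python
import math
import random

def has_run(seq, t):
    """Does the 0/1 sequence contain a run of t or more ones?"""
    cur = 0
    for x in seq:
        cur = cur + 1 if x else 0
        if cur >= t:
            return True
    return False

n, t, p, trials = 1000, 12, 0.5, 2000   # illustrative parameters
random.seed(2)
hits = sum(has_run([random.random() < p for _ in range(n)], t)
           for _ in range(trials))
sim = 1 - hits / trials                               # simulated P(R_n < t)
approx = math.exp(-p**t * ((n - t) * (1 - p) + 1))    # exp(-lambda_n(t)), eq. (4.3)
print(sim, approx)
```

With these parameters the simulated and approximate values agree to within sampling error of a couple of percent.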
    P(R_n < t) ≈ e^{−λ_n(t)} = exp(−p^c);

that is, R_n − log_{1/p}((n − t)(1 − p) + 1) has an asymptotic extreme value distribution. This is almost so; the limiting distribution is complicated by the fact that R_n can assume only integer values. However, this fact does not complicate the approximation itself.

For an example we return to our problem with the 16S rRNA sequence. We model an R run of length 14 by n = 1,541 independent tosses of a fair coin. Using formula (4.3), we calculate that λ_n = 0.0700. Using the Poisson distribution, P(R_1541 ≥ 14) is approximately 1 − exp(−λ_n) = 0.0676. Even so, without a bound on the error we have no way of knowing whether the event is truly this unlikely.

To assess the accuracy of the above approximation, we apply Theorem 4.2. We define B_α = {β ∈ I : |α − β| ≤ t} for all α. Since X_α is independent of {X_β : β ∉ B_α}, condition (4.2) is satisfied. Furthermore, if 1 ≤ |α − β| ≤ t, we cannot have both X_α and X_β equal to 1, since we require that a run begin with a tail. Therefore p_{αβ} = 0 for β ∈ B_α, β ≠ α, and hence b_2 = 0. In order to calculate b_1 = Σ_α Σ_{β∈B_α} p_α p_β, we break up the sum over β ∈ B_α into two parts, depending on whether or not p_1 appears. This yields the bound

    b_1 < λ²(2t + 1)/(n − t + 1) + 2λp^t.     (4.4)

Theorem 4.2 now shows us that the Poisson approximation is quite accurate for the example considered above; the probability computed is correct to within b_1 < 10^{−4}, that is, 0.0675 ≤ P(R_n ≥ 14) ≤ 0.0677.
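Bound (4.4) is easy to evaluate numerically. A minimal sketch (function name mine), applied to the 16S parameters n = 1,541, t = 14, p = 1/2, confirms that the Chen-Stein error term is below 10^{−4}:

```python
def b1_bound(n, t, p):
    """Upper bound (4.4) on the Chen-Stein error term b_1."""
    lam = p**t * ((n - t) * (1 - p) + 1)   # lambda_n(t), equation (4.3)
    return lam**2 * (2 * t + 1) / (n - t + 1) + 2 * lam * p**t

# 16S example: n = 1541 tosses of a fair coin, test length t = 14.
print(b1_bound(1541, 14, 0.5))  # below 1e-4
```

Since b_2 = 0 for this declumped run count, this single quantity bounds the error in the Poisson approximation of P(W = 0).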