National Academies Press: OpenBook
« Previous: Alignment Unknown
Suggested Citation:"Alignment Given." National Research Council. 1995. Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology. Washington, DC: The National Academies Press. doi: 10.17226/2121.
×
Page 99
Suggested Citation:"Alignment Given." National Research Council. 1995. Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology. Washington, DC: The National Academies Press. doi: 10.17226/2121.
×
Page 100
Suggested Citation:"Alignment Given." National Research Council. 1995. Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology. Washington, DC: The National Academies Press. doi: 10.17226/2121.
×
Page 101
Suggested Citation:"Alignment Given." National Research Council. 1995. Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology. Washington, DC: The National Academies Press. doi: 10.17226/2121.
×
Page 102
Suggested Citation:"Alignment Given." National Research Council. 1995. Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology. Washington, DC: The National Academies Press. doi: 10.17226/2121.
×
Page 103
Suggested Citation:"Alignment Given." National Research Council. 1995. Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology. Washington, DC: The National Academies Press. doi: 10.17226/2121.
×
Page 104

Below is the uncorrected machine-read text of this chapter, intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text of each book. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.

HEARING DISTANT ECHOES: USING EXTREMAL STATISTICS TO PROBE EVOLUTIONARY ORIGINS 99 Our conclusion is that exists with probability 1 and in the mean. Therefore optimal alignment score grows linearly with sequence length. Obviously, λ ≥ E(s(A, B)). In the simplest case of interest, the alphabet has two uniformly distributed letters and s(A, B) = O if A ≠ B and s(A, A) = s(B, B) = 1. The alignment score is known as the longest common subsequence, and Chvátal and Sankoff (1975) wrote a seminal paper on this problem in the 1970s. In spite of much effort since then, λ remains undetermined. Deken (1979) gives bounds for λ: 0.7615 ≤ λ ≤ 0.8602. Without alignment the fraction of matching letters is 0.5 by the strong law of large numbers. Not too much is known about the variance either, although Steele (1986) proves it is O(n). LOCAL SEQUENCE COMPARISONS Alignment Given Consider many independent throws of a coin with probability p of heads, where 0 < p < 1. For any p, there will be stretches where the coin comes up heads every time. What is the distribution of the length of the longest of these head runs? This maximum length is known as a "local" score; while it is a global maximum, it is a function only of the nearby tosses. A related problem is to consider sequence A1A2. . . An of letters chosen independently and from a common alphabet, {A,C,G,T} for DNA for example. The letters A and G are known as purines (R), and C and T are known as pyrimidines (Y). A two-letter alphabet is natural when grouping nucleotides by chemical similarity. In fact, there is a hypothesis that the first nucleic acid sequences were made up of just two elements, R and Y. It is natural to ask how random the distribution of R and Y is for a given sequence. We will study how large is Rn, the length of the longest run of purines R in a sequence of length n. Here an occurrence of R is a "head" for the coin tossing analogy.

HEARING DISTANT ECHOES: USING EXTREMAL STATISTICS TO PROBE EVOLUTIONARY ORIGINS 100 The coin tossing question was considered by Erdös and Rényi (1970), who found the strong law (4.1) Of course, one may desire more detailed information about Rn. For an example, we look at "16S" rRNA from E. coli, which is 1,541 letters in length and is known by its sedimentation rate S of 16. (The sedimentation rate is an indication of mass: the greater the mass, the higher the rate of sedimentation and the larger the value of S.) Equation (4.1) tells us that we would typically see about log1/p 1,541 = 10.6 R's in a row where p = 1/2. What if we have a head run of length 14 in our 16S sequence? Is this score extreme enough to be of note? For the statistical question of significance, we need to have a way to compute such probabilities. This is supplied by Poisson approximation. For an appropriately chosen test length t, we see an R run of length t begins at a given position with a small probability. Since the number of positions where such a run could occur is large, the number of long head runs should be approximately Poisson. Our discussion about the mathematics behind this intuition follows Goldstein (1990). This intuition is almost correct. One must first, however, adjust for the fact that runs of heads occur in "clumps"; that is, if there is a run of heads of length t beginning at position a, then with probability p there will also be a run of heads of length t beginning at position α + 1, with probability p2 a run of heads of length t beginning at position α + 2, and so forth. Hence, the total number of runs of length t or more is seen to have a compound Poisson distribution. By counting only the first such run in every clump, the occurrences now counted are no longer clumped and their number is approximately Poisson. This is an example, with average clump size 1 + p + p2 . . . = 1 / (1− p), of the "Poisson clumping heuristic," as described by Aldous (1989). Using the fact that having no runs of length t is equivalent to having the longest head run shorter than t, we can approximate the distribution of the length of the longest run of heads. In the remainder of this section we explore the approximation of this distribution by the Poisson distribution.

HEARING DISTANT ECHOES: USING EXTREMAL STATISTICS TO PROBE EVOLUTIONARY ORIGINS 101 Let I be an index set, and for each a α ∈ I, let Xα be an indicator random variable, that is, Xα = 1 if an event occurs and Xα = 0 if the event does not occur. The total number of occurrences of events can be expressed as It seems intuitive that if pα = P(Xα = 1) is small, and |I|, the size of the index set, is large, then W should be approximately Poisson distributed. Certainly this is true when all the Xα, α ∈ I , are independent. In the case of dependence, it seems plausible that the same approximation should hold when dependence is somewhat confined. For each α, we let Bα be the set of dependence for α; that is, for each α ∈ I , assume we are given a set Bα ⊂ I such that Xα is independent of Xβ, β Bα. (4.2) Define Let Z denote a Poisson random variable with mean λ so that for k = 0, 1, 2,. . ., Classically, the Poisson distribution is the probability law of rare events. It is remarkable that a few probability distributions arise with great frequency. The three principal distributions are the binomial, the normal, and the Poisson. (See Feller (1968) for an extensive discussion of these matters.)

HEARING DISTANT ECHOES: USING EXTREMAL STATISTICS TO PROBE EVOLUTIONARY ORIGINS 102 Let h:Z+ → R, where Z+ = {0,1,2,. . .}, and ||h|| ≡ supk≥0|h(k)|. We denote the total variation distance between the distributions of W and Z by More general versions of the following theorem appear in Arratia et al. (1989, 1990). They refer to this approach as the Chen-Stein method. Theorem 4.2 Let W be the number of occurrences of dependent events, and let Z be a Poisson random variable with E(Z) = E(W) = λ. Then and in particular We first apply Poisson approximation to the distribution of the length of long success runs in Bernoulli trials. This has application to molecular sequences and provides a good illustration of the methods needed for sequence comparisons in the case when the alignment is unknown. Let C1,C2,. . . be independent Bernoulli random variables with success probability p, and let Rn be the length of the longest run of heads contained in the first n tosses. Fix a test level t and let the index set be I = {1,2,. . .,n − t + 1}; the elements of the index set will denote locations where head runs of length t or greater may begin. A head run of length t or more begins at position 1 if and only if the indicator random variable

HEARING DISTANT ECHOES: USING EXTREMAL STATISTICS TO PROBE EVOLUTIONARY ORIGINS 103 takes on the value 1. Now, to unclump the remaining runs define Xα = (1 − Cα−1)CαCα+1. . .Ca+τ−1 for α =2,3,. . .,n − t + l For α = 2,3,. . .,n − t + 1, Xα will be 1 if and only if a run of t or more heads begins at position α, preceded by a tail. Below we calculate b1, show that b2 = 0, and find a bound for the approximation. Write now the total number of clumps of runs of length t or more as the sum of dependent indicator random variables The Poisson approximation heuristic says we should be able to approximate the distribution of W by a Poisson random variable with mean λn(t) = E(W) = pt((n − t)(l − p)+l). (4.3) In particular then, since we have as events {Rn < t} = {W=0}, the distribution function of Rn can be approximated as P(Rn < t) = P(W = 0) @ e−ln(t). The test length t is dictated by requiring λ to be bounded away from 0 and ∞; this is equivalent to the condition that t − log1/pn is bounded. In fact, for integer t, with c defined by t = log1/p((n − t)(l − p)+l)+c, the above approximation predicts that

HEARING DISTANT ECHOES: USING EXTREMAL STATISTICS TO PROBE EVOLUTIONARY ORIGINS 104 P(Rn < t) @ e−λn(t) = exp(−pc), that is, that Rn − log1/p((n− t)(l− p) + ) has an asymptotic extreme value distribution. This is almost so; the limiting distribution is complicated by the fact that Rn can assume only integer values. However, this fact does not complicate the approximation itself. For an example we return to our problem with the 16S rRNA sequence. We model an R run of length 14 by n = 1,541 independent tosses of a fair coin. Using formula (4.3), we calculate that λn = 0.0700. Using the Poisson distribution, P(R1541 ≥ 14) is approximately 1 − exp(−λn) = 0.0676 . Even so, without a bound on the error we have no way of knowing if the event is likely or not. To assess the accuracy of the above approximation, we apply Theorem 4.2. We define Bα = {β ∈ I: |α − β| ≤ t} for all α. Since Xα is independent of {Xβ :β Bα}, condition 1 is satisfied. Furthermore, if 1 ≤ |α − ß| ≤ t, we cannot have both Xα and Xβ equal to 1 since we require that a run begin with a tail. Therefore pαβ = 0 for β ∈ Bα, β ≠ α, and hence b2 = 0. In order to calculate b1 = ∑α ∑β Bα pα pβ, we break up the sum over β ∈ Bα into two parts, depending on whether or not p1 appears. This yields the bound b1 < λ2(2t + 1)/(n − t + 1) + 2λpt. (4.4) Theorem 4.2 now shows us that the Poisson approximation is quite accurate for the example considered above; the probability computed is correct to within b1 < 10−4, that is, 0.0699 ≤ P(Rn ≥ 13) ≤ 0.0701.

Next: Alignment Unknown »
Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology Get This Book
×
 Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology
Buy Paperback | $80.00
MyNAP members save 10% online.
Login or Register to save!
Download Free PDF

As researchers have pursued biology's secrets to the molecular level, mathematical and computer sciences have played an increasingly important role—in genome mapping, population genetics, and even the controversial search for "Eve," hypothetical mother of the human race.

In this first-ever survey of the partnership between the two fields, leading experts look at how mathematical research and methods have made possible important discoveries in biology.

The volume explores how differential geometry, topology, and differential mechanics have allowed researchers to "wind" and "unwind" DNA's double helix to understand the phenomenon of supercoiling. It explains how mathematical tools are revealing the workings of enzymes and proteins. And it describes how mathematicians are detecting echoes from the origin of life by applying stochastic and statistical theory to the study of DNA sequences.

This informative and motivational book will be of interest to researchers, research administrators, and educators and students in mathematics, computer sciences, and biology.

READ FREE ONLINE

  1. ×

    Welcome to OpenBook!

    You're looking at OpenBook, NAP.edu's online reading room since 1999. Based on feedback from you, our users, we've made some improvements that make it easier than ever to read thousands of publications on our website.

    Do you want to take a quick tour of the OpenBook's features?

    No Thanks Take a Tour »
  2. ×

    Show this book's table of contents, where you can jump to any chapter by name.

    « Back Next »
  3. ×

    ...or use these buttons to go back to the previous chapter or skip to the next one.

    « Back Next »
  4. ×

    Jump up to the previous page or down to the next one. Also, you can type in a page number and press Enter to go directly to that page in the book.

    « Back Next »
  5. ×

    To search the entire text of this book, type in your search term here and press Enter.

    « Back Next »
  6. ×

    Share a link to this book page on your preferred social network or via email.

    « Back Next »
  7. ×

    View our suggested citation for this chapter.

    « Back Next »
  8. ×

    Ready to take your reading offline? Click here to buy this book in print or download it as a free PDF, if available.

    « Back Next »
Stay Connected!