National Academies Press: OpenBook
« Previous: The Basic Dynamic Programming Algorithm
Suggested Citation:"FINDING LOCAL SIMILARITIES." National Research Council. 1995. Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology. Washington, DC: The National Academies Press. doi: 10.17226/2121.
×
Page 65
Suggested Citation:"FINDING LOCAL SIMILARITIES." National Research Council. 1995. Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology. Washington, DC: The National Academies Press. doi: 10.17226/2121.
×
Page 66
Suggested Citation:"FINDING LOCAL SIMILARITIES." National Research Council. 1995. Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology. Washington, DC: The National Academies Press. doi: 10.17226/2121.
×
Page 67
Suggested Citation:"FINDING LOCAL SIMILARITIES." National Research Council. 1995. Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology. Washington, DC: The National Academies Press. doi: 10.17226/2121.
×
Page 68
Suggested Citation:"FINDING LOCAL SIMILARITIES." National Research Council. 1995. Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology. Washington, DC: The National Academies Press. doi: 10.17226/2121.
×
Page 69

Below is the uncorrected machine-read text of this chapter, intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text of each book. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.

SEEING CONSERVED SIGNALS: USING ALGORITHMS TO DETECT SIMILARITIES BETWEEN BIOSEQUENCES 65 FINDING LOCAL SIMILARITIES We now turn to the problem of finding local alignments, that is, subsegments of A and B that align with maximal score. Local alignments can be visualized as paths in the edit graph, GA,B. Unlike the global alignment problem, the path may start and end at any vertices, not just from and Φ. Intrinsic to determining local similarities is the requirement that the scoring scheme δ be designed with a negative bias. That is, for alignment of unrelated sequences (under some suitable stochastic model of the sequences) the score of a path must on the average be negative. If this were not the case, then longer paths would tend to have higher scores, and one would generally end up reporting a global alignment between two sequences as the optimal local alignment. For example, the simple scoring scheme of Figure 3.1 is not negatively biased, whereas the scheme of Figure 3.4 is. Note that under this new scheme, the alignment is now optimal with score 3.34, whereas now has lesser score 3. In this case, the optimal alignment happened to be global, but for longer sequences this is generally not the case. For example, the best local alignment between GAGGTTGCTGAGAA and ACTCTTCTTCCTTA is the alignment score 4.34 between the underlined substrings. of score 4.34 between the underlined substrings. Figure 3.4 A local-alignments scoring scheme. The design of scoring schemes that properly weigh alignments to expose biologically meaningful local similarities is the subject of much investigation. The score of alignments between protein sequences is the sum of scores assigned to individual pairs of aligned symbols, just as for

SEEING CONSERVED SIGNALS: USING ALGORITHMS TO DETECT SIMILARITIES BETWEEN BIOSEQUENCES 66 DNA. However, since proteins are represented by combinations of 20 letters and the gap symbol, the table of scores is now 21 × 21. These scores may be chosen by users to fit the notion of similarity they have in mind for the comparison. For example, Dayhoff et al. (1983) compiled statistics on the frequency with which one amino acid would mutate into another over a fixed period of time and from these built a table of aligned symbol scores consisting of the logarithm of the normalized frequencies. Under Dayhoff's scoring scheme, the score of an alignment is a coarse estimate of the likelihood that one segment has mutated into the other. Figure 3.5 is a scaled integer approximation of Dayhoffs matrix that is much used in practice today. The basic issue in local alignment, just as in the case of global alignment, is to find a path of maximal score. However, there are more degrees of freedom in the local alignment problem: where the paths begin and where they end is not given a priori but is part of the problem. Note that if we knew the vertex (g,h) at which the best path began, we could find its score and end-vertex by setting S(g,h) to 0 and then applying the fundamental recurrence to all vertices (i, j) for which i ≥ g and j ≥ h. We can capture all potential start vertices simultaneously by modifying the central recurrence so that 0 is a term in the computation of the maximum; that is, S(i, j) = {0,S(i − 1,j − 1) + δ (ai,bj), S(i −1,j) + δ (ai,−), S(i,− 1) + δ (−,bj)}. Indeed, with this simple modification, S(i, j) is now the score of the highest-scoring path to (i, j) that begins at some vertex (g, h) for which g ≤ i and h ≤ j . The best score of a path in the edit graph is then the maximum over all vertices in the graph of their S-values. A vertex achieving this maximum is the end of an optimal path. This basic result is often referred to as the Smith-Waterman algorithm after its inventors (Smith and Waterman, 1981). The beginning of the path, the segments it aligns, and the alignment between these segments can all be delivered in linear space by further extensions of the treatment given above for global alignments. If one uses such a comparison algorithm with the scoring

SEEING CONSERVED SIGNALS: USING ALGORITHMS TO DETECT SIMILARITIES BETWEEN BIOSEQUENCES 67 Figure 3.5 A protein local-alignment scoring scheme.

SEEING CONSERVED SIGNALS: USING ALGORITHMS TO DETECT SIMILARITIES BETWEEN BIOSEQUENCES 68 scheme of Figure 3.5, one sees the three regions of similarity shown in Figure 3.6 between the sequence of the monkey somatotropin protein and the somatotropin precursor protein of a rainbow trout. Note that while in many cases the aligned symbols are identical, they do not have to be. Thus far we have presented the local similarity problem as one of finding two subsegments of the sequences that align with maximal score. But as illustrated in Figure 3.6, the ultimate goal is to expose not a single such alignment, but all the significantly conserved segments, ideally nonoverlapping as in the somatotropin example of Figure 3.5. To this end, Waterman and Eggert (1987) proposed the following simple algorithm. Find a highest- scoring local alignment by the method indicated in the previous paragraph. Eliminate every edge in the edit graph that is adjacent to a vertex on the path of this local alignment. Now find a highest-scoring path over the remaining graph. Eliminate the edges adjacent to this second-best path, and proceed to find a third-best path, and so on. In this way, one produces a series of local alignments of decreasing score whose underlying paths do not intersect. Note that this procedure may generate local alignments whose substrings overlap. Nonetheless, this procedure is very effective in identifying the biologically relevant local homologies between two sequences. As originally presented, the algorithm requires O(MN) space, but recent refinements by Chao and Miller (1994) have reduced both storage and computing time and have permitted the comparison of two sequences of length 100,000 on a conventional workstation in several hours. The output of such a problem could be displayed as a sequence of alignments, as in Figure 3.6. It is also convenient and illuminating to depict all the alignments as paths in an edit graph, as in Figure 3.2. However, as the sequences become larger and larger, one must ''step back" from the details of the edit graph. Figure 3.7 is a depiction of the edit graph of the monkey and rainbow trout somatotropin sequences of Figure 3.6 where only the paths corresponding to the three aligned segment pairs are drawn. At this level of resolution, the small gaps in the alignments of the second and third segment pairs appear as small discontinuities in paths that otherwise follow the direction of the diagonal of the edit graph grid or matrix. When the sequences become very large, say on the order of 100,000 nucleotides, then small local alignments are not seen, and neither are gaps in large alignments unless they are very large. Nonetheless, such

SEEING CONSERVED SIGNALS: USING ALGORITHMS TO DETECT SIMILARITIES BETWEEN BIOSEQUENCES 69 Figure 3.6 Conserved regions of two somatotropin proteins.

Next: Variations in Gap Cost Penalties »
Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology Get This Book
×
 Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology
Buy Paperback | $80.00
MyNAP members save 10% online.
Login or Register to save!
Download Free PDF

As researchers have pursued biology's secrets to the molecular level, mathematical and computer sciences have played an increasingly important role—in genome mapping, population genetics, and even the controversial search for "Eve," hypothetical mother of the human race.

In this first-ever survey of the partnership between the two fields, leading experts look at how mathematical research and methods have made possible important discoveries in biology.

The volume explores how differential geometry, topology, and differential mechanics have allowed researchers to "wind" and "unwind" DNA's double helix to understand the phenomenon of supercoiling. It explains how mathematical tools are revealing the workings of enzymes and proteins. And it describes how mathematicians are detecting echoes from the origin of life by applying stochastic and statistical theory to the study of DNA sequences.

This informative and motivational book will be of interest to researchers, research administrators, and educators and students in mathematics, computer sciences, and biology.

READ FREE ONLINE

  1. ×

    Welcome to OpenBook!

    You're looking at OpenBook, NAP.edu's online reading room since 1999. Based on feedback from you, our users, we've made some improvements that make it easier than ever to read thousands of publications on our website.

    Do you want to take a quick tour of the OpenBook's features?

    No Thanks Take a Tour »
  2. ×

    Show this book's table of contents, where you can jump to any chapter by name.

    « Back Next »
  3. ×

    ...or use these buttons to go back to the previous chapter or skip to the next one.

    « Back Next »
  4. ×

    Jump up to the previous page or down to the next one. Also, you can type in a page number and press Enter to go directly to that page in the book.

    « Back Next »
  5. ×

    To search the entire text of this book, type in your search term here and press Enter.

    « Back Next »
  6. ×

    Share a link to this book page on your preferred social network or via email.

    « Back Next »
  7. ×

    View our suggested citation for this chapter.

    « Back Next »
  8. ×

    Ready to take your reading offline? Click here to buy this book in print or download it as a free PDF, if available.

    « Back Next »
Stay Connected!