National Academies Press: OpenBook
« Previous: The Duality Between Similarity and Difference Measures
Suggested Citation:"Aligning More Than Two Sequences at a Time." National Research Council. 1995. Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology. Washington, DC: The National Academies Press. doi: 10.17226/2121.
×
Page 73
Suggested Citation:"Aligning More Than Two Sequences at a Time." National Research Council. 1995. Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology. Washington, DC: The National Academies Press. doi: 10.17226/2121.
×
Page 74
Suggested Citation:"Aligning More Than Two Sequences at a Time." National Research Council. 1995. Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology. Washington, DC: The National Academies Press. doi: 10.17226/2121.
×
Page 75

Below is the uncorrected machine-read text of this chapter, intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text of each book. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.

SEEING CONSERVED SIGNALS: USING ALGORITHMS TO DETECT SIMILARITIES BETWEEN BIOSEQUENCES 73 cost alignment between sequences A and B. In honor of its inventor, this score is formally known as the generalized Levenshtein measure or distance between sequences A and B. Indeed the measure, D, between sequences forms a metric space over sequences if the underlying scoring function δ forms a metric space over the underlying alphabet. Thus calling this measure a distance is formally correct for a wide class of scoring schemes δ. Immediately note that the distance and similarity perspectives are complementary. To solve a difference problem, we need only revise our previous discussions by replacing maximum with minimum in every sentence and formula. Also, one could simply take a δ for a difference problem and multiply every score by −1. Applying the similarity algorithm with the modified scores would produce optimal alignments for the original difference problem, and multiplying the resultant similarity score by −1 would give the distance between the two sequences. Aligning More Than Two Sequences at a Time Molecular biologists are frequently interested in comparing more than two sequences simultaneously. For instance, given a number of sequences of the same functionality, it is much more likely that the similarity that gives this common function will be more evident among the group than among two sequences from the group. A closely related problem is to discover the evolutionary relationships between a set of sequences by constructing an evolutionary tree, orphylogeny , that minimizes the evolutionary changes that must have taken place along each branch of the tree. A third application for aligning a collection of sequences is to correct errors in the "raw" experimental data obtained in DNA sequencing experiments. Typically, 1 to 10 percent of the symbols in a sequenced fragment are incorrect, missing, or spurious. These errors are detected and corrected by sequencing a given stretch several times and then forming a consensus by aligning the sequences. Figure 3.8 illustrates a multi- alignment of such sequence data.

SEEING CONSERVED SIGNALS: USING ALGORITHMS TO DETECT SIMILARITIES BETWEEN BIOSEQUENCES 74 Figure 3.8 A multi-alignment of five DNA sequences. Suppose we wish to align K sequences A1, A2,. . ., AK, where is of length Ni. As for the basic problem, we wish to arrange the sequences into a tableau using dashes to force the alignment of certain characters in given columns. For example, in Figure 3.8 the dashes are placed so as to arrange columns consisting of primarily one symbol. For each column, the consensus of the column is the symbol that occurs the greatest number of times in that column. Concatenating these consensus characters together, ignoring dashes, gives the consensus sequence for the five experimental trials. As for pairwise alignments, each column of K symbols of the multi-alignment is scored according to a user-supplied function δ. For example, if δ is the number of symbols in the column not equal to the majority symbol of the column (which can be a dash), then the multi-alignment of Figure 3.8 has score 13, and this is the minimum possible score over all possible multi-alignments of the five sequences. The problem of finding a maximum (minimum)-scoring alignment among K sequences can be solved by extending the dynamic programming recurrence for the basic problem from a recurrence over a two-dimensional matrix to a recurrence over a K-dimensional matrix.Let i = (i1, i2,. . .,iK ) be a vector in K-dimensional Cartesian space. Now we compute a K-dimensional array S, where S(i) is the score of the best alignment among the prefix sequences . The central recurrence now becomes

SEEING CONSERVED SIGNALS: USING ALGORITHMS TO DETECT SIMILARITIES BETWEEN BIOSEQUENCES 75 In terms of an edit graph model, imagine a grid of vertices in K-dimensional space where each vertex i has 2K − 1 edges directed into it, each corresponding to a column that when appended to the alignment for the edge's tail gives rise to the alignment for the prefix sequences represented by i. Computing the S values in some topological ordering requires a total of O(NK) time, where N = maxj Nl. While multiple sequence comparison algorithms of this genre are conceptually straightforward, they take an exponential amount of time in K and are thus generally impractical for K > 3. Multiple sequence comparison has been shown to be NP-complete (Garey and Johnson, 1979), which means that it is almost surely the case that any algorithm for this problem must exhibit time behavior that is exponential in K. Thus many authors have sought heuristic approximations, the most popular of which is to take O(K N2) time to compute all pairwise optimal alignments between the K sequences, and then produce a multiple sequence alignment by merging these pairwise alignments. Note that any multiple sequence alignment induces an alignment between a given pair of sequences (take the two rows of the tableau and remove any columns consisting of just dashes). However, given all of the possible K(K − 1)/2 pairwise alignments between K sequences, it is almost always impossible to arrange a multi-alignment consistent with them all. Try, for example, merging the best pairwise alignments among ACG, CGA, and GAC. But, given any K − 1 alignments relating all the sequences (that is, a spanning tree of the complete graph of sequence pairs), it is always possible to do so. Feng and Doolittle (1987) compare a number of methods based on this approach. The most recent algorithms utilize the natural choice of the K − 1 alignments whose scores sum to the minimal possible amount (that is, a minimum spanning tree of the complete graph of sequence pairs). However, such merges do not always lead to optimal alignments, as is illustrated by the following example:

Next: K-Best Alignments »
Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology Get This Book
×
Buy Paperback | $80.00
MyNAP members save 10% online.
Login or Register to save!
Download Free PDF

As researchers have pursued biology's secrets to the molecular level, mathematical and computer sciences have played an increasingly important role—in genome mapping, population genetics, and even the controversial search for "Eve," hypothetical mother of the human race.

In this first-ever survey of the partnership between the two fields, leading experts look at how mathematical research and methods have made possible important discoveries in biology.

The volume explores how differential geometry, topology, and differential mechanics have allowed researchers to "wind" and "unwind" DNA's double helix to understand the phenomenon of supercoiling. It explains how mathematical tools are revealing the workings of enzymes and proteins. And it describes how mathematicians are detecting echoes from the origin of life by applying stochastic and statistical theory to the study of DNA sequences.

This informative and motivational book will be of interest to researchers, research administrators, and educators and students in mathematics, computer sciences, and biology.

  1. ×

    Welcome to OpenBook!

    You're looking at OpenBook, NAP.edu's online reading room since 1999. Based on feedback from you, our users, we've made some improvements that make it easier than ever to read thousands of publications on our website.

    Do you want to take a quick tour of the OpenBook's features?

    No Thanks Take a Tour »
  2. ×

    Show this book's table of contents, where you can jump to any chapter by name.

    « Back Next »
  3. ×

    ...or use these buttons to go back to the previous chapter or skip to the next one.

    « Back Next »
  4. ×

    Jump up to the previous page or down to the next one. Also, you can type in a page number and press Enter to go directly to that page in the book.

    « Back Next »
  5. ×

    To search the entire text of this book, type in your search term here and press Enter.

    « Back Next »
  6. ×

    Share a link to this book page on your preferred social network or via email.

    « Back Next »
  7. ×

    View our suggested citation for this chapter.

    « Back Next »
  8. ×

    Ready to take your reading offline? Click here to buy this book in print or download it as a free PDF, if available.

    « Back Next »
Stay Connected!