National Academies Press: OpenBook
« Previous: Aligning More Than Two Sequences at a Time
Suggested Citation:"K-Best Alignments." National Research Council. 1995. Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology. Washington, DC: The National Academies Press. doi: 10.17226/2121.
×
Page 76
Suggested Citation:"K-Best Alignments." National Research Council. 1995. Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology. Washington, DC: The National Academies Press. doi: 10.17226/2121.
×
Page 77

Below is the uncorrected machine-read text of this chapter, intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text of each book. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.

SEEING CONSERVED SIGNALS: USING ALGORITHMS TO DETECT SIMILARITIES BETWEEN BIOSEQUENCES 76 While the choice of δ for a multi-alignment scoring scheme is conceptually a function of K arguments, it is often the case that δ is effectively defined in terms of an underlying pairwise scoring function δ'. For example, the sum-of-pairs score is defined as , where one must let δ(-,-) = 0. In essence, the sum-of-pairs multi-alignment score is the sum of the scores of the K(K − 1)/2 pairwise alignments it induces. Another common scheme is the consensus score, which defines δ(a1, a2,. . ., aK ) as max/min{Σ jδ'(c,ai):∈ψ {−}}. The symbol c that gives the best score is said to be the consensus symbol for the column, and the concatenation of these symbols is the consensus sequence. In effect, the consensus multi-alignment score is the sum of the scores of the K pairwise alignments of the sequences versus the consensus. The example of Figure 3.8 is such a scoring scheme where d' is the scoring scheme of Figure 3.1. While we do not show it here, the problem of determining minimal phylogenies mentioned at the start of this subsection can also be modeled as an instance of a multiple sequence alignment problem by choosing a δ for columns that suitably encodes the tree relating the sequences (Sankoff, 1975). However, the more general phylogeny problem requires that one also determine the tree that produces the minimal score. This daunting task essentially requires the exploration of the space of all possible trees with K vertices. So in practice, evolutionary biologists have put a great deal of effort into designing heuristic algorithms for the phylogeny problem, and there is much debate about which of these is best. K-Best Alignments The alignment algorithm in the section "The Basic Dynamic Programming Algorithm" above reports an optimal alignment that is clearly a function of the choice of scoring scheme. Unfortunately, biologists have not yet ascertained which scoring schemes are "correct" for a given comparison domain. This uncertainty has suggested the problem of listing all alignments near the optimum in the hope of generating the biologically correct alignment.

SEEING CONSERVED SIGNALS: USING ALGORITHMS TO DETECT SIMILARITIES BETWEEN BIOSEQUENCES 77 From the point of view of the edit graph formulation, the K-best problem is to deliver the K-best shortest source-to-sink paths, a problem much studied in the operations research literature. Indeed, there is an O(MN + KN) time and space algorithm, immediately available from this literature (Fox, 1973), that delivers the K-best paths over an edit graph. The algorithm delivers these paths/alignments in order of score, and K does not need to be known a priori: the next best alignment is available in O(N) time. The essential idea of the algorithm is to keep, at each vertex v, an ordered list of the score of the next best path to the sink through each edge out of v. The next best path is traceable using these ordered lists and is extracted, and the lists are appropriately updated. If all one desires is an enumeration, not necessarily in order of score, of all alignments that are within ε of the optimal difference D(A, B), then a simpler method is available that requires only the matrix S of the dynamic programming computation. While not any faster in time, the simpler alternative below does require only O(MN) space. One can imagine tracing back all paths from the sink to the source in a recursive fashion. The essential idea of the algorithm is to limit the traceback to only those paths of score not greater than D(A, B) + ε. Suppose one reaches vertex (i, j) and the score of the path thus far traversed from the sink to this vertex is T(i, j). Then one traces back to predecessor vertices (i − 1, j), (i − 1, j − 1), and (i, j − 1) if and only if: S(i − 1, j) + δ(ai, −) + T(i, j) ≤ D(A, B) + ε, S(i − l, j − 1) + δ (ai, bj) + T(i, j) ≤ D(A, B) + ε, S(i, j − 1) + δ (−, bj) + T(i, j) ≤ D(A, B) + ε, respectively. This procedure is very simple, space economical, and quite fast. A classic example of the need for affine gap costs was presented in a paper by Smith and Fitch (1983) comparing the α and β chicken hemoglobin chains. For a setting of the gap costs that gave the biologically correct alignment, there were 17 optimal alignments, 1,317 alignments within 5 percent of the optimum, and 20,137,655 within 20 percent of the optimum. This kind of exponential growth suggests that perhaps rather than list alignments, one should report the best possible

Next: Approximate Pattern Matching »
Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology Get This Book
×
 Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology
Buy Paperback | $80.00
MyNAP members save 10% online.
Login or Register to save!
Download Free PDF

As researchers have pursued biology's secrets to the molecular level, mathematical and computer sciences have played an increasingly important role—in genome mapping, population genetics, and even the controversial search for "Eve," hypothetical mother of the human race.

In this first-ever survey of the partnership between the two fields, leading experts look at how mathematical research and methods have made possible important discoveries in biology.

The volume explores how differential geometry, topology, and differential mechanics have allowed researchers to "wind" and "unwind" DNA's double helix to understand the phenomenon of supercoiling. It explains how mathematical tools are revealing the workings of enzymes and proteins. And it describes how mathematicians are detecting echoes from the origin of life by applying stochastic and statistical theory to the study of DNA sequences.

This informative and motivational book will be of interest to researchers, research administrators, and educators and students in mathematics, computer sciences, and biology.

READ FREE ONLINE

  1. ×

    Welcome to OpenBook!

    You're looking at OpenBook, NAP.edu's online reading room since 1999. Based on feedback from you, our users, we've made some improvements that make it easier than ever to read thousands of publications on our website.

    Do you want to take a quick tour of the OpenBook's features?

    No Thanks Take a Tour »
  2. ×

    Show this book's table of contents, where you can jump to any chapter by name.

    « Back Next »
  3. ×

    ...or use these buttons to go back to the previous chapter or skip to the next one.

    « Back Next »
  4. ×

    Jump up to the previous page or down to the next one. Also, you can type in a page number and press Enter to go directly to that page in the book.

    « Back Next »
  5. ×

    To search the entire text of this book, type in your search term here and press Enter.

    « Back Next »
  6. ×

    Share a link to this book page on your preferred social network or via email.

    « Back Next »
  7. ×

    View our suggested citation for this chapter.

    « Back Next »
  8. ×

    Ready to take your reading offline? Click here to buy this book in print or download it as a free PDF, if available.

    « Back Next »
Stay Connected!