National Academies Press: OpenBook
« Previous: COMPARING ONE SEQUENCE AGAINST A DATABASE
Suggested Citation:"Heuristic Algorithms." National Research Council. 1995. Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology. Washington, DC: The National Academies Press. doi: 10.17226/2121.
×
Page 82
Suggested Citation:"Heuristic Algorithms." National Research Council. 1995. Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology. Washington, DC: The National Academies Press. doi: 10.17226/2121.
×
Page 83

Below is the uncorrected machine-read text of this chapter, intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text of each book. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.

SEEING CONSERVED SIGNALS: USING ALGORITHMS TO DETECT SIMILARITIES BETWEEN BIOSEQUENCES 82 A similarity search of a database takes a relatively short query sequence of a protein or DNA fragment and searches every entry in the database for evidence of similarity with the query. In protein databases, the query sequence and the entries in the database are typically of similar sizes. In DNA databases, the entries are typically much longer than the query sequence, and one is looking for subsegments of the entry that match the query. Heuristic Algorithms The problem of searching for protein similarities efficiently has led many investigators to abandon dynamic programming algorithms (for which the size of the problem has become too large) and instead consider designing very fast heuristic procedures: simple, often ad hoc, computational procedures that produce answers that are "nearly" correct with respect to a formally stated optimization criterion. One of the most popular database searching tools of this genre is FASTA (Lipman and Pearson, 1985). FASTA looks for entries that share a significant number of short identical subsequences of symbols with the query sequence. Any entry meeting this criterion is then compared via dynamic programming with the query sequence. In this way, the vast majority of entries are eliminated from consideration quickly. FASTA reports most of the alignments that would be identified by an equivalent dynamic programming calculation, but it misses some matches and also reports some spurious matches. On the other hand, FASTA is very fast. A more recently developed heuristic algorithm is BLASTA (Altschul et al., 1990). BLASTA is faster than FASTA but is capable of detecting biologically meaningful similarities with accuracy comparable to that of FASTA. Given a query A and an entry B, BLASTA searches for segment pairs of high score. A segment pair is a substring from A and a substring from B of equal length, and the score of the pair is that of the no-gap alignment between them. One can argue that the presence of a high-scoring segment pair or pairs is evidence of functional similarity between proteins, because insertion and deletion events tend to significantly change the shape of a protein and hence its function. Note that segment pairs embody a local similarity concept. What is particularly useful is that there is a formula for the probability that two sequences have

SEEING CONSERVED SIGNALS: USING ALGORITHMS TO DETECT SIMILARITIES BETWEEN BIOSEQUENCES 83 a segment pair above a certain score. Thus BLASTA can give an assessment of the statistical significance of any match that it reports. For a given threshold, T, BLASTA returns to the user all database entries that have a segment pair with the query of score greater than T ranked according to probability. BLASTA may miss some such matches, although in practice it misses very few. The central idea used in BLASTA is the notion of a neighborhood. The t-neighborhood of a sequence S is the set of all sequences that align with S with score better than t. In the case of BLASTA, the t -neighborhood of S is exactly those sequences of equal length that form a segment pair of score higher than t under the Dayhoff scoring scheme (see Figure 3.5). This concept suggests a simple strategy for finding all entries that have segment pairs of length k and score greater than t with the query: generate the set of all sequences that are in the t -neighborhood of some k-substring of the query and see if an entry contains one of these strings. Scanning for an exact match to one of the strings in the neighborhood can be performed very efficiently: on the order of 0.5 million characters per second on a 20 SPECint computer. Unfortunately, for the general problem, the length of the segment pair is not known in advance, and even more devastating is the fact that the number of sequences in a neighborhood grows exponentially in both k and t, rendering it impractical for reasonable values of t. To circumvent this difficulty, BLASTA uses the fast scanning strategy above to find short segment pairs of length k above a score t, and then checks each of these to see if they are a portion of a segment pair of score T or greater. This approach is heuristic (that is, may miss some segment pairs) because it is possible for every length k subsegment pair of a segment pair of score T to have score less than t. Nonetheless, with k = 4 and t = 17 such misses are very rare, and BLASTA takes about 3 seconds for every 1 million characters of data searched. To get an idea of the relative efficiency of various similarity searching approaches, consider the following rough timing estimates for a typical 20 SPECint workstation and a search against a typical protein query. The dynamic programming algorithm for local similarities presented above (also known as the Smith-Waterman algorithm) takes roughly 1000.0N microseconds to search a database with a total of N characters in it. On the other hand, FASTA takes 20.0N microseconds, and BLASTA only about 2.0N microseconds. At the other end of the spectrum, the systolic array

Next: Sublinear Similarity Searches »
Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology Get This Book
×
 Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology
Buy Paperback | $80.00
MyNAP members save 10% online.
Login or Register to save!
Download Free PDF

As researchers have pursued biology's secrets to the molecular level, mathematical and computer sciences have played an increasingly important role—in genome mapping, population genetics, and even the controversial search for "Eve," hypothetical mother of the human race.

In this first-ever survey of the partnership between the two fields, leading experts look at how mathematical research and methods have made possible important discoveries in biology.

The volume explores how differential geometry, topology, and differential mechanics have allowed researchers to "wind" and "unwind" DNA's double helix to understand the phenomenon of supercoiling. It explains how mathematical tools are revealing the workings of enzymes and proteins. And it describes how mathematicians are detecting echoes from the origin of life by applying stochastic and statistical theory to the study of DNA sequences.

This informative and motivational book will be of interest to researchers, research administrators, and educators and students in mathematics, computer sciences, and biology.

READ FREE ONLINE

  1. ×

    Welcome to OpenBook!

    You're looking at OpenBook, NAP.edu's online reading room since 1999. Based on feedback from you, our users, we've made some improvements that make it easier than ever to read thousands of publications on our website.

    Do you want to take a quick tour of the OpenBook's features?

    No Thanks Take a Tour »
  2. ×

    Show this book's table of contents, where you can jump to any chapter by name.

    « Back Next »
  3. ×

    ...or use these buttons to go back to the previous chapter or skip to the next one.

    « Back Next »
  4. ×

    Jump up to the previous page or down to the next one. Also, you can type in a page number and press Enter to go directly to that page in the book.

    « Back Next »
  5. ×

    To search the entire text of this book, type in your search term here and press Enter.

    « Back Next »
  6. ×

    Share a link to this book page on your preferred social network or via email.

    « Back Next »
  7. ×

    View our suggested citation for this chapter.

    « Back Next »
  8. ×

    Ready to take your reading offline? Click here to buy this book in print or download it as a free PDF, if available.

    « Back Next »
Stay Connected!