National Academies Press: OpenBook
« Previous: Heuristic Algorithms
Suggested Citation:"Sublinear Similarity Searches." National Research Council. 1995. Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology. Washington, DC: The National Academies Press. doi: 10.17226/2121.
×
Page 84
Suggested Citation:"Sublinear Similarity Searches." National Research Council. 1995. Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology. Washington, DC: The National Academies Press. doi: 10.17226/2121.
×
Page 85

Below is the uncorrected machine-read text of this chapter, intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text of each book. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.

SEEING CONSERVED SIGNALS: USING ALGORITHMS TO DETECT SIMILARITIES BETWEEN BIOSEQUENCES 84 chip described above takes only 0.3N microseconds to perform the Smith-Waterman algorithm with its special- purpose (and expensive) hardware. Sublinear Similarity Searches The total number N of characters of sequence in biosequence databases is growing exponentially. On the other hand, the size of the query sequences is basically fixed; for example, a protein sequence's length is bounded by 1,500 and averages 300. So designers of efficient computational methods should be principally concerned with how the time to perform such a search grows as a function of N. Yet all currently used methods take an amount of time that grows linearly in N; that is, they are O(N) algorithms. This includes not only rigorous methods such as the dynamic programming algorithms mentioned above but also the popular heuristics FASTA and BLASTA. Even the systolic array chips described above do not change this. When a database increases in size by a factor of 1,000, all of these O(N) methods take 1,000 times longer to search that database. Using the timing estimates given above, it follows that while a custom chip may take about 3 seconds to search 10 million amino acids or nucleotides, it will take 3,000 seconds, or about 50 minutes, to search 10 billion symbols. And this is the fastest of the linear methods: BLASTA will take hours, and the Smith-Waterman algorithm will take months. One could resort to massive parallelism, but such machinery is beyond the budget of most investigators, and it is unlikely that speedups due to improvements in hardware technology will keep up with sequencing rates in the next decade. What would be very desirable, if not essential, is to have search methods with computing time sublinear in N, that is, O(Nα) for some α < 1. For example, suppose there is an algorithm that takes O(N0.5 ) time, which is to say that as N grows, the time taken grows as the square root of N. For example, if the algorithm takes about 10 seconds on a 10 million symbol database, then on 10 billion symbols, it will take about 1,0000.5 ≈ 31 times longer, or about 5 minutes. Note that while an O(N0.5) algorithm may be slower than an O(N) algorithm on 10 million symbols, it may be faster on 10 billion. Figure 3.9 illustrates this

SEEING CONSERVED SIGNALS: USING ALGORITHMS TO DETECT SIMILARITIES BETWEEN BIOSEQUENCES 85 "crossover": in this figure, the size of N at which the O(N0.5) algorithm overtakes the O(N) algorithm is approximately l × 108. Similarly, an O(N 00.2) algorithm that takes, say, 15 seconds on 10 million symbols, will take about 1 minute, or only 4 times longer, on 10 billion. To forcefully illustrate the point, we chose to let our examples be slower at N = 10 million than the competing O(N) algorithm. As will be seen in a moment, a sublinear algorithm does exist that is actually already much faster on databases of size 10 million. The other important thing to note is that we are not considering heuristic algorithms here. What we desire is nothing less than algorithms that accomplish exactly the same computational task of complete comparison as the dynamic programming algorithms, but are much faster because the computation is performed in a clever way. A recent result on the approximate string matching problem under the simple unit-cost scheme of Figure 3.1 portends the possibility of truly sublinear algorithms for the general problem. For relatively stringent matches, this new algorithm is 3 to 4 orders of magnitude more efficient Figure 3.9 Sublinear versus linear algorithms.

Next: OPEN PROBLEMS »
Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology Get This Book
×
 Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology
Buy Paperback | $80.00
MyNAP members save 10% online.
Login or Register to save!
Download Free PDF

As researchers have pursued biology's secrets to the molecular level, mathematical and computer sciences have played an increasingly important role—in genome mapping, population genetics, and even the controversial search for "Eve," hypothetical mother of the human race.

In this first-ever survey of the partnership between the two fields, leading experts look at how mathematical research and methods have made possible important discoveries in biology.

The volume explores how differential geometry, topology, and differential mechanics have allowed researchers to "wind" and "unwind" DNA's double helix to understand the phenomenon of supercoiling. It explains how mathematical tools are revealing the workings of enzymes and proteins. And it describes how mathematicians are detecting echoes from the origin of life by applying stochastic and statistical theory to the study of DNA sequences.

This informative and motivational book will be of interest to researchers, research administrators, and educators and students in mathematics, computer sciences, and biology.

READ FREE ONLINE

  1. ×

    Welcome to OpenBook!

    You're looking at OpenBook, NAP.edu's online reading room since 1999. Based on feedback from you, our users, we've made some improvements that make it easier than ever to read thousands of publications on our website.

    Do you want to take a quick tour of the OpenBook's features?

    No Thanks Take a Tour »
  2. ×

    Show this book's table of contents, where you can jump to any chapter by name.

    « Back Next »
  3. ×

    ...or use these buttons to go back to the previous chapter or skip to the next one.

    « Back Next »
  4. ×

    Jump up to the previous page or down to the next one. Also, you can type in a page number and press Enter to go directly to that page in the book.

    « Back Next »
  5. ×

    To search the entire text of this book, type in your search term here and press Enter.

    « Back Next »
  6. ×

    Share a link to this book page on your preferred social network or via email.

    « Back Next »
  7. ×

    View our suggested citation for this chapter.

    « Back Next »
  8. ×

    Ready to take your reading offline? Click here to buy this book in print or download it as a free PDF, if available.

    « Back Next »
Stay Connected!