Below is the uncorrected machine-read text of this chapter, intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text of each book. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.
SEEING CONSERVED SIGNALS: USING ALGORITHMS TO DETECT SIMILARITIES BETWEEN BIOSEQUENCES 84 chip described above takes only 0.3N microseconds to perform the Smith-Waterman algorithm with its special- purpose (and expensive) hardware. Sublinear Similarity Searches The total number N of characters of sequence in biosequence databases is growing exponentially. On the other hand, the size of the query sequences is basically fixed; for example, a protein sequence's length is bounded by 1,500 and averages 300. So designers of efficient computational methods should be principally concerned with how the time to perform such a search grows as a function of N. Yet all currently used methods take an amount of time that grows linearly in N; that is, they are O(N) algorithms. This includes not only rigorous methods such as the dynamic programming algorithms mentioned above but also the popular heuristics FASTA and BLASTA. Even the systolic array chips described above do not change this. When a database increases in size by a factor of 1,000, all of these O(N) methods take 1,000 times longer to search that database. Using the timing estimates given above, it follows that while a custom chip may take about 3 seconds to search 10 million amino acids or nucleotides, it will take 3,000 seconds, or about 50 minutes, to search 10 billion symbols. And this is the fastest of the linear methods: BLASTA will take hours, and the Smith-Waterman algorithm will take months. One could resort to massive parallelism, but such machinery is beyond the budget of most investigators, and it is unlikely that speedups due to improvements in hardware technology will keep up with sequencing rates in the next decade. What would be very desirable, if not essential, is to have search methods with computing time sublinear in N, that is, O(Nα) for some α < 1. For example, suppose there is an algorithm that takes O(N0.5 ) time, which is to say that as N grows, the time taken grows as the square root of N. For example, if the algorithm takes about 10 seconds on a 10 million symbol database, then on 10 billion symbols, it will take about 1,0000.5 â 31 times longer, or about 5 minutes. Note that while an O(N0.5) algorithm may be slower than an O(N) algorithm on 10 million symbols, it may be faster on 10 billion. Figure 3.9 illustrates this
SEEING CONSERVED SIGNALS: USING ALGORITHMS TO DETECT SIMILARITIES BETWEEN BIOSEQUENCES 85 "crossover": in this figure, the size of N at which the O(N0.5) algorithm overtakes the O(N) algorithm is approximately l à 108. Similarly, an O(N 00.2) algorithm that takes, say, 15 seconds on 10 million symbols, will take about 1 minute, or only 4 times longer, on 10 billion. To forcefully illustrate the point, we chose to let our examples be slower at N = 10 million than the competing O(N) algorithm. As will be seen in a moment, a sublinear algorithm does exist that is actually already much faster on databases of size 10 million. The other important thing to note is that we are not considering heuristic algorithms here. What we desire is nothing less than algorithms that accomplish exactly the same computational task of complete comparison as the dynamic programming algorithms, but are much faster because the computation is performed in a clever way. A recent result on the approximate string matching problem under the simple unit-cost scheme of Figure 3.1 portends the possibility of truly sublinear algorithms for the general problem. For relatively stringent matches, this new algorithm is 3 to 4 orders of magnitude more efficient Figure 3.9 Sublinear versus linear algorithms.