
Chapter 3—
Seeing Conserved Signals:
Using Algorithms to Detect Similarities between Biosequences

Eugene W. Myers
University of Arizona

The sequence of amino acids in a protein determines its three-dimensional shape, which in turn confers its function. Segments of the protein that are critical to its function resist evolutionary pressures because mutations of such segments are often lethal to the organism. These critical "active sites" tend to be conserved over time and so can be found in many organisms and proteins that have similar function. Analogously, functionally important segments of an organism's DNA tend to be conserved and to recur as common motifs. In this chapter, the author introduces algorithms for comparing DNA and protein sequences to reveal similar regions. Particular attention is given to the problem of searching a large database of catalogued sequences for regions similar to a newly determined sequence of unknown function.

Since the advent of deoxyribonucleic acid (DNA) sequencing technologies in the late 1970s, the amount of data about the protein and DNA sequence of humans and other organisms has been growing at an exponential rate. It is estimated that by the turn of the century there will be terabytes of such biosequence information, including DNA sequences of entire human chromosomes. Databases of these sequences will contain a wealth of information about the nature of life at the molecular level if we can decipher their meaning.




Proteins and DNA sequences are polymers consisting of a chain of monomers with a common backbone substructure that links them together. In the case of DNA, there are 4 types of monomers, the nucleotides, each having a different side chain. For proteins, there are 20 types of monomers, the amino acids. With just a few exceptions, the sequence of monomers, that is, the primary structure, of a given protein or DNA strand completely determines the three-dimensional shape of the biopolymer. Because the function of a molecule is determined by the position of its atoms in space, this almost perfect correlation between sequence and structure implies that to know the function of a biopolymer, it in principle suffices to know its primary sequence.

The primary sequence of a DNA segment is denoted by a string consisting of the four letters A, C, G, and T. Analogously, the primary sequence of a protein is denoted by a string consisting of 20 letters of the alphabet, one for each type of amino acid. In principle, these strings of symbols encode everything one needs to know about the protein or DNA strand in question. If the primary sequences of two proteins are similar, then it is reasonable to conjecture that they perform the same function. Because DNA's principal role is one of encoding information (including all of an organism's proteins), the similarity of two segments of DNA suggests that they code for similar things.

Mutation in a DNA or protein sequence is a natural evolutionary process. Errors in the replication of DNA can cause a change in the nucleotide at a given position. Less often, a nucleotide is deleted or inserted. If the mutation occurs in a region of DNA that codes for protein, these changes cause related changes in the protein's primary sequence and, hence, in the shape and activity of the protein. The impact of a particular mutation depends on the degree to which the original and new amino acid sequences differ in their physical and chemical properties. Mutations that result in proteins that are so altered that they function improperly or not at all tend to be lethal to the organism. Nature is biased against mutations in those critical regions central to a protein's function and is more lenient toward changes in other regions.

Similarity of DNA sequences is a clue to common evolutionary origin. If two proteins in two organisms evolved from a common precursor, one will generally find highly similar segments, reflecting strongly conserved critical regions. If the proteins are very recent derivatives, one might expect to see similarity over the entire length of the sequences. While
proteins can be similar because of evolution from a common precursor, similarity of protein sequences can also be a clue to common function independent of evolutionary considerations. It appears that nature not only conserves the critical parts of a protein's conformation and function, but also reuses such motifs as modular units in fashioning the spectrum of known proteins. One finds strong similarities between segments of proteins that have similar functions. A strong similarity between the v-sis oncogene and a growth-stimulating hormone was the key to discovering that the v-sis oncogene causes cancer by deregulating cell growth. In that case, the similarity involved the entirety of the sequence. In other cases, functionally related proteins are similar only in segments corresponding to active sites or other functionally critical stretches.

Finding Global Similarities

To illustrate the underlying techniques of sequence comparison, we begin with a simple, core problem: finding the best alignment between the entirety of two sequences. Such an alignment is called a global alignment because it aligns the entire sequences, as opposed to a local alignment, which aligns portions of the sequences. As an example, consider finding the best global alignment of A = ATTACG and B = ATATCG under the following scoring scheme. A letter aligned with the same letter has a score of 1. A letter aligned with any different letter or a gap has a score of 0. The total score is the sum of the scores for the aligned pairs. A matrix depicting this "unit-cost" scoring scheme is shown in Figure 3.1. Under this unit-cost scheme, the score of an alignment is equal to the number of identical aligned characters. The obvious alignment,

    ATTACG
    ATATCG

has a score of 4. However, because gaps are allowed, a higher score can be achieved, namely, 5, which can be shown to be the highest score possible. An optimal alignment, that is, an alignment that achieves this highest score by aligning five symbols, is

    ATTA-CG
    A-TATCG

In some cases there is only one, unique optimal alignment, but in general there can be many. For example,

    AT-TACG
    ATAT-CG

also has a score of 5.
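The unit-cost scoring of such alignments can be checked mechanically. The following Python sketch is illustrative (the function name is not from the chapter); it scores an alignment given as two equal-length rows over the sequence letters and the gap symbol "-":

```python
# Score an alignment under the unit-cost scheme: 1 for a column that
# aligns two identical letters, 0 for a mismatch or a gap.
def unit_cost_score(row_a, row_b):
    assert len(row_a) == len(row_b)
    return sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")

print(unit_cost_score("ATTACG",  "ATATCG"))    # the obvious alignment: 4
print(unit_cost_score("ATTA-CG", "A-TATCG"))   # an optimal alignment: 5
```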
The unit-cost scoring scheme of Figure 3.1 is not the only possible scheme. Later in this chapter, we will see a much more complex scoring scheme used in the comparison of proteins (a 20-letter alphabet). In that scheme and other scoring schemes, the scores in the table are real numbers assigned on the basis of various interpretations of empirical evidence. Let us introduce here a formal framework to assist our thinking.

Figure 3.1 Unit-cost scoring scheme.

Consider comparing sequence A = a1a2···aM and sequence B = b1b2···bN, whose symbols range over some alphabet Σ, for example, Σ = {A,C,G,T} for DNA sequences. Let δ(a,b) be the score for aligning a with b, let δ(a,-) be the score of leaving symbol a unaligned in sequence A, and let δ(-,b) be the score of leaving b unaligned in B. Here a and b range over the symbols in Σ and the gap symbol "-". The score of an alignment is simply the sum of the scores δ assigns to each pair of aligned symbols. For example, the score of

    ATTA-CG
    A-TATCG

is δ(A,A) + δ(T,-) + δ(T,T) + δ(A,A) + δ(-,T) + δ(C,C) + δ(G,G), which for the scoring scheme of Figure 3.1 equals 5. An optimal alignment under a given scoring scheme is an alignment that yields the highest sum.

Visualizing Alignments: Edit Graphs

Many investigators have found it illuminating to convert the problem of finding similarities into one of finding certain paths in an edit graph.
Proceeding formally, the edit graph GA,B for comparing sequences A and B is an edge-labeled directed graph, as illustrated in Figure 3.2 for the example above. The vertices of the graph are arranged in an (M + 1) by (N + 1) rectangular grid or matrix, so that (i, j) designates the vertex in column i and row j (where the numbering starts at 0). The following edges, and only these edges, are in GA,B:

- an A-gap edge from (i - 1, j) to (i, j), labeled (ai, -), for each 1 ≤ i ≤ M and 0 ≤ j ≤ N, which leaves ai unaligned;
- a B-gap edge from (i, j - 1) to (i, j), labeled (-, bj), for each 0 ≤ i ≤ M and 1 ≤ j ≤ N, which leaves bj unaligned; and
- an alignment edge from (i - 1, j - 1) to (i, j), labeled (ai, bj), for each 1 ≤ i ≤ M and 1 ≤ j ≤ N, which aligns ai with bj.

The edit graph has the property that paths and alignments between segments of A and B are in isomorphic correspondence. That is, any path from vertex (g,h) to vertex (i,j) for g ≤ i and h ≤ j models an alignment between the substrings ag+1ag+2···ai and bh+1bh+2···bj, and vice versa. The alignment modeled by a path is the sequence of aligned pairs given by the labels on its edges. For example, in Figure 3.2 the two highlighted paths, both from vertex (0,0) to vertex (6,6), correspond to two optimal global alignments:

    ATTA-CG        AT-TACG
    A-TATCG        ATAT-CG

The Basic Dynamic Programming Algorithm

We now turn to devising an algorithm, or computational procedure, for finding the score of an optimal global alignment between sequences A and B. We focus on computing just the score for the moment, and return to the goal of delivering an alignment achieving that score at the end of this
subsection.

Figure 3.2 GA,B for A = ATTACG and B = ATATCG.

First, observe that, in terms of the edit graph formulation, we seek the score of a maximal-score path from the vertex Q at the upper left-hand corner of the graph GA,B to the vertex F at the lower right-hand corner. Consider computing S(i, j), the score of a maximal-score path from Q to a given vertex (i, j) in the graph. Because there are only three edges directed into vertex (i, j), it follows that any optimal path P to (i, j) must fit one of the following three cases: (1) P is an optimal path to (i - 1, j) followed by the A-gap edge into (i, j); (2) P is an optimal path to (i, j - 1) followed by the B-gap edge into (i, j); or (3) P is an optimal path to
(i - 1, j - 1) followed by the alignment edge into (i, j). It is critical to note that the subpath preceding the last edge of P must also be optimal, for, if it were not, then it is easy to show that P cannot be optimal, a contradiction. This observation immediately leads to the fundamental recurrence

    S(i, j) = max { S(i - 1, j - 1) + δ(ai, bj),
                    S(i - 1, j) + δ(ai, -),
                    S(i, j - 1) + δ(-, bj) },

which states that the maximal score of a path to (i, j) is the largest of (1) the maximal score of a path to (i - 1, j) plus the score of the A-gap edge into (i, j), (2) the maximal score of a path to (i, j - 1) plus the score of the B-gap edge into (i, j), and (3) the maximal score of a path to (i - 1, j - 1) plus the score of the alignment edge into (i, j).

All that is needed to have an effective computational procedure based on this recurrence is to determine an order in which to compute S-values. There are many possible orders. Three simple alternatives are (1) column by column from left to right, top to bottom within each column; (2) row by row from top to bottom, left to right within each row; and (3) antidiagonal by antidiagonal from the upper left to the lower right, in any order within an antidiagonal (antidiagonal k consists of the vertices (i, j) such that i + j = k). Using the first ordering leads to the algorithm of Figure 3.3. In this algorithm, M denotes the length of A and N denotes the length of B. The algorithm computes S(i, j) for every vertex (i, j) in an (M + 1) × (N + 1) matrix in the indicated order of i and j. Along the left and upper boundaries of the edit graph (that is, at vertices with i = 0 or j = 0), the algorithm utilizes the recurrence, except that terms referencing nonexistent vertices are omitted (that is, in lines 3 and 5 of the algorithm, respectively). The algorithm of Figure 3.3 takes O(MN) time; that is, when M and N are sufficiently large, the time taken by the algorithm does not grow faster than the quantity MN. If one stores the whole
(M + 1) × (N + 1) matrix S, then the algorithm also requires O(MN) space.

Figure 3.3 The classical dynamic programming algorithm.

The algorithm of Figure 3.3 is a dynamic programming algorithm that utilizes the fundamental recurrence. Dynamic programming is a general computational paradigm of wide applicability (see, for example, Horowitz and Sahni, 1978). A problem can be solved by dynamic programming if the final answer can be efficiently determined by computing a tableau of optimal answers to progressively larger subproblems. The principle of optimality requires that the optimal answer to a given subproblem be expressible in terms of optimal answers to smaller subproblems. Our basic sequence comparison problem does yield to this principle: the optimal answer S(i, j) for the problem of comparing the prefixes Ai = a1a2···ai and Bj = b1b2···bj can be found by computing optimal answers for smaller prefixes of A and B. The recurrence formula describes the relationship of each subproblem to a larger subproblem.

The algorithm of Figure 3.3 computes only the score of a maximum-scoring global alignment between A and B. One or all of these optimal alignments can be recovered by tracing the paths backwards from
F to Q with the aid of the now-complete matrix S. Specifically, an edge from vertex v1 to F is on an optimal path if S(v1) plus the score of the edge equals S(F). If v1 is on an optimal path, then, in turn, an edge from v2 to v1 is on an optimal path if S(v2) plus the score of the edge equals S(v1). In this way, one can follow an optimal path back to the start vertex Q. In essence, this traceback procedure moves backwards from a vertex to the preceding vertex whose term in the three-way maximum of the recurrence yielded the maximum. The possibility of ties creates the possibility of more than a single optimal path. Unfortunately, this traceback technique for identifying one or more optimal paths requires that the entire matrix S be retained, giving an algorithm that takes O(MN) space as well as time.

A more space-efficient approach to delivering an optimal alignment begins with the observation that if only the score of an optimal alignment is desired, then only the value of S(M, N) is needed, and so S-values can be discarded once they have been used in computing the values that depend on them. Observing that one need only know the previous column in order to compute the next one, it follows that only two columns need be retained at any instant, and so only O(N) space is required. Such a score-only algorithm can be used as a subprocedure in a divide-and-conquer algorithm that determines an optimal alignment using only O(M + N) space. The divide step consists of finding the midpoint of an optimal source-to-sink path by running the score-only algorithm on the first half of B and on the reverse of the second half of B. The conquer step consists of determining the two halves of this path by recursively reapplying the divide step to the two halves. Myers and Miller (1988) have shown this strategy to apply to most comparison algorithms that have linear-space score-only versions.

This refinement is very important, since space, not time, is often the limiting factor in computing optimal alignments between large sequences. For example, two sequences of length 100,000 can be compared in several hours of CPU time, but delivering an optimal alignment with the simple O(MN)-space traceback approach would require 10 billion units of memory. This is well beyond the memory capacity of any conventional machine.
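The column-by-column, score-only computation just described can be sketched in a few lines of Python. The function name and the unit-cost scheme used to exercise it are illustrative choices; a faithful rendering of Figure 3.3 would fill the whole matrix, whereas this sketch retains only two columns at a time:

```python
# Score-only global alignment by dynamic programming, filling the
# (M+1) x (N+1) matrix of S-values column by column but retaining only
# two columns, so the space used is O(N) rather than O(MN).
def global_score(A, B, delta):
    M, N = len(A), len(B)
    prev = [0] * (N + 1)                 # column i - 1 of S
    for j in range(1, N + 1):            # boundary column i = 0
        prev[j] = prev[j - 1] + delta("-", B[j - 1])
    for i in range(1, M + 1):
        cur = [0] * (N + 1)              # column i of S
        cur[0] = prev[0] + delta(A[i - 1], "-")
        for j in range(1, N + 1):
            cur[j] = max(prev[j - 1] + delta(A[i - 1], B[j - 1]),  # alignment edge
                         prev[j] + delta(A[i - 1], "-"),           # A-gap edge
                         cur[j - 1] + delta("-", B[j - 1]))        # B-gap edge
        prev = cur
    return prev[N]

# Exercise it with the unit-cost scheme of Figure 3.1.
unit = lambda a, b: 1 if a == b and "-" not in (a, b) else 0
print(global_score("ATTACG", "ATATCG", unit))   # 5
```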
Finding Local Similarities

We now turn to the problem of finding local alignments, that is, subsegments of A and B that align with maximal score. Local alignments can be visualized as paths in the edit graph GA,B; unlike in the global alignment problem, a path may start and end at any vertices, not just at Q and F.

Intrinsic to determining local similarities is the requirement that the scoring scheme δ be designed with a negative bias. That is, for alignments of unrelated sequences (under some suitable stochastic model of the sequences), the score of a path must on average be negative. If this were not the case, then longer paths would tend to have higher scores, and one would generally end up reporting a global alignment between two sequences as the optimal local alignment. For example, the simple scoring scheme of Figure 3.1 is not negatively biased, whereas the scheme of Figure 3.4 is. Note that under this new scheme the ungapped alignment

    ATTACG
    ATATCG

is now optimal, with score 3.34, whereas the gapped alignment

    ATTA-CG
    A-TATCG

now has the lesser score 3. In this case, the optimal alignment happened to be global, but for longer sequences this is generally not the case. For example, the best local alignment between GAGGTTGCTGAGAA and ACTCTTCTTCCTTA is an alignment of score 4.34 between substrings of the two sequences.

Figure 3.4 A local-alignments scoring scheme.

The design of scoring schemes that properly weigh alignments so as to expose biologically meaningful local similarities is the subject of much investigation. The score of an alignment between protein sequences is the sum of scores assigned to the individual pairs of aligned symbols, just as for
DNA. However, since proteins are represented by combinations of 20 letters and the gap symbol, the table of scores is now 21 × 21. These scores may be chosen by users to fit the notion of similarity they have in mind for the comparison. For example, Dayhoff et al. (1983) compiled statistics on the frequency with which one amino acid mutates into another over a fixed period of time, and from these built a table of aligned-symbol scores consisting of the logarithms of the normalized frequencies. Under Dayhoff's scoring scheme, the score of an alignment is a coarse estimate of the likelihood that one segment has mutated into the other. Figure 3.5 is a scaled integer approximation of Dayhoff's matrix that is much used in practice today.

The basic issue in local alignment, just as in the case of global alignment, is to find a path of maximal score. However, there are more degrees of freedom in the local alignment problem: where the paths begin and where they end is not given a priori but is part of the problem. Note that if we knew the vertex (g, h) at which the best path began, we could find its score and end-vertex by setting S(g, h) to 0 and then applying the fundamental recurrence to all vertices (i, j) for which i ≥ g and j ≥ h. We can capture all potential start vertices simultaneously by modifying the central recurrence so that 0 is a term in the computation of the maximum; that is,

    S(i, j) = max { 0,
                    S(i - 1, j - 1) + δ(ai, bj),
                    S(i - 1, j) + δ(ai, -),
                    S(i, j - 1) + δ(-, bj) }.

Indeed, with this simple modification, S(i, j) is now the score of the highest-scoring path to (i, j) that begins at some vertex (g, h) for which g ≤ i and h ≤ j. The best score of a path in the edit graph is then the maximum, over all vertices in the graph, of their S-values. A vertex achieving this maximum is the end of an optimal path. This basic result is often referred to as the Smith-Waterman algorithm after its inventors (Smith and Waterman, 1981).
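The modified recurrence can be sketched in Python as follows. The particular negatively biased scores used here (match +1, mismatch -0.33, gap -1) are an assumption chosen to be consistent with the scores 3.34 and 3 quoted above, and are not necessarily the exact values of Figure 3.4:

```python
# Smith-Waterman local alignment score: the extra 0 term lets an optimal
# path begin at any vertex, and the best score is the maximum S-value
# anywhere in the matrix, not S(M, N).
# The scoring values are assumed for illustration (see lead-in).
def local_score(A, B, match=1.0, mismatch=-0.33, gap=-1.0):
    M, N = len(A), len(B)
    S = [[0.0] * (N + 1) for _ in range(M + 1)]
    best = 0.0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = match if A[i - 1] == B[j - 1] else mismatch
            S[i][j] = max(0.0,                     # a path may start here
                          S[i - 1][j - 1] + sub,   # alignment edge
                          S[i - 1][j] + gap,       # A-gap edge
                          S[i][j - 1] + gap)       # B-gap edge
            best = max(best, S[i][j])
    return best

print(round(local_score("ATTACG", "ATATCG"), 2))   # 3.34
```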
The beginning of the path, the segments it aligns, and the alignment between these segments can all be delivered in linear space by further extensions of the treatment given above for global alignments. If one uses such a comparison algorithm with the scoring

Given a pattern of size P and a text of size N, there are algorithms that solve the text searching problem in O(P + N), O(PN), and O(PN³) time, depending on whether the pattern is a simple sequence, a regular expression, or a context-free language, respectively.

Fusing the concepts of exact pattern matching and sequence comparison gives rise to the class of approximate pattern matching problems. Given a pattern, a database, a scoring scheme, and a threshold, one seeks all substrings of the database that align to some sequence denoted by the pattern with score better than the threshold. In essence, one is looking for substrings that are within a given similarity neighborhood of an exact match to the pattern. Within this framework, the similarity search problem is an approximate pattern matching problem where the pattern is a simple sequence. We showed earlier that this problem can be solved in O(PN) time. For the case of regular expressions, the approximate match problem can also be solved in O(PN) time (Myers and Miller, 1989), and, for context-free languages, an O(PN³) algorithm is known (Myers, 1994a). While the cost of searching for approximate matches to context-free languages is prohibitive, searching for approximate matches to regular expressions is well within reason and finds application in searching for matches to structural patterns that occur in proteins.

Parallel Computing

The basic problem of comparing sequences has resisted better-than-quadratic, that is, better than O(MN) time, algorithms. This has led several investigators to study the use of parallel computers to achieve greater efficiency. As stated above, the S-matrix can be computed in any order consistent with the data dependencies of the fundamental recurrence. One naturally thinks of a row-by-row or column-by-column evaluation, but we pointed out as a third alternative that one could proceed in order of antidiagonals. Let antidiagonal k be the set of entries {(i, j) : i + j = k}. Note that to compute antidiagonal k, one needs only antidiagonals k - 1 and k - 2. The critical observation for parallel processing is that each entry in an antidiagonal can be computed independently of the other entries in that antidiagonal, a fact not true of the row-by-row and column-by-column
evaluation procedures. For large SIMD (single-instruction, multiple-data) machines, a processor can be assigned to each entry in a fixed antidiagonal, and each computes its result independently of the others. With O(M) processors, each antidiagonal can be computed in constant time, for a total of O(N) elapsed time. Note that the total work, which is the product of the number of processors and the time per processor, is still O(MN). The improvement in time stems from the use of more processors, not from an intrinsically more efficient algorithm.

This observation about antidiagonals has been used to design custom VLSI (very large scale integration) chips configured in what is called a systolic array. The "array" consists of a vector of processors, each of which is identical, performs a dedicated computation, and communicates only with its left and right neighbors, making it easy to lay out physically on a silicon wafer. For sequence comparisons, processor i computes the entries for row i and contains three registers that we will call L(i), V(i), and U(i). At the completion of the kth step, the processors contain antidiagonals k and k - 1 in their V and L registers, respectively, and the characters of B flow through their U registers. That is, L(i)k = S(i, k - i - 1), V(i)k = S(i, k - i), and U(i)k = bk-i, where X(i)k denotes the value of register X(i) at the end of the kth step. It follows from the basic recurrence for S-values that the following recurrences correctly express the values of the registers at the end of step k + 1 in terms of their values at the end of step k:

    L(i)k+1 = V(i)k,
    U(i)k+1 = U(i - 1)k,
    V(i)k+1 = max { L(i - 1)k + δ(ai, U(i - 1)k),
                    V(i - 1)k + δ(ai, -),
                    V(i)k + δ(-, U(i - 1)k) }.

These recurrences reveal that to accomplish step k + 1, processor i - 1 must pass its register values to processor i, and each processor must have just enough hardware to perform three additions and a three-term maximum. Moreover, each processor must have a (2|Σ|+1)-element
memory that can be loaded with the scores δ(ai, σ), δ(-, σ), and δ(ai, -), where σ is any symbol in the underlying alphabet Σ. The beauty of the systolic array is that it can perform comparisons of A against a stream of B sequences, processing each symbol of the target sequences in constant time per symbol. With current technology, chips of this kind operate at rates of 3 million to 4 million symbols per second. A systolic array of 1,000 of these simple processors computes an aggregate of 3 billion to 4 billion dynamic programming entries per second.

Comparing One Sequence Against a Database

The current GENBANK database (Benson et al., 1993) of DNA sequences contains approximately 191 million nucleotides of sequence in about 183,000 sequence entries, and the PIR database (Barker et al., 1993) of protein sequences contains about 21 million amino acids of data in about 71,000 protein entries. Whenever a new DNA or protein sequence is produced in a laboratory, it is now routine practice to search these databases to see if the new sequence shares any similarities with existing entries. In the event that the new sequence is of unknown function, an interesting global or local similarity to an already-studied sequence may suggest possible functions. Thousands of such searches are performed every day.

In the case of protein databases, each entry is for a protein between 100 and 1,500 amino acids long, the average length being about 300. The entries in DNA databases have tended to be for segments of an organism's DNA that are of interest, such as stretches that code for proteins. These segments vary in length from 100 to 10,000 nucleotides. The limited length here is not intrinsic to the object, as in the case of proteins, but reflects limitations in the technology and the cost of obtaining long DNA sequences. In the early 1980s the longest consecutive stretches being sequenced were up to 5,000 nucleotides long. Today the sequences of some viruses of length 50,000 to 100,000 nucleotides have been determined. Ultimately, what we will have is the entire sequence of the DNA in a chromosome (100 million to 10 billion nucleotides), and entries in the database will simply be annotations describing interesting parts of these massive sequences.
A similarity search of a database takes a relatively short query sequence of a protein or DNA fragment and searches every entry in the database for evidence of similarity with the query. In protein databases, the query sequence and the entries are typically of similar sizes. In DNA databases, the entries are typically much longer than the query sequence, and one is looking for subsegments of an entry that match the query.

Heuristic Algorithms

The problem of searching for protein similarities efficiently has led many investigators to abandon dynamic programming algorithms (for which the size of the problem has become too large) and instead to design very fast heuristic procedures: simple, often ad hoc, computational procedures that produce answers that are "nearly" correct with respect to a formally stated optimization criterion. One of the most popular database searching tools of this genre is FASTA (Lipman and Pearson, 1985). FASTA looks for entries that share a significant number of short identical subsequences of symbols with the query sequence. Any entry meeting this criterion is then compared with the query sequence via dynamic programming. In this way, the vast majority of entries are quickly eliminated from consideration. FASTA reports most of the alignments that would be identified by an equivalent dynamic programming computation, but it misses some matches and also reports some spurious matches. On the other hand, FASTA is very fast.

A more recently developed heuristic algorithm is BLASTA (Altschul et al., 1990). BLASTA is faster than FASTA and is capable of detecting biologically meaningful similarities with accuracy comparable to that of FASTA. Given a query A and an entry B, BLASTA searches for segment pairs of high score. A segment pair is a substring from A and a substring from B of equal length, and the score of the pair is that of the no-gap alignment between them. One can argue that the presence of one or more high-scoring segment pairs is evidence of functional similarity between proteins, because insertion and deletion events tend to change the shape of a protein, and hence its function, significantly. Note that segment pairs embody a local similarity concept. What is particularly useful is that there is a formula for the probability that two sequences have
Page 83 a segment pair above a certain score. Thus BLASTA can give an assessment of the statistical significance of any match that it reports. For a given threshold, T, BLASTA returns to the user all database entries that have a segment pair with the query of score greater than T ranked according to probability. BLASTA may miss some such matches, although in practice it misses very few. The central idea used in BLASTA is the notion of a neighborhood. The t-neighborhood of a sequence S is the set of all sequences that align with S with score better than t. In the case of BLASTA, the t -neighborhood of S is exactly those sequences of equal length that form a segment pair of score higher than t under the Dayhoff scoring scheme (see Figure 3.5). This concept suggests a simple strategy for finding all entries that have segment pairs of length k and score greater than t with the query: generate the set of all sequences that are in the t -neighborhood of some k-substring of the query and see if an entry contains one of these strings. Scanning for an exact match to one of the strings in the neighborhood can be performed very efficiently: on the order of 0.5 million characters per second on a 20 SPECint computer. Unfortunately, for the general problem, the length of the segment pair is not known in advance, and even more devastating is the fact that the number of sequences in a neighborhood grows exponentially in both k and t, rendering it impractical for reasonable values of t. To circumvent this difficulty, BLASTA uses the fast scanning strategy above to find short segment pairs of length k above a score t, and then checks each of these to see if they are a portion of a segment pair of score T or greater. This approach is heuristic (that is, may miss some segment pairs) because it is possible for every length k subsegment pair of a segment pair of score T to have score less than t. 
Nonetheless, with k = 4 and t = 17 such misses are very rare, and BLASTA takes about 3 seconds for every 1 million characters of data searched. To get an idea of the relative efficiency of various similarity searching approaches, consider the following rough timing estimates for a typical 20 SPECint workstation and a search against a typical protein query. The dynamic programming algorithm for local similarities presented above (also known as the Smith-Waterman algorithm) takes roughly 1000.0N microseconds to search a database with a total of N characters in it. On the other hand, FASTA takes 20.0N microseconds, and BLASTA only about 2.0N microseconds. At the other end of the spectrum, the systolic array
chip described above takes only 0.3N microseconds to perform the Smith-Waterman algorithm with its special-purpose (and expensive) hardware.

Sublinear Similarity Searches

The total number N of characters of sequence in biosequence databases is growing exponentially. On the other hand, the size of the query sequences is basically fixed; for example, a protein sequence's length is bounded by 1,500 and averages 300. So designers of efficient computational methods should be principally concerned with how the time to perform such a search grows as a function of N. Yet all currently used methods take an amount of time that grows linearly in N; that is, they are O(N) algorithms. This includes not only rigorous methods such as the dynamic programming algorithms mentioned above but also the popular heuristics FASTA and BLASTA. Even the systolic array chips described above do not change this. When a database increases in size by a factor of 1,000, all of these O(N) methods take 1,000 times longer to search that database. Using the timing estimates given above, it follows that while a custom chip may take about 3 seconds to search 10 million amino acids or nucleotides, it will take 3,000 seconds, or about 50 minutes, to search 10 billion symbols. And this is the fastest of the linear methods: BLASTA will take hours, and the Smith-Waterman algorithm will take months. One could resort to massive parallelism, but such machinery is beyond the budget of most investigators, and it is unlikely that speedups due to improvements in hardware technology will keep up with sequencing rates in the next decade.

What would be very desirable, if not essential, is to have search methods with computing time sublinear in N, that is, O(N^a) for some a < 1. For example, suppose there is an algorithm that takes O(N^0.5) time, which is to say that as N grows, the time taken grows as the square root of N.
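The scaling arithmetic behind the running-time examples discussed in this section is simple enough to check directly. A minimal sketch, using the illustrative constants from the text (the function name is my own):

```python
def scaled_time(t_small, n_small, n_large, a):
    """Running time on n_large symbols for an O(N^a) method that takes
    t_small seconds on n_small symbols: cost grows as (n_large/n_small)**a."""
    return t_small * (n_large / n_small) ** a

SMALL, LARGE = 10_000_000, 10_000_000_000  # 10 million vs. 10 billion symbols

chip = scaled_time(3, SMALL, LARGE, 1.0)       # linear custom chip: 3,000 s (~50 min)
sqrt_alg = scaled_time(10, SMALL, LARGE, 0.5)  # O(N^0.5) method: ~316 s (~5 min)
fifth = scaled_time(15, SMALL, LARGE, 0.2)     # O(N^0.2) method: ~60 s (~1 min)
```

The factor separating the two database sizes is 1,000, so the square-root method slows down by only 1,000^0.5 ≈ 31.6 and the O(N^0.2) method by about 4, exactly the behavior the examples below describe.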
For example, if the algorithm takes about 10 seconds on a 10 million symbol database, then on 10 billion symbols it will take about 1,000^0.5 ≈ 31 times longer, or about 5 minutes. Note that while an O(N^0.5) algorithm may be slower than an O(N) algorithm on 10 million symbols, it may be faster on 10 billion. Figure 3.9 illustrates this
"crossover": in this figure, the size of N at which the O(N^0.5) algorithm overtakes the O(N) algorithm is approximately 1 × 10^8. Similarly, an O(N^0.2) algorithm that takes, say, 15 seconds on 10 million symbols will take about 1 minute, or only 4 times longer, on 10 billion. To forcefully illustrate the point, we chose to let our examples be slower at N = 10 million than the competing O(N) algorithm. As will be seen in a moment, a sublinear algorithm does exist that is actually already much faster on databases of size 10 million.

The other important thing to note is that we are not considering heuristic algorithms here. What we desire is nothing less than algorithms that accomplish exactly the same computational task of complete comparison as the dynamic programming algorithms, but are much faster because the computation is performed in a clever way. A recent result on the approximate string matching problem under the simple unit-cost scheme of Figure 3.1 portends the possibility of truly sublinear algorithms for the general problem.

Figure 3.9 Sublinear versus linear algorithms.

For relatively stringent matches, this new algorithm is 3 to 4 orders of magnitude more efficient
than the equivalent dynamic programming computation on a database of 1 million characters. On the other hand, the approximate string matching problem is a special case of the more biologically relevant computation that involves more general scoring schemes such as the ones in Figures 3.4 and 3.5, and a sublinear algorithm for the general problem has yet to be achieved.

We conclude with a few more details on this sublinear algorithm (Myers, 1994b). For a search for matches to a query of length P with D or fewer differences, the quantity e = D/P is the maximum fraction of differences permitted per unit length of the query and is called the mismatch ratio. Searching for such an approximate string match over a database of length N can be accomplished in O(D N^pow(e) log N) expected time with the new algorithm. The exponent pow(e) is an increasing and concave function of e that is 0 when e = 0 and depends on the size |Y| of the underlying alphabet. The algorithm is superior to the O(N) algorithms and truly sublinear in N when e is small enough to guarantee that pow(e) < 1. For example, pow(e) is less than 1 when e < 33 percent for |Y| = 4 (the DNA alphabet) and when e < 56 percent for |Y| = 20 (the protein alphabet). More specifically, pow(e) ≤ 0.22 + 2.3e when |Y| = 4 and pow(e) ≤ 0.17 + 1.4e when |Y| = 20. So, for DNA, the algorithm takes a maximum of O(N^0.5) time when e is 12 percent, and, for proteins, a maximum of O(N^0.5) time when e is 26 percent. The logic used to prove these bounds is coarse, and, in practice, the performance of these methods is much better than the bounds indicate. If these results can be extended to handle the more general problem of arbitrary scoring tables, the impact on the field could be great.

Open Problems

While progress is continually being made on existing problems in sequence comparison, new problems continue to arise. A fundamental issue is the definition of similarity.
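(As an aside, the pow(e) bounds just quoted are easy to check numerically. The sketch below encodes the linear upper bounds stated above; the exact curve from Myers (1994b) lies at or below these lines, and the variable names are my own.)

```python
def pow_bound(e, alphabet_size):
    """Linear upper bounds on the exponent pow(e) as stated in the text;
    the exact concave curve lies at or below these lines."""
    if alphabet_size == 4:    # DNA alphabet
        return 0.22 + 2.3 * e
    if alphabet_size == 20:   # protein alphabet
        return 0.17 + 1.4 * e
    raise ValueError("bounds quoted only for alphabet sizes 4 and 20")

# Mismatch ratio at which each bound reaches 0.5, certifying O(N^0.5) time.
e_dna = (0.5 - 0.22) / 2.3      # ~0.12, the 12 percent quoted for DNA
e_protein = (0.5 - 0.17) / 1.4  # ~0.24 from the bound; the exact (lower)
                                # curve extends this to the 26 percent quoted

# Largest e at which the DNA bound still certifies sublinearity, pow(e) < 1.
e_dna_sublinear = (1.0 - 0.22) / 2.3  # ~0.34, close to the 33 percent quoted
```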
We have focused here only on the insertion-deletion-substitution model of comparison and some small variations. Some authors (e.g., Altschul and Erikson, 1986) have looked at
nonadditive scoring schemes that are intended to reflect the probability of finding a given alignment by chance. A fundamental change in the optimization criterion for alignment creates a new set of algorithmic problems.

What about fundamentally speeding up sequence comparisons? The best lower bound placed on algorithms for comparing two sequences of length N is O(N log N), yet the fastest algorithm takes O(N^2 / log^2 N) time (Masek and Paterson, 1980). Can this gap be narrowed, either from above (finding faster algorithms) or below (finding higher lower bounds)? Can we perform faster database searches for the case of generalized Levenshtein scores, as is suggested by the results given above for the approximate string matching problem? Speeding up database searches is very important. Are there other effective ways to parallelize such searches or to exploit preprocessing of the databases, such as an index?

Biologists are interested in searching databases for patterns other than given strings or regular expressions. Recently, fast algorithms have been developed (Kannan and Myers, 1993; Landau and Schmidt, 1993) for finding approximate repeats, for example, finding a pattern that matches some string X and then, 5 to 10 symbols to the right, matches the same string modulo 5 percent differences. Many DNA structures are induced by base pairing, and finding them can be viewed as finding approximate palindromes separated by a given range of spacing. More intricate patterns for protein motifs and secondary structure are suggested by the systems QUEST (Arbarbanel et al., 1984), ARIADNE (Lathrop et al., 1987), and ANREP (Mehldau and Myers, 1993), all of which pose problems that could use algorithmic refinement.

Finally, biologists compare objects other than sequences. For example, the partial sequence information of a restriction map can be viewed as a string on which one has placed a large number of beads of, say, eight colors, at various positions along the string.
Given two such maps, are they similar? This problem has been examined by several authors (e.g., Miller et al., 1990). There are still fundamental questions as to what the measure of similarity should be and how to design efficient algorithms for each choice of measure. There has also been work on comparing phylogenetic trees and chromosome staining patterns (e.g., Zhang and Shasha, 1989). Indubitably the list will continue to grow.
References

Altschul, S.F., and B.W. Erikson, 1986, "Locally optimal subalignments using nonlinear similarity functions," Bull. Math. Biol. 48(5/6), 633-660.
Altschul, S.F., W. Gish, W. Miller, E.W. Myers, and D.J. Lipman, 1990, "Basic local alignment search tool," Journal of Molecular Biology 215, 403-410.
Arbarbanel, R.M., P.R. Wieneke, E. Mansfield, D.A. Jaffe, and D.L. Brutlag, 1984, "Rapid searches for complex patterns in biological molecules," Nucleic Acids Research 12(1), 263-280.
Barker, W.C., D. George, H.-W. Mewes, F. Pfeiffer, and A. Tsugita, 1993, "The PIR-International database," Nucleic Acids Research 21(13), 3089-3092.
Benson, D., D.J. Lipman, and J. Ostell, 1993, "GenBank," Nucleic Acids Research 21(13), 2963-2965.
Chao, K.-M., and W. Miller, 1994, "Linear-space algorithms that build local alignments from fragments," Algorithmica, in press.
Dayhoff, M.O., W.C. Barker, and L.T. Hunt, 1983, "Establishing homologies in protein sequences," Methods in Enzymology 91, 524-545.
Eppstein, D., Z. Galil, and R. Giancarlo, 1989, "Speeding up dynamic programming," Theoretical Computer Science 64, 107-118.
Feng, D.F., and R.F. Doolittle, 1987, "Progressive sequence alignment as a prerequisite to correct phylogenetic trees," Journal of Molecular Evolution 25, 351-360.
Fox, B., 1973, "Calculating the Kth shortest paths," INFOR J. (Can. J. Oper. Res. Inf. Process.) 11, 66-70.
Garey, M.R., and D.S. Johnson, 1979, Computers and Intractability: A Guide to the Theory of NP-Completeness, New York: W.H. Freeman.
Gotoh, O., 1982, "An improved algorithm for matching biological sequences," Journal of Molecular Biology 162, 705-708.
Horowitz, E., and S. Sahni, 1978, pp. 198-247 in Fundamentals of Computer Algorithms, New York: Computer Science Press.
Kannan, S.K., and E.W. Myers, 1993, "An algorithm for locating non-overlapping regions of maximum alignment score," Proceedings of the 4th Combinatorial Pattern Matching Conference, Springer-Verlag Lecture Notes in Computer Science 684, 74-86.
Landau, G.M., and J.P. Schmidt, 1993, "An algorithm for approximate tandem repeats," Proceedings of the 4th Combinatorial Pattern Matching Conference, Springer-Verlag Lecture Notes in Computer Science 684, 120-133.
Lathrop, R.H., T.A. Webster, and T.F. Smith, 1987, "ARIADNE: A flexible framework for protein structure recognition," Commun. ACM 30, 909-921.
Lipman, D.J., and W.R. Pearson, 1985, "Rapid and sensitive protein similarity searches," Science 227, 1435-1441.
Masek, W.J., and M.S. Paterson, 1980, "A faster algorithm computing string edit distances," Journal of Computer and System Sciences 20(1), 18-31.
Mehldau, G., and E.W. Myers, 1993, "A system for pattern matching applications on biosequences," CABIOS 9(3), 299-314.
Miller, W., and E.W. Myers, 1988, "Sequence comparison with concave weighting functions," Bull. Math. Biol. 50(2), 97-120.
Miller, W., J. Ostell, and K.E. Rudd, 1990, "An algorithm for searching restriction maps," CABIOS 6, 247-252.
Myers, E.W., 1994a, "Approximately Matching Context-Free Languages," TR94-22, Department of Computer Science, University of Arizona, Tucson, Ariz.
Myers, E.W., 1994b, "A sublinear algorithm for approximate keyword searching," Algorithmica 12(4), 345-374.
Myers, E.W., and W. Miller, 1988, "Optimal alignments in linear space," CABIOS 4(1), 11-17.
Myers, E.W., and W. Miller, 1989, "Approximate matching of regular expressions," Bull. Math. Biol. 51(1), 5-37.
Sankoff, D., 1975, "Minimal mutation trees of sequences," SIAM Journal of Applied Mathematics 28(1), 35-42.
Smith, T.F., and M.S. Waterman, 1981, "Identification of common molecular subsequences," Journal of Molecular Biology 147, 195-197.
Smith, T.F., and W.M. Fitch, 1983, "Optimal sequence alignments," Proceedings of the National Academy of Sciences USA 80, 1382-1386.
Waterman, M.S., and M. Eggert, 1987, "A new algorithm for best subsequence alignments with application to tRNA-rRNA comparisons," Journal of Molecular Biology 197, 723-728.
Waterman, M.S., M. Eggert, and E. Lander, 1992, "Parametric sequence comparisons," Proceedings of the National Academy of Sciences USA 89, 6090-6093.
Waterman, M.S., T.F. Smith, and W.A. Beyer, 1976, "Some biological sequence metrics," Advances in Mathematics 20, 367-387.
Zhang, K., and D. Shasha, 1989, "Simple fast algorithms for the editing distance between trees and related problems," SIAM Journal on Computing 18, 1245-1262.