Below is the uncorrected machine-read text of this chapter, intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text of each book. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.
SEEING CONSERVED SIGNALS: USING ALGORITHMS TO DETECT SIMILARITIES BETWEEN BIOSEQUENCES 60 Proceeding formally, the edit graph GA,B for comparing sequences A and B is an edge-labeled directed graph, as illustrated in Figure 3.2 for the example mentioned above. The vertices of the graph are arranged in an M + 1 by N +1 rectangular grid or matrix, so that (i, j) designates the vertex in column i and row j (where the numbering starts at 0). The following edges, and only these edges, are in GA,B: The edit graph has the property that paths and alignments between segments of A and B are in isomorphic correspondence. That is, any path from vertex (g,h) to vertex (i,j) for g ⤠i and h ⤠j models an alignment between the substrings ag+1ag+2··· ai and bh+1bh+2 ··· bj, and vice versa. The alignment modeled by a path is the sequence of aligned pairs given by labels on its edges. For example, in Figure 3.2 the two highlighted paths, both from vertex (0,0) to (6,6) correspond to the two optimal global alignments and . The Basic Dynamic Programming Algorithm We now turn to devising an algorithm, or computational procedure, for finding the score of an optimal global alignment between sequences A and B. We focus on computing just the score for the moment, and return to the goal of delivering an alignment achieving that score at the end of this
SEEING CONSERVED SIGNALS: USING ALGORITHMS TO DETECT SIMILARITIES BETWEEN BIOSEQUENCES 61 subsection. First, observe that, in terms of the edit graph formulation, we seek the score of a maximal-score path from the vertex at the upper left-hand corner of the graph GA,B to the vertex Φ at the lower right-hand corner. Figure 3.2 GA,B for A = ATTACG and B = ATATCG. Consider computing S(i, j), the score of a maximal-score path from to some given vertex (i, j) in the graph. Because there are only three edges directed into vertex (i, j) it follows that any optimal path P to (i, j) must fit one of the following three cases: (1) P is an optimal path to (i â 1, j) followed by the A-gap edge into (i, j); (2) P is an optimal path to (i, j â 1) followed by the B-gap edge into (i, j); or (3) P is an optimal path to
SEEING CONSERVED SIGNALS: USING ALGORITHMS TO DETECT SIMILARITIES BETWEEN BIOSEQUENCES 62 (i â 1, j â 1) followed by the alignment edge into (i, j). It is critical to note that the subpath preceding the last edge of P must also be optimal, for, if it is not, then it is easy to show that P cannot be optimal, a contradiction. This observation immediately leads to the fundamental recurrence: S(i, j) = max {S(i â 1, j â 1) + d (ai,bj), S(i â 1, j) + δ (ai,â), S(i, j â 1) + δ (â,bj)}, which states that the maximal score of a path to (i, j) is the larger of (1) the maximal score of a path to (i â1, j) plus the score of the A-gap edge to (i, j), (2) the maximal score of a path to (i, j â 1) plus the score of the B-gap edge to (i, j), or (3) the maximal score of a path to (i â 1, j â 1) plus the score of the alignment edge to (i, j). All that is needed to have an effective computational procedure based on this recurrence is to determine an order in which to compute S-values. There are many possible orders. Three simple alternatives are (1) column by column from left to right, top to bottom in each column, (2) row by row from top to bottom, left to right in each row, and (3) antidiagonal by antidiagonal from the upper left to the lower right, in any order within an antidiagonal (antidiagonal k consists of the vertices (i, j) such that (i + j = k ). Using the first sample ordering leads to the algorithm of Figure 3.3. In this algorithm, M denotes the length of A and N denotes the length of B. The algorithm of Figure 3.3 computes S(i, j) for every vertex (i, j) in an (M + 1) à (N + 1) matrix in the indicated order of i and j. Along the left and upper boundaries of the edit graph (that is, vertices with i = 0 or j = 0, respectively), the algorithm utilizes the recurrence, except that terms referencing nonexistent vertices are omitted (that is, in lines 3 and 5, respectively). The algorithm of Figure 3.3 takes O(MN) time; that is, when M and N are sufficiently large, the time taken by the algorithm does not grow faster than the quantity MN. If one stores the whole
SEEING CONSERVED SIGNALS: USING ALGORITHMS TO DETECT SIMILARITIES BETWEEN BIOSEQUENCES 63 (M + 1) à (N + 1) matrix S, then the algorithm also requires O(MN) space. Figure 3.3 The classical dynamic programming algorithm. The algorithm of Figure 3.3 is a dynamic programming algorithm that utilizes the fundamental recurrence. Dynamic programming is a general computational paradigm of wide applicability (see, for example, Horowitz and Sahni, 1978). A problem can be solved by dynamic programming if the final answer can be efficiently determined by computing a tableau of optimal answers to progressively larger and larger subproblems. The principle of optimality requires that the optimal answer to a given subproblem be expressible in terms of optimal answers to smaller subproblems. Our basic sequence comparison problem does yield to this principle: the optimal answer S(i, j) for the problem of comparing prefix Ai = a1a2â¦a i and prefix Bj = b1b2. . .bj can be found by computing optimal answers for smaller prefixes of A and B. The recurrence formula describes the relationship of each subproblem to a larger subproblem. The algorithm of Figure 3.3 computes only the score of a maximum-scoring global alignment between A and B. One or all of these optimal alignments can be recovered by tracing the paths backwards from
SEEING CONSERVED SIGNALS: USING ALGORITHMS TO DETECT SIMILARITIES BETWEEN BIOSEQUENCES 64 Φ to with the aid of the now complete matrix S. Specifically, an edge from vertex v1 to Φ is on an optimal path if S(v1) plus the score of its edge equals S(Φ). If v1 is on an optimal path, then, in turn, an edge from v2 to v1 is on an optimal path if S(v2) plus the score of the edge equals S(v1). In this way, one can follow an optimal path back to the start vertex . In essence, this traceback procedure moves backwards from a vertex to the preceding vertex whose term in the three-way maximum of the recurrence yielded the maximum. The possibility of ties creates the possibility of more than a single optimal path. Unfortunately, this traceback technique for identifying one or more optimal paths requires that the entire matrix S be retained, giving an algorithm that takes O(MN) space as well as time. A more space-efficient approach to delivering an optimal alignment begins with the observation that if only the score of an optimal alignment is desired then only the value of S(M, N) is needed, and so S-values can be discarded once they have been used in computing the values that depend on them. Observing that one need only know the previous column in order to compute the next one, it follows that only two columns need be retained at any instance, and so only O(N) space is required. Such a score-only algorithm can be used as a subprocedure in a divide-and-conquer algorithm that determines an optimal alignment using only O(M + N) space. The divide step consists of finding the midpoint of an optimal source-to-sink path by running the score-only algorithm on the first half of B and the reverse of the second half of B. The conquer step consists of determining the two halves of this path by recursively reapplying the divide step to the two halves. Myers and Miller (1988) have shown this strategy to apply to most comparison algorithms that have linear-space score-only algorithms. This refinement is very important, since space, not time, is often the limiting factor in computing optimal alignments between large sequences. For example, two sequences of length 100,000 can be compared in several hours of CPU time, but would require 10 billion units of memory if optimal alignments were delivered using the simple O(MN) space traceback approach. This is well beyond the memory capacity of any conventional machine.