National Academies Press: OpenBook
« Previous: Sublinear Similarity Searches
Suggested Citation:"OPEN PROBLEMS." National Research Council. 1995. Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology. Washington, DC: The National Academies Press. doi: 10.17226/2121.
×
Page 86
Suggested Citation:"OPEN PROBLEMS." National Research Council. 1995. Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology. Washington, DC: The National Academies Press. doi: 10.17226/2121.
×
Page 87

Below is the uncorrected machine-read text of this chapter, intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text of each book. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.

SEEING CONSERVED SIGNALS: USING ALGORITHMS TO DETECT SIMILARITIES BETWEEN BIOSEQUENCES 86 than the equivalent dynamic programming computation on a database of 1 million characters. On the other hand, the approximate string matching problem is a special case of the more biologically relevant computation that involves more general scoring schemes such as the ones in Figures 3.4 and 3.5, and a sublinear algorithm for the general problem has yet to be achieved. We conclude with a few more details on this sublinear algorithm (Myers, 1994b). For a search of matches to a query of length P with D or fewer differences, the quantity ε = D/P is the maximum fraction of differences permitted per unit length of the query and is called the mismatch ratio. Searching for such an approximate string match over a database of length N can be accomplished in O(DNpow(ε) log N) expected time with the new algorithm. The exponent is an increasing and concave function of ε that is 0 when ε = 0 and depends on the size |ψ| of the underlying alphabet. The algorithm is superior to the O(N) algorithms and truly sublinear in N when ε is small enough to guarantee that pow(ε) < 1. For example, pow(ε) is less than 1 when ε < 33 percent for |ψ| = 4 (DNA alphabet) and when ε < 56 percent for |ψ| = 20 (protein alphabet). More specifically, pow(ε) ≤ 0.22 + 2.3 ε when |ψ| = 4 and pow(ε) ≤ 0.17 + 1.4ε when |ψ| = 20. So, for DNA, the algorithm takes a maximum of O(N0.5) time when ε is 12 percent, and for proteins, a maximum of O(N0.5) time when ε is 26 percent. The logic used to prove these bounds is coarse, and, in practice, the performance of these methods is much better than the bounds indicate. If these results can be extended to handle the more general problem of arbitrary scoring tables, the impact on the field could be great. OPEN PROBLEMS While progress is continually being made on existing problems in sequence comparison, new problems continue to arise. A fundamental issue is the definition of similarity. We have focused here only on the insertion- deletion-substitution model of comparison and some small variations. Some authors (e.g., Altshul and Erikson, 1986) have looked at

SEEING CONSERVED SIGNALS: USING ALGORITHMS TO DETECT SIMILARITIES BETWEEN BIOSEQUENCES 87 nonadditive scoring schemes that are intended to reflect the probability of finding a given alignment by chance. A fundamental change in the optimization criterion for alignment creates a new set of algorithmic problems. What about fundamentally speeding up sequence comparisons? The best lower bounds placed on algorithms for comparing two sequences of length N is O(N log N), yet the fastest algorithm takes O(N2 / log2 N) time (Masek and Paterson, 1980). Can this gap be narrowed, either from above (finding faster algorithms) or below (finding lower bounds that are higher)? Can we perform faster database searches for the case of generalized Levenshtein scores, as is suggested by the results given above for the approximate string matching problem? Speeding up database searches is very important. Are there other effective ways to parallelize such searches or to exploit preprocessing of the databases, such as an index? Biologists are interested in searching databases for patterns other than given strings or regular expressions. Recently, fast algorithms have been developed (Kannan and Myers, 1993; Landau and Schmidt, 1993) for finding approximate repeats, for example, finding a pattern that matches some string X and then 5 to 10 symbols to the right matches the same string modulo 5 percent differences. Many DNA structures are induced by forming base pairs that can be viewed as finding approximate palindromes separated by a given range of spacing. More intricate patterns for protein motifs and secondary structure are suggested by the systems QUEST (Arbarbanel et al., 1984), ARIADNE (Lathrop et al., 1987), and ANREP (Mehldau and Myers, 1993), all of which pose problems that could use algorithmic refinement. Finally, biologists compare objects other than sequences. For example, the partial sequence information of a restriction map can be viewed as a string on which one has placed a large number of beads of, say, eight colors, at various positions along the string. Given two such maps, are they similar? This problem has been examined by several authors (e.g., Miller et al., 1990). There are still fundamental questions as to what the measure of similarity should be and how to design efficient algorithms for each. There has also been work on comparing phylogenetic trees and chromosome staining patterns (e.g., Zhang and Shasha, 1989). Indubitably the list will continue to grow.

Next: REFERENCES »
Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology Get This Book
×
 Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology
Buy Paperback | $80.00
MyNAP members save 10% online.
Login or Register to save!
Download Free PDF

As researchers have pursued biology's secrets to the molecular level, mathematical and computer sciences have played an increasingly important role—in genome mapping, population genetics, and even the controversial search for "Eve," hypothetical mother of the human race.

In this first-ever survey of the partnership between the two fields, leading experts look at how mathematical research and methods have made possible important discoveries in biology.

The volume explores how differential geometry, topology, and differential mechanics have allowed researchers to "wind" and "unwind" DNA's double helix to understand the phenomenon of supercoiling. It explains how mathematical tools are revealing the workings of enzymes and proteins. And it describes how mathematicians are detecting echoes from the origin of life by applying stochastic and statistical theory to the study of DNA sequences.

This informative and motivational book will be of interest to researchers, research administrators, and educators and students in mathematics, computer sciences, and biology.

READ FREE ONLINE

  1. ×

    Welcome to OpenBook!

    You're looking at OpenBook, NAP.edu's online reading room since 1999. Based on feedback from you, our users, we've made some improvements that make it easier than ever to read thousands of publications on our website.

    Do you want to take a quick tour of the OpenBook's features?

    No Thanks Take a Tour »
  2. ×

    Show this book's table of contents, where you can jump to any chapter by name.

    « Back Next »
  3. ×

    ...or use these buttons to go back to the previous chapter or skip to the next one.

    « Back Next »
  4. ×

    Jump up to the previous page or down to the next one. Also, you can type in a page number and press Enter to go directly to that page in the book.

    « Back Next »
  5. ×

    To search the entire text of this book, type in your search term here and press Enter.

    « Back Next »
  6. ×

    Share a link to this book page on your preferred social network or via email.

    « Back Next »
  7. ×

    View our suggested citation for this chapter.

    « Back Next »
  8. ×

    Ready to take your reading offline? Click here to buy this book in print or download it as a free PDF, if available.

    « Back Next »
Stay Connected!