values minus gap penalties) perform much better than those based on the traditional measure of sequence similarity, percentage identity. This parallels the poor performance of our structural alignment statistics based on the traditional rms. It is disconcerting that well-established and intuitive measures such as percentage identity or rms perform so much worse than the statistical measures based on the sequence or structure alignment scores.
Furthermore, it is surprising that over half of the relationships between distant homologues in scop were not statistically significant (at a rate of 1% error per query) using either pure sequence comparison or pure structure comparison. Almost all of the pairs found by sequence comparison were found by structure comparison, but there were many pairs found by structure comparison that were not found by sequence comparison. Overall, structural comparison was able to detect about twice as many of the scop distant homology superfamily pairs as sequence comparison (at the same rate of error).
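The coverage comparison at a fixed error rate can be illustrated with a minimal sketch. It is not the code used in this work. The function name `coverage_at_error_rate`, the `(score, is_true_pair)` data layout, and the convention that lower scores are more significant are all assumptions made for illustration, and tied scores are handled naively.

```python
from typing import List, Tuple


def coverage_at_error_rate(
    pairs: List[Tuple[float, bool]],
    n_queries: int,
    max_errors_per_query: float = 0.01,
) -> Tuple[int, float]:
    """Count true pairs found before false positives exceed the allowed rate.

    pairs: hypothetical list of (score, is_true_pair); a lower score means a
           more significant match (E-value-like). This layout is an assumption.
    n_queries: number of query proteins used to normalize the error count.
    Returns (number of true pairs found, fraction of all true pairs found).
    """
    # Rank all pairs from most to least significant.
    ranked = sorted(pairs, key=lambda p: p[0])
    total_true = sum(1 for _, is_true in ranked if is_true)
    # Number of false positives tolerated at the requested error rate.
    allowed_false = max_errors_per_query * n_queries

    found = 0
    false_pos = 0
    for _, is_true in ranked:
        if is_true:
            found += 1
        else:
            false_pos += 1
            if false_pos > allowed_false:
                break  # error budget exhausted; stop accepting pairs

    coverage = found / total_true if total_true else 0.0
    return found, coverage


if __name__ == "__main__":
    # Toy data only; these scores and labels are invented for the example.
    toy = [(1e-10, True), (1e-5, True), (1e-3, False),
           (1e-2, True), (0.5, False), (1.0, True)]
    print(coverage_at_error_rate(toy, n_queries=100))
```

Running the same function on the ranked pairs from a sequence method and from a structure method, with the same `n_queries` and error rate, gives the kind of side-by-side coverage figures discussed above.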
Future Directions. The approach we have used to derive statistical significance could easily be generalized to other contexts. In particular, it can be adapted to provide significance statistics for threading. We have not presented a detailed examination of the significance values for specific pairs of sequences or structures. Such an examination could prove useful in the future, particularly if it focused on pairs of proteins with the same fold but insignificant E values and those with different folds but significant E values. These two classes of pairs characterize the twilight zone for structure, which has yet to be described fully.
We thank S. E. Brenner for carefully reading the manuscript and S. E. Brenner and T. Hubbard for providing the pdb40d-1.32 database. M.G. acknowledges the National Science Foundation for support (Grant DBI–9723182), and M.L. acknowledges the Department of Energy (Grant DE-FG03–95ER62135).