sequences, whereas a hyperbolic line along the top and left of the graph results from sequences that form no discernible clusters. Finally, the graph helps users view the tradeoff between coverage and specificity for various motifs and allows them to select motifs interactively.
Assigning Function to Novel Proteins. The motifs in the IDENTIFY database are particularly valuable for assigning function to newly sequenced proteins, either individually or in large-scale searches. Motifs are well suited to large-scale searching because they can be matched against a database very quickly, and many fast algorithms for regular expression searching exist. In addition, because each motif in the IDENTIFY database is characterized by its specificity, a search using motifs can be tailored to provide maximum sensitivity at a desired level of specificity, thereby minimizing false positives.
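As a rough illustration of how a motif search with a specificity cutoff could work, the following Python sketch first discards motifs whose estimated false-positive level exceeds a chosen threshold and then scans a sequence with ordinary regular expressions. The `Motif` record, the `search` function, the motif names, and the numeric false-positive estimates are hypothetical and are not taken from the IDENTIFY database; only the general idea of regular expression matching combined with a specificity cutoff comes from the text.

```python
import re
from typing import List, NamedTuple, Tuple


class Motif(NamedTuple):
    name: str           # hypothetical identifier
    pattern: str        # regular expression over one-letter amino acid codes
    expected_fp: float  # assumed chance of a spurious match (illustrative value)


def search(sequence: str, motifs: List[Motif],
           max_expected_fp: float) -> List[Tuple[str, int, str]]:
    """Return (motif name, start index, matched text) for sufficiently specific motifs."""
    hits = []
    for motif in motifs:
        # Skip motifs that are less specific than the requested cutoff.
        if motif.expected_fp > max_expected_fp:
            continue
        for match in re.finditer(motif.pattern, sequence):
            hits.append((motif.name, match.start(), match.group()))
    return hits


# Toy motif set: a P-loop-like pattern and a deliberately unspecific pattern.
MOTIFS = [
    Motif("toy_loop", r"[AG].{4}GK[ST]", 1e-7),
    Motif("toy_weak", r"[ILV].G", 1e-1),
]

seq = "MTEYKLVVVGAGGVGKSALTIQLIQ"
strict_hits = search(seq, MOTIFS, max_expected_fp=1e-5)
loose_hits = search(seq, MOTIFS, max_expected_fp=1.0)
print(strict_hits)
```

With the strict cutoff only the specific pattern is tried; relaxing the cutoff admits the unspecific pattern as well, which trades specificity for sensitivity.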
Each motif is also linked to the BLOCKS or PRINTS databases, which describe the family of proteins from which it was derived. Because these protein families typically have several members, a match to a motif may provide an association with several other members of the family. In addition, when a match to a motif is obtained, that motif may be used to search sequence databases, such as SWISSPROT and GenPept, for other proteins that share this motif. This function, which is implemented in IDENTIFY, provides all sequences that may share a closely related form of the motif and thereby represent a particular subfamily containing the motif.
More importantly, most families in the PRINTS and BLOCKS databases are represented by several motifs, each corresponding to a different conserved region of the family. On average, each family has 3–4 conserved regions. The presence of multiple conserved regions increases the sensitivity of a search using motifs. Furthermore, when several independent motifs match a given unknown sequence, the combined matches provide additional certainty about a functional assignment, beyond the statistical estimate of significance for any single motif.
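One simple way to see why several matching motifs add certainty is to assume, purely for illustration, that chance matches to different motifs are statistically independent. Under that assumption the probability that all of them match by chance is the product of the individual probabilities. The function name and the numeric values in this Python sketch are hypothetical, and the independence assumption is a simplification that the text does not state as IDENTIFY's actual scoring method.

```python
import math
from typing import Sequence


def joint_chance_probability(per_motif_probs: Sequence[float]) -> float:
    """Probability that every motif matches by chance, assuming independence."""
    return math.prod(per_motif_probs)


# Two illustrative motifs from the same family, each with a made-up
# probability of matching an unrelated sequence by chance.
single = joint_chance_probability([1e-3])
both = joint_chance_probability([1e-3, 1e-4])
print(single, both)
```

Each additional independent match multiplies the chance probability by a factor smaller than one, so the joint estimate can only stay the same or decrease.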
Motifs, such as those in IDENTIFY, are useful for assigning functions to proteins even in the absence of any homology apart from the limited motif regions. Unlike similarity search methods that weight every position in a sequence alignment to some extent, motifs evaluate only those positions that show conservation in the training set. Hence, motifs can discover function and assign a protein to a family even if that protein is so distantly related that it shows no sequence similarity outside the motifs. This explains why IDENTIFY can assign function to 172 proteins from the yeast genome that have no significant homology to any known protein. The frequency with which IDENTIFY assigns function to these nonhomologous proteins (172/833 = 21%) is somewhat less than the frequency with which IDENTIFY assigns function to the bulk of the yeast proteins (1,621/6,220 = 26%). The ability of motifs to assign function by using only homology at particular positions makes them particularly useful for evaluating newly sequenced genomes such as that of M. jannaschii, most of whose proteins are not homologous to proteins of other organisms.
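The contrast between whole-sequence similarity and position-restricted motif matching can be sketched with a toy example in Python. The second sequence, the pattern, and the helper name `percent_identity` are invented for illustration; the identity measure is a naive ungapped, position-by-position comparison and does not stand in for any real alignment method.

```python
import re


def percent_identity(a: str, b: str) -> float:
    """Fraction of identical residues in two equal-length, ungapped sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Illustrative motif: only the conserved positions are constrained.
MOTIF = re.compile(r"[AG].{4}GK[ST]")

seq_a = "MTEYKLVVVGAGGVGKSALTIQLIQ"
seq_b = "QRPWDFHNCAPDYMGKTWEHRCFNP"  # made-up sequence sharing only the motif

identity = percent_identity(seq_a, seq_b)
a_matches = MOTIF.search(seq_a) is not None
b_matches = MOTIF.search(seq_b) is not None
print(identity, a_matches, b_matches)
```

The two toy sequences agree at only 2 of 25 positions, yet both satisfy the motif, because the pattern ignores every position that it does not explicitly constrain.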
Currently, IDENTIFY assigns function to about 25–30% of novel protein sequences. This limit reflects, among other things, the fraction of newly sequenced proteins that share at least one motif with a current protein family present in the BLOCKS or PRINTS databases. As more genomes are sequenced and more protein families are defined in these databases, IDENTIFY should be able to assign function to a larger fraction of proteins. Despite this current limitation, IDENTIFY is a valuable tool for assignment of function to newly sequenced proteins, especially in those cases where there are no significant sequence similarities by alignment, profile, or hidden Markov methods.
Availability. Access to the EMOTIF and IDENTIFY programs is available over the Internet at http://motif.stanford.edu/emotif and http://motif.stanford.edu/identify. Nonprofit institutions wishing to install the programs locally may send requests to D.L.B. (firstname.lastname@example.org). Commercial and for-profit institutions can license the programs from Pangea Systems Inc. or from Stanford’s Office of Technology Licensing.
This work was supported by a grant from SmithKline Beecham Pharmaceuticals and by Grant LM 05716 from the National Library of Medicine. T.D.W. is a Howard Hughes Medical Institute Physician Postdoctoral Fellow.