similar genes or proteins whose functions have already been identified, the researcher might be able to determine the function of the new item or at least make a reasonable guess.
In the simplest cases, data mining might work like this: A genome scientist has a new, unidentified human gene in hand and proceeds to search through genome databases on other species—the mouse, the fruit fly, the worm Caenorhabditis elegans, and so on—looking for known genes with a similar genetic sequence. Different species share many of the same genes; although the sequence of a particular gene might vary from species to species (more for distantly related species than for closely related ones), it is generally feasible to pick out genes in different species that correspond to a particular gene of interest. If a database search turns up such correspondences, the researcher now has solid evidence about what the newly discovered gene might do.
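The cross-species search described above can be illustrated with a minimal sketch. It is not the method used by any real genome database, which relies on far more capable alignment tools; it simply ranks database entries by the fraction of matching bases in the best ungapped overlap with the query. The sequences, the gene names, and the functions `percent_identity` and `search` are all hypothetical and chosen only for illustration.

```python
def percent_identity(a: str, b: str) -> float:
    """Best fraction of matching positions over all ungapped overlaps of a and b."""
    best = 0.0
    shorter = min(len(a), len(b))
    # Slide b across a; offset is the position of b[0] relative to a[0].
    for offset in range(-len(b) + 1, len(a)):
        start_a = max(offset, 0)
        start_b = max(-offset, 0)
        overlap = min(len(a) - start_a, len(b) - start_b)
        matches = sum(a[start_a + i] == b[start_b + i] for i in range(overlap))
        # Normalise by the shorter sequence so short partial overlaps score low.
        best = max(best, matches / shorter)
    return best


def search(query: str, database: dict, threshold: float = 0.6) -> list:
    """Return (name, identity) pairs at or above the threshold, best match first."""
    hits = [(name, percent_identity(query, seq)) for name, seq in database.items()]
    return sorted((h for h in hits if h[1] >= threshold), key=lambda h: -h[1])


# Hypothetical toy data: a new human sequence and three "known" genes.
query = "ATGGCCAAGT"
database = {
    "mouse_geneA": "ATGGCTAAGT",     # closely related species: one mismatch
    "fly_geneA": "ATGACTAAGA",       # more distant species: more mismatches
    "worm_unrelated": "CCCCCTTTTT",  # no meaningful similarity
}

hits = search(query, database)
for name, identity in hits:
    print(f"{name}: {identity:.0%} identical")
```

In this toy data set the mouse sequence ranks above the fly sequence, mirroring the point that sequences diverge more between distantly related species, and the unrelated worm sequence falls below the threshold.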
In reality, the database analysis of genes and proteins has become far more sophisticated than a simple search for “homologues,” or items with similar structures. For instance, Brutlag noted, researchers have developed databases of families of sequences in which each family consists of a group of genes or proteins that have a structure or function in common. When a new gene or protein is found, its discoverer can compare it not just one on one with other individual genes or proteins, but with entire families, looking for one in which it fits. This is a more powerful technique than one-to-one comparisons because it relies on general patterns instead of specific details. Just as an unusual-looking mutt can be identified as a dog even if it cannot be classified as a particular breed, a new protein can often be placed in a family of proteins even if it is not a homologue of any known protein.
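One common way to compare a sequence with a whole family is a position-specific scoring matrix. The sketch below is a generic, simplified version of that idea and should not be read as the method used in any particular database or in Brutlag's laboratory. It assumes a small, already aligned family, a uniform background frequency for the 20 amino acids, and a pseudocount of 1; the family, the test sequences, and the function names `build_pssm` and `score` are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_pssm(aligned: list, pseudocount: float = 1.0) -> list:
    """Build log-odds scores per column from equal-length aligned sequences."""
    background = 1.0 / len(ALPHABET)  # assumption: uniform background
    total = len(aligned) + pseudocount * len(ALPHABET)
    pssm = []
    for col in range(len(aligned[0])):
        counts = Counter(seq[col] for seq in aligned)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return pssm


def score(pssm: list, seq: str) -> float:
    """Sum the column scores for a sequence of the same length as the matrix."""
    return sum(column[aa] for column, aa in zip(pssm, seq))


# Hypothetical family: positions 1, 3, and 4 are conserved; 2 and 5 vary.
family = ["CAHCG", "CSHCG", "CTHCA", "CAHCS"]
pssm = build_pssm(family)

candidate = "CGHCT"  # differs from every member, but keeps the conserved pattern
unrelated = "WWKLM"  # shares nothing with the family

candidate_score = score(pssm, candidate)
unrelated_score = score(pssm, unrelated)
print(f"candidate: {candidate_score:.2f}, unrelated: {unrelated_score:.2f}")
```

The candidate matches no family member exactly, yet it scores well because it preserves the conserved positions. That is the sense in which a family comparison relies on general patterns instead of specific details.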
Researchers have developed a series of databases that can be used to classify genes and proteins, each with a different technique for identifying relationships: sequence motifs, consensus sequences, position-specific scoring matrices, hidden Markov models, and more. “I can hardly keep up with the databases myself,” Brutlag said. With these techniques, researchers can now usually determine what a newly discovered human gene or protein does on the basis of nothing more than the information available in databases, Brutlag said. About a year before the workshop, his group created a database of all known human proteins and their functions. Over the next year, they analyzed each new human protein by using homologues and a technique developed in Brutlag's laboratory called eMATRICES. “Using both methods, we assigned biologic functions to almost 77% of the human proteins.” More than three-fourths of new proteins could be characterized by a technician who never left his computer; although the ultimate test remains experi-