One problem is simply finding relevant data in a sea of information, Karp said. “If there are 500 databases out there, at least, how do we know which ones to go to, to answer a question of interest?” Fortunately for biologists, some locator help is available, noted Douglas Brutlag, professor of biochemistry and medicine at Stanford University. A variety of database lists are available, such as the one published in the Nucleic Acids Research supplemental edition each January, and researchers will find the large national and international databases, such as NCBI, EBI, DDBJ, and SWISS-PROT, to be good places to start their search. “They often have pointers to where the databases are,” Brutlag noted. Relevant data will more than likely come from a number of different databases, he added. “To do a complete search, you need to know probably several databases. Just handling one isn't sufficient to answer a biologic question.” The reason lies in the growing integration of biology, Karp said. “Many databases are organized around a single type of experimental data, be it nucleotide-sequence data or protein-structure data, yet many questions of interest can be answered only by integrating across multiple databases, by combining information from many sources.”
The potential of such integration is perhaps the most intriguing thing about the growth of biologic databases. Integration holds the promise of fundamentally transforming how biologic research is done, allowing researchers to synthesize information and make connections among many types of experiments in ways that have never before been possible; but it also poses the most difficult challenge to those who develop and use the databases. “The problem,” Karp explained, “is that interaction with a collection of databases should be as seamless as interaction with any single member of the collection. We would like users to be able to browse a whole collection of databases or to submit complex queries and analytic computations to a whole collection of databases as easily as they can now for a single database.” But integrating databases in this way has proved exceptionally difficult because the databases are so different.
“We have many disciplines, many subfields,” said Gio Wiederhold, of Stanford University's Computer Science Department, “and they are autonomous—and must remain autonomous—to set their own standards of quality and make progress in their own areas. We can't do without that heterogeneity.” At the same time, however, “the heterogeneity that we find in all the sources inhibits integration.” The result is what computer scientists call “the interoperability problem,” which is actually not a single difficulty, but rather a group of related problems that arise when researchers attempt to work with multiple databases. More generally, the problem arises when different kinds of software are to be used in an integrated manner.
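To make the heterogeneity concrete, the following is a minimal sketch, not a description of any real system discussed at the workshop. It assumes two hypothetical in-memory sources (`SEQUENCE_DB` and `STRUCTURE_DB`) that describe the same gene under different field names and naming conventions, and it uses small adapter functions (also hypothetical) to map each source into a common form so that one query can span both.

```python
# Two hypothetical sources. Each keeps its own schema and naming style,
# which is the heterogeneity that makes integration hard.
SEQUENCE_DB = [
    {"accession": "X0001", "gene": "abcA", "seq": "ATGGCC"},
    {"accession": "X0002", "gene": "xyzB", "seq": "ATGTTT"},
]
STRUCTURE_DB = [
    {"id": "S-17", "gene_name": "ABCA", "fold": "alpha/beta"},
]


def from_sequence_db(rec):
    """Map a sequence record into the shared (assumed) common form."""
    return {
        "gene": rec["gene"].lower(),
        "source": "sequence_db",
        "accession": rec["accession"],
        "sequence": rec["seq"],
    }


def from_structure_db(rec):
    """Map a structure record into the shared (assumed) common form."""
    return {
        "gene": rec["gene_name"].lower(),
        "source": "structure_db",
        "structure_id": rec["id"],
        "fold": rec["fold"],
    }


# Each source stays autonomous; only its adapter knows its schema.
SOURCES = [
    (SEQUENCE_DB, from_sequence_db),
    (STRUCTURE_DB, from_structure_db),
]


def query_all(gene):
    """Answer one question by combining matching records from every source."""
    key = gene.lower()
    merged = {"gene": key, "sources": []}
    for records, adapt in SOURCES:
        for rec in records:
            common = adapt(rec)
            if common["gene"] == key:
                merged["sources"].append(common.pop("source"))
                merged.update(common)
    return merged


if __name__ == "__main__":
    print(query_all("abcA"))
```

In this sketch the sources keep their own formats, and the cost of integration moves into the adapters. Real databases differ in far more than field names (identifiers, semantics, query interfaces), so this illustrates only the simplest layer of the problem.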