the catalogues of the culture collections one by one to get an answer. Collecting online information from autonomous and heterogeneous data providers is the sort of job a web spider does, so we decided to look into building this kind of infrastructure. We also decided not to build the software platform as a monolithic structure, but to make it flexible enough to accommodate regional projects that had already established portals for a number of culture collections, whether in individual countries or in regions such as Asia.
The idea is that if a researcher has a question about a microorganism, instead of having to go to the online catalogues of the individual culture collections, the system would do it for the researcher. The Internet is conceived as a collection of data that is linked together by hyperlinks. These hyperlinks indicate connection points within and between various datasets. But the Internet does not lend itself very well to discovering new links and new ways of finding compatibilities between different datasets.
The approach we took to link microorganisms with all their downstream information is inspired by the “knuckles-and-nodes” model described by Lincoln Stein. The idea is to organize nodes of information into a number of thematic networks, each with its own hub, or knuckle, through which it interconnects with all the other networks. Some of these knuckles have already been established, so we simply needed to integrate them. The Bergey’s Manual, as the previous speaker described, could serve as a taxonomy knuckle, and a variety of bioinformatics knuckles are also available (e.g., public sequence databases bundled into the International Nucleotide Sequence Database Collaboration; INSDC). But what was missing was the organism knuckle, which would provide access to all the bacterial, archaeal and fungal resources that are in the culture collections and, by extension, in all public and private research collections.
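The knuckles-and-nodes idea can be sketched in a few lines of code. This is only an illustration with hypothetical record names, not the actual architecture: each thematic network is a set of nodes attached to a hub (knuckle), and any node reaches a node in another network by going hub to hub.

```python
# Illustrative sketch of the "knuckles-and-nodes" model.
# Each thematic network has a hub ("knuckle") that its nodes attach to;
# hubs interconnect, so cross-network queries route via two hubs.
# All record names below are hypothetical placeholders.

knuckles = {
    "taxonomy": {"Bergey's entry for genus X"},
    "sequences": {"INSDC record NC_000000"},
    "organisms": {"strain LMG 1234", "strain DSM 5678"},
}

def route(node_a, node_b):
    """Return the hub-to-hub path connecting two nodes."""
    hub_a = next(k for k, nodes in knuckles.items() if node_a in nodes)
    hub_b = next(k for k, nodes in knuckles.items() if node_b in nodes)
    return [node_a, hub_a, hub_b, node_b]

path = route("strain LMG 1234", "INSDC record NC_000000")
# The strain reaches its genome record via the organism and sequence hubs.
```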
So here’s another look at Lincoln Stein’s idea: a number of people put their data online in databases or simply as text documents. To bundle all this information together in what he calls knuckles requires the construction of some sort of integration network that helps discovery across these disparate data sources. One possible approach is to build an infrastructure on top of the disparate data sources in which globally unique identifiers are assigned to records, and pointers between the autonomous and heterogeneous data sources are discovered in an ongoing process.
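As a rough sketch of what such an infrastructure might look like (the class and record names here are my own, purely illustrative), one can keep a registry that assigns a globally unique identifier to every record harvested from a source, plus a growing list of discovered pointers between records:

```python
import uuid

# Minimal sketch of a GUID-based integration layer over autonomous,
# heterogeneous data sources. Names and identifiers are hypothetical.

class IntegrationIndex:
    def __init__(self):
        self.records = {}   # guid -> (source name, source-local identifier)
        self.links = []     # discovered pointers: (guid_a, guid_b, relation)

    def register(self, source, local_id):
        """Assign a globally unique identifier to a source-local record."""
        guid = str(uuid.uuid4())
        self.records[guid] = (source, local_id)
        return guid

    def link(self, guid_a, guid_b, relation):
        """Record a pointer discovered between two registered records."""
        self.links.append((guid_a, guid_b, relation))

index = IntegrationIndex()
strain = index.register("culture-collection-X", "LMG 1234")
genome = index.register("sequence-database-Y", "NC_000000")
index.link(strain, genome, "has_genome_sequence")
```

The point of the GUID layer is that each provider keeps its own local identifiers; only the integration layer needs to know that two local records refer to the same or related entities.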
By following this approach, you can test various hypotheses and answer different questions about the data. One particular question we focused on was to estimate how many organisms for which the complete genome sequence is available from public databases are also available from public culture collections. To answer this question, we took the integrated information from the culture collections and simply linked it with the Genomes OnLine Database (GOLD; www.genomesonline.org). What we found was a tremendous gap between the availability of genomic information and the availability of the sequenced organisms in public culture collections.
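In essence, once both datasets are linked on a common organism identifier, the comparison reduces to a set intersection. A toy version with made-up organism lists (not the real GOLD or collection data) looks like this:

```python
# Toy illustration of the gap analysis: intersect the set of sequenced
# organisms with the set of organisms obtainable from public culture
# collections. All entries below are invented for the example.

sequenced = {"Organism A", "Organism B", "Organism C", "Organism D"}
in_collections = {"Organism A", "Organism B"}

available = sequenced & in_collections          # sequenced AND obtainable
missing = sequenced - in_collections            # sequenced but not obtainable
gap = len(missing) / len(sequenced)

print(f"{gap:.0%} of sequenced organisms lack a publicly accessible strain")
# -> 50% of sequenced organisms lack a publicly accessible strain
```

The real analysis of course hinges on the identifier linking done beforehand; the set arithmetic itself is the trivial part.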
In bacterial taxonomy there is a rule that states that if you want to describe a novel species, you have to deposit its type strain in at least two culture collections in two different countries. This safeguards the continued availability of the species for further research. No similar rule applies when depositing and publishing the complete genome sequence of an organism. It seems natural that researchers would make the biological material available in order to add value to their publication of a whole-genome sequence. However, the results of our investigation show that for more than 50 percent of the complete genome sequences that have been deposited in the public sequence databases, the sequenced organism is not publicly accessible.