were 1,110 published complete genomes in the public literature. There are also 111 archaeal complete genomes, 3,342 ongoing bacterial projects, 1,165 ongoing eukaryotic genomes, and 200 metagenomes, for a total of nearly 6,000 sequencing projects of biological organisms that are in various stages of completion. It will be a big challenge to deal effectively with all this.42
In the future, single-cell projects will provide another major source of data. It is extraordinarily exciting to be able to sequence the genome of a single cell without growing it. It will, however, also be another source of microbial data with which a commons is going to have to deal.
The data flood is not stopping. It is not leveling off. It is increasing. Potential future projects that the Joint Genome Institute is talking about are in the terabase range—trillions of base-pairs. The institute is also engaged in some international projects.
All of this information is deposited in the Integrated Microbial Genomes (IMG) system. The IMG is a data management and analysis platform designed to get value from the sequence data produced by the Joint Genome Institute and other places.
Another facility that we support is the Environmental Molecular Sciences Laboratory (EMSL), which has high-throughput capabilities in nuclear magnetic resonance, mass spectrometry, reaction chemistries, molecular sciences computing, and so forth. We are aggressively exploring ways of putting these two facilities together.
In the future, we hope to issue a call for projects that entail both Joint Genome Institute sequencing and EMSL proteomic analyses—the kinds of projects that neither of those two facilities could do by itself but which, if they work together, can be tremendously valuable and provide yet another kind of data that a commons would want to include.
Our data sharing policies state that any publishable information resulting from research that we have funded “must conform to community recognized standard formats when they exist, be clearly attributable, and be deposited within a community recognized public database(s) appropriate for the research conducted.” There is no time element here, and it is left up to the community to determine what the standards should be. In sequencing, we have moved to the immediate release of raw reads, and reserved-analysis periods of more than 6 months are discouraged. Twelve months is the absolute maximum we will hold onto data without releasing it. A reserved analysis is anything that would compete with the stated scientific aims of the submitter of the project. We are also launching a knowledge base initiative to accelerate research and the integration and cross-referencing of data.
To sum up, there is just so much data being produced so rapidly that you feel that the rest of biology is not keeping up. I think this effort by the National Research Council is critically important.
42 As of the end of February 2011, there were 1,627 published complete genomes in the public literature. There are also 211 archaeal complete genomes, 5,790 ongoing bacterial projects, 2,002 ongoing eukaryotic genomes, and 308 metagenomes, for a total of nearly 10,000 sequencing projects of biological organisms that are in various stages of completion. Source: Genomes On Line database, http://www.genomesonline.org/cgi-bin/GOLD/bin/gold.cgi. This only underscores the challenges that we collectively (and a microbial commons effort) face.