(DDBJ), funded by the Ministry of Education, Culture, Sports, Science and Technology of Japan. A formal collaboration between these bodies (the International Nucleotide Sequence Database Collaboration1) ensures that the contents of all three are effectively identical at any time. Major achievements of the INSDC have been to make the submission of nucleic acid sequence data to one of the three databases mandatory for publication of any scientific paper that reports new sequence data2 and to define standards for such submissions. Today, all DNA sequencing done in the public sector is captured in the archives. Their extraordinary growth is shown in Figure 5-1.

It is no exaggeration to state that without the INSDC and the sequences stored in and made available through the collaborating databases, the success of the Human Genome Project and similar genome projects would have been impossible. These databases, and the analytical tools whose development was encouraged by the free availability of data, allow researchers to access the totality of the world’s public DNA and protein sequence data. It is vital for the metagenomics community to continue to adhere to accepted standards for the public deposition of data from community projects3 and to continue to encourage and enable the development of analytical tools and agreed-upon data-management practices (see also Chapter 6).

DNA sequence data submitted to the international archives are processed sequences; they are not the “raw” sequences that come directly from the sequencing machines. In the late 1990s, researchers recognized that public access to the raw sequence “traces” would also be of great value. The National Center for Biotechnology Information, the Wellcome Trust, and the European Bioinformatics Institute (EBI) therefore established the Trace Archive4 for these data. In December 2006, the Trace Archive contained over 1.4 billion traces from over 700 species. Despite the challenges arising from some of the new sequencing methods, timely deposition of raw sequence data to the Trace Archive by the metagenomics community will also be of great long-term benefit to the community.

The nucleic acid sequence data archives are a primary source of experimentally determined DNA and RNA sequences. Many types of analyses of genomes and individual genes, however, require protein sequences. Although historically these were experimentally determined, the great majority are now computationally predicted from DNA sequence data. This requires



Copyright © National Academy of Sciences. All rights reserved.