such that obtaining sequence data is no longer considered to be research, but merely a routine technical procedure carried out in the course of research, largely with fully automated and easy-to-use equipment operated by technicians. Many laboratories own automated sequencers or use central sequencing facilities in their research institutions for routine sequencing tasks. Others contract the work out to private companies or sequencing centers around the world.

The first complete genome sequence of a virus was determined in 1975 when the 3,569-nucleotide genome of MS2, an RNA virus that infects bacteria, was sequenced (Fiers et al., 1976). By the end of 2003, the complete genome sequences of more than 1,100 viral species were available in public databases. The genomes of bacteria and eukaryotic species are far larger, but in recent years determination of these sequences has also become routine. The Institute for Genomic Research (TIGR), a nonprofit institution in Rockville, Maryland, that has been a major participant in many whole-genome sequencing projects, has built a powerful infrastructure for determining DNA sequences accurately and quickly. In 1995, TIGR scientists published the first complete genome sequence of a free-living organism, the pathogenic bacterium Haemophilus influenzae, which contains 1.8 million nucleotides (Fleischmann et al., 1995). By November 2003, complete sequences of 140 bacteria had been deposited in genome databases worldwide, and at least 181 more were being determined. The genomes of dozens of eukaryotic organisms had also been completed by then, including those of plants, animals, fungi, and humans.

Genome Data and Analysis

The primary data that DNA sequencing generates consist of a long list of the letters A, T, C, and G in no apparent order. For whole genomes, the list can be very long. The human genome, for example, is more than 3 billion nucleotides long, and the genome of Yersinia pestis, the bacterium that causes plague and that devastated Europe in the Middle Ages, has about 4 million nucleotides.
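
To make the shape of these data concrete, here is a minimal sketch in Python that treats a sequence as a plain string of letters and reports its length and base counts. The function name summarize_sequence and the 20-letter fragment are invented for illustration; they are not real genome data and do not come from any sequencing tool or database.

```python
from collections import Counter


def summarize_sequence(seq: str) -> dict:
    """Return the length and per-base counts of a DNA sequence string."""
    seq = seq.upper()
    counts = Counter(seq)
    # Reject anything other than the four letters described in the text.
    unexpected = set(counts) - set("ATCG")
    if unexpected:
        raise ValueError(f"unexpected symbols: {sorted(unexpected)}")
    summary = {"length": len(seq)}
    for base in "ATCG":
        summary[base] = counts.get(base, 0)
    return summary


# A made-up fragment, for illustration only (hypothetical data).
example = "ATGCGTACCTGAAGTCCGTA"
summary = summarize_sequence(example)
print(summary)
```

A whole genome has the same form as this fragment, only with millions or billions of letters rather than twenty, which is why such data are stored and searched by computer rather than read by eye.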

To keep track of sequence data, the Los Alamos National Laboratory in 1982 opened a data repository called GenBank. The purpose was to create a single repository that would allow easy access to all sequence data as they became available. GenBank moved to the National Center for Biotechnology Information (NCBI) on the National Institutes of Health (NIH) campus in Bethesda, Maryland, in 1988 and has grown to an extraordinary degree in recent years. It now contains more than 30 million gene sequences from more than 130,000 species, comprising more than 36 billion nucleotides (GenBank, http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html).


