another complementary strand, where A base-pairs with T, and G base-pairs with C. The strands have a chemically defined direction, so that the strands of the double helix are complementary and have opposite polarities. This is the key to molecular genetics; the double-stranded DNA separates into single strands, and each strand is then used to template a new double-stranded molecule. Protein molecules are themselves linear macromolecules that are written in the alphabet of 20 amino acids. Genes then encode proteins using the genetic code, a mapping from triplets of bases to the 20 amino acids (including three triplets that spell "stop").
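The base-pairing rule and the genetic code described above can be sketched in a few lines of Python. This is a minimal illustration, not a production tool. The names `reverse_complement`, `translate`, and `GENETIC_CODE` are chosen here for illustration and do not come from the text. The table is the standard genetic code, and the sketch assumes the input is a DNA coding strand read 5' to 3', with the messenger RNA step skipped for brevity.

```python
from itertools import product

# Watson-Crick pairing: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand: str) -> str:
    """Return the complementary strand, read in its own direction.

    The two strands have opposite polarities, so the complement is
    reversed as well as base-swapped.
    """
    return "".join(COMPLEMENT[base] for base in reversed(strand))


# Standard genetic code in compact form: codons enumerated with the
# bases ordered T, C, A, G (third position varying fastest);
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
GENETIC_CODE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding strand codon by codon, halting at the first stop.

    Assumption (for illustration): translation starts at position 0,
    and any trailing partial codon is ignored.
    """
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = GENETIC_CODE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)
```

For example, `reverse_complement("ATGC")` gives `"GCAT"`, and `translate("ATGGCCTAA")` gives `"MA"`: methionine, alanine, then a stop. The table has 64 entries, of which 3 are stops and the remaining 61 map onto the 20 amino acids.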
In 1985 scientists began to discuss sequencing (reading the DNA) of the entire human genome. This was a huge leap from then-current technology. Only about 7 million letters were in the international databases; even if that was only half the then-sequenced DNA, it was a small fraction of the 3 billion letters of the human genome. And sequence length was daunting: a few sequences in the 50,000-letter range had been determined, but they were regarded as extraordinary feats. Reading a 15,000-letter sequence was more within the range of mortals. Even if chromosomes could be isolated, the 3 × 10^9 length would be reduced by only one order of magnitude. The first discussions were about creating a mega-project, where biology would be done in a factory. By the time the project had its official start in 1990, there was a big-science flavor to the project but the work was very distributed, involving many laboratories. The U.S. agencies funding the HGP are the National Institutes of Health and the Department of Energy; the project was to take 15 years and cost $3 × 10^9, a dollar per base. The HGP goals include the efficient and cost-effective development of genetic and physical maps of the human genome, the determination of the complete nucleotide sequence of the human genome, and the development of technology to achieve these goals. In addition, genetic and physical map and sequence information is to be collected for several model organisms such as mouse and yeast.
It was not evident to many that these goals were possible in the proposed time frame. Fortunately, progress in genome research has exceeded the initial guidelines, and a complete human genome sequence is hoped for by 2003. (See the recent review by Guyer and Collins (1995).) Some of the mathematical aspects of this project and of molecular biology are discussed in succeeding sections. GenBank is the U.S. public DNA database. Figure 1 shows GenBank growth since 1984; it is an indication of the explosion of knowledge in the biological sciences.
Human genomes are 99.9% identical; only 1 letter in 1,000 is different. Although any two human genomes are thus very closely related, one genome might encode Albert Einstein while another encodes John Lennon. Genetic variations can also cause cancer in one person and let another live to the age of 120. The differences matter! Genetic mapping uses variations in genomes to approximately locate a gene on a chromosome without knowing its precise identity.
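To put the 99.9% figure in perspective, a rough back-of-the-envelope calculation helps. The sketch below simply combines the two round numbers quoted in the text (a genome of 3 billion letters and 1 differing letter in 1,000). The function name `expected_differences` is hypothetical, and the result is an order-of-magnitude estimate, not a measured count.

```python
def expected_differences(genome_length: int, one_in: int) -> int:
    """Rough count of differing letters between two genomes.

    Assumes differences occur at a uniform rate of 1 in `one_in`
    letters, which is a simplification for illustration.
    """
    return genome_length // one_in


GENOME_LENGTH = 3 * 10**9  # letters in the human genome, as quoted in the text
ONE_IN = 1_000             # "only 1 letter in 1,000 is different"

print(f"{expected_differences(GENOME_LENGTH, ONE_IN):,}")
```

Under these assumptions, two human genomes differ at roughly 3 million positions, which is ample variation for genetic mapping to exploit.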