The genome of an organism is its total genetic content or, more broadly, its entire DNA content, including nontranscribed, non-cis-regulatory regions of DNA such as centromeres and telomeres. The study of genomes, called genomics, includes research to determine the genetic and physical maps of genomes, the DNA sequences of genomes, the functions of genes and proteins, the cis-regulatory elements of genes, and the time, place, and conditions of gene expression. A prominent part of genomics has become the management of the massive amount of gathered information (a field referred to as bioinformatics) and the analysis of those data with regard to, for example, the organization of the genome, the comparison of genomes of different organisms, and global patterns of gene expression.
The Human Genome Project (HGP) was launched in October 1990 by the National Institutes of Health (NIH) as a federally funded initiative. The immediate goal was then, as it is now, to complete the accurate sequencing of the approximately 3.5 billion human DNA base pairs (the haploid amount) by the end of 2003 (F.S. Collins et al. 1998). A “rough draft sequence,” comprising approximately 90% of the human genome, was completed in mid-2000 (www.ornl.gov/hgmis/project/progress.html). In the longer term, a goal is to identify all human genes, and this identification is difficult. Even in an organism such as yeast, which is favorable for the identification of genes by mutational genetic analysis, more than half the genes had gone undetected until the genome sequence became available (Brown and Botstein 1999). The lack of detection was due in part to large redundant regions of the yeast genome. In vertebrates, mutational genetic analysis is much more difficult, and redundancy might be more widespread. Therefore, initial gene identification by sequencing is the approach of choice. A gene is initially identified either as an open reading frame (ORF), which can be discerned directly by inspecting the sequence, or as an expressed sequence tag (EST) site, a sequence complementary to a known piece of transcribed RNA (see below). Thereafter, the goal is to identify each gene as a sequence encoding a full-length RNA and a protein of known function. The functions of nontranscribed regions, such as the numerous large cis-regulatory regions that set the conditions for gene expression, will have to be elucidated as well. This task is more difficult still and currently involves a number of approaches, including the construction of transgenic animal lines carrying portions of the regulatory region in conjunction with a reporter gene (e.g., green fluorescent protein, GFP).
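The idea that an ORF "can be discerned directly" from the sequence can be illustrated with a minimal sketch. The function names (`find_orfs`, `orfs_in_strand`, `reverse_complement`), the default length threshold, and the use of the standard start and stop codons are assumptions made for illustration; they are not taken from the HGP or from any particular gene-finding program. The sketch scans all six reading frames for a start codon followed in frame by a stop codon. Because it ignores introns, it is at best a first pass for eukaryotic genomes.

```python
# Illustrative six-frame ORF scan. All names and the length threshold are
# hypothetical; the standard start codon (ATG) and stop codons are assumed.
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string containing A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def orfs_in_strand(seq, min_codons):
    """Yield (start, end, frame) for each ORF found on a single strand.

    start is the index of the first ATG seen since the previous stop codon
    in that frame; end is exclusive and includes the stop codon.
    ORFs still open at the end of the sequence are not reported.
    """
    for frame in range(3):
        orf_start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if orf_start is None and codon == START:
                orf_start = i
            elif orf_start is not None and codon in STOPS:
                # Count codons from ATG up to, but not including, the stop.
                if (i - orf_start) // 3 >= min_codons:
                    yield orf_start, i + 3, frame
                orf_start = None


def find_orfs(seq, min_codons=100):
    """Return ORFs on both strands as a list of dictionaries.

    Coordinates for the "-" strand refer to the reverse-complemented sequence.
    """
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for start, end, frame in orfs_in_strand(s, min_codons):
            results.append({
                "strand": strand,
                "frame": frame,
                "start": start,
                "end": end,
                "sequence": s[start:end],
            })
    return results


if __name__ == "__main__":
    # Toy sequence with one short ORF; the threshold is lowered to match.
    for orf in find_orfs("CCATGAAATTTTAGGG", min_codons=3):
        print(orf)
```

The minimum length threshold matters in practice: short start-to-stop stretches arise by chance in any sequence, so a cutoff (here an arbitrary default of 100 codons) trades missed short genes against false positives.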
The functional analysis of the genome, in terms of the time and place of expression of genes and the functions of the gene products, is sometimes called “functional genomics” or even “post-genomics.” The analysis of a protein’s function might be simple if the protein sequence resembles that of another well-understood protein and might be difficult if sequence motifs are absent. The analysis of