The human genome is divided into 23 chromosome pairs, consisting of 1 pair of sex chromosomes (XX or XY) and 22 pairs of autosomes. The number of genes in the 3 × 10⁹ nucleotides of the human DNA sequence is uncertain, although a reasonable guess is 50,000 to 100,000, based on the estimate that a typical gene is about 30,000 nucleotides long. This estimate is only rough, because genes can vary from 200 base pairs to 2 × 10⁶ base pairs in length, and because it is hard to draw a truly random sample.
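The arithmetic behind this guess can be sketched in a few lines of Python. The genome size and typical gene length are the figures quoted above; the function name and the `coding_fraction` parameter are hypothetical, introduced only for illustration. The upper figure of 100,000 corresponds to genes tiling the whole genome end to end, and reading the lower figure of 50,000 as genes covering about half the genome is an assumption of this sketch, not a statement from the text.

```python
GENOME_SIZE_BP = 3 * 10**9        # nucleotides in the human DNA sequence (from the text)
TYPICAL_GENE_LENGTH_BP = 30_000   # rough length of a typical gene (from the text)


def estimate_gene_count(genome_size_bp, gene_length_bp, coding_fraction=1.0):
    """Back-of-the-envelope gene count.

    coding_fraction is a hypothetical parameter: the share of the genome
    assumed to be occupied by genes (1.0 means genes tile the genome).
    """
    return round(genome_size_bp * coding_fraction / gene_length_bp)


upper = estimate_gene_count(GENOME_SIZE_BP, TYPICAL_GENE_LENGTH_BP)
lower = estimate_gene_count(GENOME_SIZE_BP, TYPICAL_GENE_LENGTH_BP, coding_fraction=0.5)
print(f"Estimated gene count: {lower:,} to {upper:,}")
```

Because gene lengths range over four orders of magnitude, the result is far more sensitive to the assumed typical length than to anything else in the calculation.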
Although molecular biologists refer to the human genome as if it were well defined in mathematicians' terms, it is recognized that, except for identical twins, no two humans have identical DNA sequences. Two genomes chosen from the human population are about 99.9 percent identical, affirming our common heritage as a species. But the 0.1 percent variation translates into some 3 million sequence differences, pointing to each individual's uniqueness. Common sites of sequence variations are called DNA polymorphisms. Most polymorphisms are thought to be functionally unimportant variations arising by mutation, having no deleterious consequences, and increasing (and decreasing) in frequency by stochastic drift. The presence of considerable DNA polymorphism in the population has sobering consequences for disease hunting. Even if it were straightforward to determine the entire DNA sequence of individuals (in fact, determining a single human sequence is the focus of the entire Human Genome Project), one could not find the gene for cystic fibrosis (CF) simply by comparing the sequences of a CF patient and an unaffected person: there would be too many polymorphisms.
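The figure of 3 million differences follows directly from the two numbers quoted above, as the short Python check below shows. The function name is hypothetical, and the calculation assumes for simplicity that differences are counted per nucleotide position.

```python
GENOME_SIZE_BP = 3 * 10**9   # nucleotides in the human DNA sequence (from the text)
SEQUENCE_IDENTITY = 0.999    # two genomes are about 99.9 percent identical (from the text)


def expected_differences(genome_size_bp, identity):
    """Expected number of positions at which two genomes differ."""
    return round(genome_size_bp * (1.0 - identity))


n_diff = expected_differences(GENOME_SIZE_BP, SEQUENCE_IDENTITY)
print(f"Expected sequence differences: {n_diff:,}")
```

A disease-causing mutation would be a single entry in that list of roughly 3 million differences, which is why a direct comparison of a patient's sequence with an unaffected person's sequence cannot single it out.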
How then does a geneticist find the genes responsible for cystic fibrosis, diabetes, or heart disease? The answer is to proceed hierarchically. The first step is to use a technique called genetic mapping to narrow down the location of the gene to about 1/1,000 of the human genome. The second step is to use a technique called physical mapping to clone the DNA from this region and to use molecular biological tools to identify all the genes. The third step is to identify candidate genes (based on the pattern of gene expression in different tissues and at different times) and look for functional sequence differences in the DNA (for example, mutations that introduce stop codons or that change crucial amino acids in a protein sequence) of affected patients. This chapter focuses on genetic mapping and physical mapping, because it turns out that each intimately involves mathematical analysis.
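To give a feeling for how much the first step accomplishes, the Python sketch below works out the size of a region covering 1/1,000 of the genome and, from the chapter's rough figures, how many genes and polymorphisms it might contain. The function name and dictionary keys are hypothetical. The gene and polymorphism counts assume a uniform density along the genome and genes of typical length packed end to end, which is a simplification made here for illustration only.

```python
GENOME_SIZE_BP = 3 * 10**9        # nucleotides in the human DNA sequence (from the text)
TYPICAL_GENE_LENGTH_BP = 30_000   # rough length of a typical gene (from the text)
MAPPING_NARROWING = 1_000         # genetic mapping narrows the search to about 1/1,000
BP_PER_DIFFERENCE = 1_000         # 0.1 percent variation = 1 difference per 1,000 bp


def candidate_region_summary(genome_size_bp=GENOME_SIZE_BP):
    """Rough size and contents of the region left after genetic mapping.

    Assumes uniform gene and polymorphism density (illustrative only).
    """
    region_bp = genome_size_bp // MAPPING_NARROWING
    return {
        "region_bp": region_bp,
        # Upper bound: typical-length genes packed end to end in the region.
        "max_genes": region_bp // TYPICAL_GENE_LENGTH_BP,
        # Differences expected between two individuals within the region.
        "expected_differences": region_bp // BP_PER_DIFFERENCE,
    }


summary = candidate_region_summary()
for name, value in summary.items():
    print(f"{name}: {value:,}")
```

Under these assumptions the region spans about 3 million nucleotides and holds at most about 100 genes, a number small enough for the second and third steps to examine one by one.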