Chapter 2

Mapping Heredity: Using Probabilistic Models and Algorithms to Map Genes and Genomes

Eric S. Lander
Whitehead Institute for Biomedical Research and Massachusetts Institute of Technology

For scientists hunting for the genetic basis of inherited diseases, the human genome is a vast place to search. Genetic diseases can involve such subtle alterations as a one-letter misspelling in 3 billion letters of genetic information. To make the task feasible, geneticists narrow down genes in a hierarchical fashion by using various types of maps. Two of the most important maps (genetic maps and physical maps) depend intimately on mathematical and statistical analysis. This chapter describes how the search for disease genes touches on such diverse topics as the extreme behavior of Gaussian diffusion processes and the use of combinatorial algorithms for characterizing graphs.

The human genome is a vast jungle in which to hunt for genes causing inherited diseases. Even a one-letter error in the 3 × 10^9 base pairs of deoxyribonucleic acid (DNA) inherited from either parent may be sufficient to cause a disease. Thus, to detect inherited diseases, one must be able to detect mistakes present at just over 1 part in 10^10. The task is sometimes likened to finding a needle in a haystack, but this analogy actually understates the problem: the typical 2-gram needle in a 6,000-kilogram haystack represents a 1,000-fold larger target. In certain respects, the gene hunter's task is harder still, because it may be difficult to recognize the target even if one stumbles upon it.
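The needle-in-a-haystack comparison can be checked with a few lines of arithmetic. The Python sketch below is only an illustration; the function and constant names are hypothetical, and the diploid figure assumes that "inherited from either parent" means two copies of 3 × 10^9 base pairs each.

```python
# Back-of-the-envelope check of the needle-in-a-haystack comparison.
# All names here are illustrative; the numbers come from the text.

GENOME_BP = 3e9                  # base pairs inherited from one parent
NEEDLE_G = 2.0                   # mass of the needle, in grams
HAYSTACK_G = 6_000 * 1_000.0     # 6,000-kilogram haystack, in grams


def target_fractions():
    """Return (needle fraction, single-base fraction, fold difference)."""
    needle_fraction = NEEDLE_G / HAYSTACK_G      # needle as a share of the haystack
    base_fraction = 1.0 / GENOME_BP              # one base as a share of one genome copy
    return needle_fraction, base_fraction, needle_fraction / base_fraction


def diploid_fraction():
    """One base as a share of both parental copies (assumption: 2 x 3e9 bp)."""
    return 1.0 / (2 * GENOME_BP)


if __name__ == "__main__":
    needle, base, fold = target_fractions()
    print(f"needle / haystack      : {needle:.2e}")
    print(f"one base / one genome  : {base:.2e}")
    print(f"fold difference        : {fold:.0f}")
    print(f"one base / both copies : {diploid_fraction():.2e}")
```

Under these assumptions the needle occupies roughly 3 parts in 10^7 of the haystack, while a single base occupies roughly 3 parts in 10^10 of one genome copy, which is where the 1,000-fold figure comes from. Counting both parental copies gives a fraction slightly above 1 part in 10^10.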
The human genome is divided into 23 chromosome pairs, consisting of 1 pair of sex chromosomes (XX or XY) and 22 pairs of autosomes. The number of genes in the 3 × 10^9 nucleotides of the human DNA sequence is uncertain, although a reasonable guess is 50,000 to 100,000, based on the estimate that a typical gene is about 30,000 nucleotides long. This estimate is only rough, because genes can vary from 200 base pairs to 2 × 10^6 base pairs in length, and because it is hard to draw a truly random sample.

Although molecular biologists refer to the human genome as if it were well defined in mathematicians' terms, it is recognized that, except for identical twins, no two humans have identical DNA sequences. Two genomes chosen from the human population are about 99.9 percent identical, affirming our common heritage as a species. But the 0.1 percent variation translates into some 3 million sequence differences, pointing to each individual's uniqueness. Common sites of sequence variations are called DNA polymorphisms. Most polymorphisms are thought to be functionally unimportant variations, arising by mutation, having no deleterious consequences, and increasing (and decreasing) in frequency by stochastic drift.

The presence of considerable DNA polymorphism in the population has sobering consequences for disease hunting. Even if it were straightforward to determine the entire DNA sequence of individuals (in fact, determining a single human sequence is the focus of the entire Human Genome Project), one could not find the gene for cystic fibrosis (CF) simply by comparing the sequences of a CF patient and an unaffected person: there would be too many polymorphisms. How then does a geneticist find the genes responsible for cystic fibrosis, diabetes, or heart disease? The answer is to proceed hierarchically.
The first step is to use a technique called genetic mapping to narrow down the location of the gene to about 1/1,000 of the human genome. The second step is to use a technique called physical mapping to clone the DNA from this region and to use molecular biological tools to identify all the genes. The third step is to identify candidate genes (based on the pattern of gene expression in different tissues and at different times) and look for functional sequence differences in the DNA (for example, mutations that introduce stop codons or that change crucial amino acids in a protein sequence) of affected patients. This chapter focuses on genetic mapping and physical mapping, because it turns out that each intimately involves mathematical analysis.
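To get a feel for the scales involved, the figures quoted above can be combined in a short calculation. The Python sketch below is illustrative only; its names are hypothetical. The per-region numbers assume that genes and sequence differences are spread evenly along the genome, which the text does not claim and which is unlikely to hold exactly.

```python
# Rough scale estimates from the figures quoted in the text.
# Integer arithmetic is used throughout to avoid floating-point rounding.

GENOME_BP = 3_000_000_000            # nucleotides in the human DNA sequence
TYPICAL_GENE_BP = 30_000             # "typical" gene length from the text
GENE_COUNT_GUESS = (50_000, 100_000) # the text's guessed range for gene number
DIFFERENCES_PER_1000_BP = 1          # 0.1 percent variation between two genomes
REGION_DIVISOR = 1_000               # genetic mapping narrows to ~1/1,000 of the genome


def genome_estimates():
    """Return a dict of order-of-magnitude estimates (even spacing assumed)."""
    region_bp = GENOME_BP // REGION_DIVISOR
    return {
        # Upper bound if typical-length genes tiled the whole genome.
        "max_genes": GENOME_BP // TYPICAL_GENE_BP,
        # Expected differences between two randomly chosen genomes.
        "pairwise_differences": GENOME_BP // 1_000 * DIFFERENCES_PER_1000_BP,
        # Size of the region left after genetic mapping.
        "region_bp": region_bp,
        # Genes in that region, scaling the text's guess down by 1,000.
        "genes_in_region": tuple(n // REGION_DIVISOR for n in GENE_COUNT_GUESS),
        # Sequence differences still present inside the region.
        "differences_in_region": region_bp // 1_000 * DIFFERENCES_PER_1000_BP,
    }


if __name__ == "__main__":
    for name, value in genome_estimates().items():
        print(f"{name:>22}: {value}")
```

On these assumptions, a region of 1/1,000 of the genome still spans about 3 million base pairs, holds on the order of 50 to 100 genes, and contains thousands of sequence differences between any two people. This is consistent with the need for the second and third steps after genetic mapping.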