Page 25

Chapter 2—
Mapping Heredity:
Using Probabilistic Models and Algorithms to Map Genes and Genomes

Eric S. Lander
Whitehead Institute for Biomedical Research and Massachusetts Institute of Technology

For scientists hunting for the genetic basis of inherited diseases, the human genome is a vast place to search. Genetic diseases can involve such subtle alterations as a one-letter misspelling in 3 billion letters of genetic information. To make the task feasible, geneticists narrow down genes in a hierarchical fashion by using various types of maps. Two of the most important maps—genetic maps and physical maps—depend intimately on mathematical and statistical analysis. This chapter describes how the search for disease genes touches on such diverse topics as the extreme behavior of Gaussian diffusion processes and the use of combinatorial algorithms for characterizing graphs.

The human genome is a vast jungle in which to hunt for genes causing inherited diseases. Even a one-letter error in the 3×09 base pairs of deoxyribonucleic acid (DNA) inherited from either parent may be sufficient to cause a disease. Thus, to detect inherited diseases, one must be able to detect mistakes present at just over 1 part in 1010. The task is sometimes likened to finding a needle in a haystack, but this analogy actually understates the problem: the typical 2-gram needle in a 6,000-kilogram haystack represents a 1,000-fold larger target. In certain respects, the gene hunter's task is harder still, because it may be difficult to recognize the target even if one stumbles upon it.



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.
Terms of Use and Privacy Statement



Below are the first 10 and last 10 pages of uncorrected machine-read text (when available) of this chapter, followed by the top 30 algorithmically extracted key phrases from the chapter as a whole.
Intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text on the opening pages of each chapter. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.

Do not use for reproduction, copying, pasting, or reading; exclusively for search engines.

OCR for page 25
Page 25 Chapter 2— Mapping Heredity: Using Probabilistic Models and Algorithms to Map Genes and Genomes Eric S. Lander Whitehead Institute for Biomedical Research and Massachusetts Institute of Technology For scientists hunting for the genetic basis of inherited diseases, the human genome is a vast place to search. Genetic diseases can involve such subtle alterations as a one-letter misspelling in 3 billion letters of genetic information. To make the task feasible, geneticists narrow down genes in a hierarchical fashion by using various types of maps. Two of the most important maps—genetic maps and physical maps—depend intimately on mathematical and statistical analysis. This chapter describes how the search for disease genes touches on such diverse topics as the extreme behavior of Gaussian diffusion processes and the use of combinatorial algorithms for characterizing graphs. The human genome is a vast jungle in which to hunt for genes causing inherited diseases. Even a one-letter error in the 3×09 base pairs of deoxyribonucleic acid (DNA) inherited from either parent may be sufficient to cause a disease. Thus, to detect inherited diseases, one must be able to detect mistakes present at just over 1 part in 1010. The task is sometimes likened to finding a needle in a haystack, but this analogy actually understates the problem: the typical 2-gram needle in a 6,000-kilogram haystack represents a 1,000-fold larger target. In certain respects, the gene hunter's task is harder still, because it may be difficult to recognize the target even if one stumbles upon it.

OCR for page 25
Page 26 The human genome is divided into 23 chromosome pairs, consisting of 1 pair of sex chromosomes (XX or XY) and 22 pairs of autosomes. The number of genes in the 3 × 109 nucleotides of the human DNA sequence is uncertain, although a reasonable guess is 50,000 to 100,000, based on the estimate that a typical gene is about 30,000 nucleotides long. This estimate is only rough, because genes can vary from 200 base pairs to 2 × 106 base pairs in length, and because it is hard to draw a truly random sample. Although molecular biologists refer to the human genome as if it were well defined in mathematicians' terms, it is recognized that, except for identical twins, no two humans have identical DNA sequences. Two genomes chosen from the human population are about 99.9 percent identical, affirming our common heritage as a species. But the 0.1 percent variation translates into some 3 million sequence differences, pointing to each individual's uniqueness. Common sites of sequence variations are called DNA polymorphisms. Most polymorphisms are thought to be functionally unimportant variations—arising by mutation, having no deleterious consequences, and increasing (and decreasing) in frequency by stochastic drift. The presence of considerable DNA polymorphism in the population has sobering consequences for disease hunting. Even if it were straightforward to determine the entire DNA sequence of individuals (in fact, determining a single human sequence is the focus of the entire Human Genome Project), one could not find the gene for cystic fibrosis (CF) simply by comparing the sequences of a CF patient and an unaffected person: there would be too many polymorphisms. How then does a geneticist find the genes responsible for cystic fibrosis, diabetes, or heart disease? The answer is to proceed hierarchically. The first step is to use a technique called genetic mapping to narrow down the location of the gene to about 1/1,000 of the human genome. The second step is to use a technique called physical mapping to clone the DNA from this region and to use molecular biological tools to identify all the genes. The third step is to identify candidate genes (based on the pattern of gene expression in different tissues and at different times) and look for functional sequence differences in the DNA (for example, mutations that introduce stop codons or that change crucial amino acids in a protein sequence) of affected patients. This chapter focuses on genetic mapping and physical mapping, because it turns out that each intimately involves mathematical analysis.

OCR for page 25
Page 27 Genetic Mapping The Concept of Genetic Maps Genetic mapping is based on the perhaps counterintuitive notion that it is possible to find where a gene is without knowing what it is. Specifically, it is possible to identify the location of an unknown disease-causing gene by correlating the inheritance pattern of the disease in families with the inheritance pattern of known genetic markers. To understand the foundation of genetic mapping, it is useful to return to the work of Gregor Mendel. Based on his experiments with peas (see Chapter 1), Mendel concluded that individuals possess two copies, called alleles, of each gene. Mendel's Laws of Inheritance are as follows: ·First Law. For any gene, each parent transmits one allele chosen at random to its offspring. ·Second Law. For any two genes, the alleles transmitted by a parent are independent (that is, there is no correlation in the alleles transmitted). Although Mendel's First Law has held up well over the past 130 years, the Second Law turned out to be false in general. Two genes on different chromosomes show no correlation in their inheritance pattern, but genes on the same chromosome typically show correlation. Consider the backcross in Figure 2.1, showing the inheritance of two genes A and B on the same chromosome. The F1 individual carries one chromosome with alleles a1 and b1 at the two genes and another chromosome with alleles a2 and b2. Often, one or the other chromosome is transmitted completely intact to the offspring. If this always happened, the inheritance pattern at the two genes would be completely dependent: a1 would always be co-inherited with b1. But the situation is more interesting. Crossing over can occur at random points along the chromosomes, involving an even swap of DNA material. If a crossover occurs between genes A and B, it results in recombination between the genes, producing a chromosome carrying a new combination of alleles: alb2 or a2b1. In fact, multiple crossovers can occur along a chromosome;

OCR for page 25
Page 28 Figure 2.1  Schematic drawing of genetic recombination in an F 1 heterozygote with distinct alleles at two loci (marked as A and B) on a chromosome. When no recombination occurs between A and B in meiosis, chromosomes carrying the original pair of alleles result. When recombination occurs, the resulting chromosomes carry a new combination of alleles. recombination between two loci will result whenever an odd number of crossovers occur. Genetic mapping is based on the recognition that the recombination frequency q between two genes (or loci) provides a measure of the distance between them. If two genes are close together, q will be small. If they lie farther apart, q will be larger. If the recombination frequency is significantly less than 0.50, the genes are said to be linked. The first genetic linkage map (Figure 2.2), showing the location of six genes in the fruit fly Drosophila melanogaster, was constructed in 1911 by Alfred Sturtevant, when he was still a sophomore at Columbia University en route to a career as a great geneticist. Genetic maps were a triumph of abstract mathematical reasoning: Sturtevant was able to chart the location of mutations affecting fly development—even though he understood neither the biochemical basis of the defects nor even that genes were made of DNA! The genetic distance dA,B between two genes A and B (measured in units called morgans, after the fly geneticist Thomas Hunt Morgan) is defined as the expected number of crossovers between the genes. If one assumes that crossovers are distributed independently with respect to one

OCR for page 25
Page 29 Figure 2.2  The first genetic map, showing six loci  on the  Drosophila  X chromosome, was constructed by Alfred Sturtevant in 1911. another, genetic distance can easily be converted into recombination frequency. (This assumption of independence is not quite right but is adequate for many purposes.) For the number of crossovers between genes A and B will then be Poisson distributed with mean d = dA,B, and so the probability of an odd number of crossovers can be shown (by summing alternate terms of the Poisson distribution) to be This equation, relating recombination frequency to genetic distance, is known as Haldane's mapping function. For small distances, the formula reduces to q » d, which reflects the fact that the possibility of more than one crossover can be neglected. For large distances d, the recombination frequency q approaches 0.50—that is, independent assortment. Mammalian chromosomes are typically 1 to 3 morgans in length. Geneticists typically report distances between genes in centimorgans. Incidentally, genetic distance between two genes is not necessarily proportional to the physical distance measured in nucleotides. Since crossing over is a biological process carried out by enzymes acting on the chromosome, the distribution of crossovers need not be (and typically is not) uniform with respect to the DNA sequence. Accordingly, molecular geneticists work with two different kinds of maps: genetic maps based on crossover frequency and physical maps based on nucleotide distances. There can be considerable inhomogeneity between the maps, although human geneticists often employ the rough rule of thumb of 1 centimorgan »1 megabase. (The relationship between genetic and physical length varies among organisms: 1 centimorgan is about 2 megabases in the mouse, about 200 kilobases in the nematode worm Caenorhabditis elegans, and about 3 kilobases in yeast.)

OCR for page 25
Page 30 Genetic mapping is an essential first step in characterizing a new mutation causing an interesting phenotype (that is, trait). Consider first the situation of (1) a laboratory organism in which experimental matings can be set up at will and (2) traits that are monogenic and fully penetrant (that is, the phenotype is completely determined by the genotype at a single gene). For example, a Drosophila geneticist might find a dominantly acting mutation at a locus A causing flies to have an extra set of wings (in fact, such mutations exist). He would set up crosses with strains carrying different genetic markers (that is, variants in other genes of known location) in order to find the regions showing correlated inheritance. Figure 2.3A shows a backcross of this type. The gene A is clearly not linked to locus B but is tightly linked to locus C. The proportion of recombinant chromosomes provides a straightforward statistical estimator of the recombination frequency. In this case, the recombination frequency between A and C is about 20/200 = 10 percent. The gene A can be positioned more precisely by using the three-point cross shown in Figure 2.3B, in which two nearby genetic markers are segregating. Here, it is clear that A maps about midway between genes C and D (see figure caption). For experimental organisms and simple traits, genetic mapping provides a straightforward way to locate the trait-causing gene to a small interval—typically about I centimorgan or less. In essence, one need only ''count recombinants." Because the analysis is so easy, Drosophila geneticists rarely need to appeal to statistical or mathematical concepts. For geneticists studying human families or complex traits, however, the situation is quite different. Challenges of Genetic Mapping: Human Families and Complex Traits Medical geneticists studying diseases face two major problems: (1) for human diseases, one cannot arrange matings at will but rather must retrospectively interpret existing families; and (2) for both human diseases and animal models of these diseases, the trait may not be simply related to the genotype at a single gene. Owing to these complications, genetic mapping of disease genes often requires sophisticated mathematical analysis.

OCR for page 25
Page 31 The first problem is the inability to arrange matings. To offset this limitation, human geneticists need to have a huge collection of frequent, naturally occurring genetic markers so that the inheritance pattern of each chromosomal region can be followed just as if one had deliberately set up a cross incorporating specific genetic markers. Throughout most of the century, only a small number of such naturally occurring genetic markers were known (an example is the ABO blood types), and thus human genetic mapping remained a dormant field. In 1980, David Botstein set off a revolution by recognizing that naturally occurring DNA polymorphisms in the human population filled the need (Botstein et al., 1980). By 1994, over 4,000 DNA polymorphisms had been identified and mapped relative to one another. Even with a dense genetic map of DNA polymorphisms, human genetic mapping confronts several special problems of incomplete information: · For individuals homozygous (a1 / a1) at a gene, one cannot distinguish at this location between the two homologous chromosomes (that is, the maternally and paternally inherited copies of the chromosome). · For individuals heterozygous (a1 / a2) at a gene, one cannot tell which allele is on the paternal chromosome and which is on the maternal chromosome unless one can study the individual's parents. · Information for deceased individuals (or for those who choose not to participate in a genetic study) is completely missing from the pedigree. Because of these uncertainties, one often cannot simply count recombinants to estimate recombination frequencies. Another problem is that many traits and diseases do not follow simple Mendelian rules of inheritance. This problem has several aspects: ·Incomplete penetrance. For some "disease genes," the probability that an individual inheriting the disease gene will have the disease phenotype may be less than 1. This probability is called the penetrance of the disease genotype. Penetrance may depend

OCR for page 25
Page 32 Figure 2.3 Examples of three-point crosses. (A) Locus A is  unlinked to locus B but is linked to locus C at a recombination fraction of 10 percent.

OCR for page 25
Page 33 Figure 2.3 (B) Locus A is located between loci C and D, at about 10 percent recombination  fraction from each. The first two types of progeny involve chromosomes with no recombination; the next four involve a single recombination, and the last two involve double recombination (between C-A and A-D). The double recombinant class is always least frequent, a property that allows one to determine the order of three linked loci from a cross in which they are all segregating. on other unknown genes, age, environmental exposure, or random chance. For example, a gene called BRCA1 on chromosome 17 predisposes to early onset of breast cancer in some women, but the penetrance is estimated to be about 60 percent by age 50 and 85 percent by age 80. As a result, one cannot conclude that an unaffected person has inherited a normal copy of the gene.

OCR for page 25
Page 34 · Phenocopy. Some diseases can be due to nongenetic causes. For example, colon cancer can be caused by mutations in the APC gene on human chromosome 5, but most cases of colon cancer are thought to be nongenetic in origin (and are often attributed to diet). As a result, one cannot conclude that an affected person has necessarily inherited the disease genotype. · Genetic heterogeneity. Some diseases may be caused by mutations in any one of several different genes. Thus, a disease may show linkage to a genetic marker in some families but not in others. · Polygenic inheritance. Some diseases may involve the interaction of mutations at several different genes simultaneously. Due to the incomplete information on natural families and the uncertainties of complex genetic traits, a human geneticist often cannot reliably infer an individual's genotype based on his or her phenotype; inferences are probabilistic at best. As a result, genetic mapping requires more sophisticated analytical methods than simply counting recombinants between a disease gene and nearby markers. Animal models of human diseases are slightly simpler, inasmuch as experimental crosses can be arranged. Still, interesting diseases typically show complex inheritance even in inbred animal strains. For example, mouse and rat models of diabetes involve incomplete penetrance, phenocopies, and polygenic inheritance. Sophisticated analytical tools are thus needed for such genetic mapping as well. Maximum Likelihood Estimation To handle the problem of incomplete information, geneticists have adopted the statistical approach of maximum likelihood estimation. Briefly sketched below is the basic formulation (see, e.g., Ott, 1991). In most cases, a geneticist needs to estimate a parameter q—for example, the recombination frequency between a disease gene and a genetic marker or the mean increase in blood pressure attributable to a putative gene at a specific location along the chromosome. The geneticist would ideally like to have complete genotypic data X—for example, the

OCR for page 25
Page 35 genotype for every family member, including the precise parental chromosome from which each allele was inherited. Given complete information, it is usually easy to estimate the required parameters—for example, the recombination frequency can be estimated by counting recombinant chromosomes, and the penetrance can be estimated by finding the proportion of individuals with a disease-predisposing genotype that manifests the disease. Unfortunately, one typically has only incomplete data Y from which it is difficult to estimate q directly. The maximum likelihood estimate is the value that makes the observed data Y most likely to have occurred, that is, the value that maximizes the likelihood function, L(q): = P(Y|q). Using Bayes Theorem, one can calculate L(q) by the equation: where the summation is taken over all possible values for the complete data X. The sum is easy to calculate in theory: it decomposes into various terms (corresponding to each individual and each genetic interval) that are conditionally independent, given complete data X. One can then calculate by numerical maximization methods. To determine whether is significantly different from a null value q 0 (for example, to see whether an estimated recombination frequency is significantly less than 50 percent), one examines the likelihood ratio Z = L() / L(q0). If Z exceeds some appropriate threshold T, a statistically significant effect has been found. In principle, virtually any genetic problem can be treated by this approach. In practice, however, two important issues arise: (1) efficient algorithms and (2) statistical significance. Efficient Algorithms The number of terms in the sum L(q) scales as roughly O(ecmn) where m is the number of people in the family, n is the number of genetic markers studied, and c is a constant. Except in the case of the smallest

OCR for page 25
Page 45 consecutive fragments, it is a good bet that they overlap. With this strategy, Kohara and colleagues (1987) constructed a complete physical map of the bacterium Escherichia coli with a genome of 4.6 million base pairs using phage clones containing fragments of about 15,000 base pairs. Restriction fragment sizes. Rather than constructing an ordered list of the restriction fragments, one can construct an unordered list. This turns out to be technically simpler, because one need not carefully control the rate of cutting as in partial digestion. Clones can instead be digested to completion and the fragment lengths measured. Although the unordered list contains less information, it can still provide an adequate fingerprint. For Figure 2.8 Schematic diagram illustrating restriction mapping of a DNA fragment by partial digestion. The DNA fragment at the top has several sites (denoted by E) that can be cleaved by the restriction enzyme EcoRI. A large collection of molecules of  this DNA fragment is radioactively labeled at one end (denoted by a star) and then exposed briefly to the restriction enzyme. The period of exposure is sufficiently brief that the enzyme can cleave only about one site per molecule, resulting in a collection of radioactively labeled fragments terminating at the various E sites. The length of these fragments (and thus the positions of the E  sites) can be determined by gel electrophoresis of the fragments and subsequent exposure of the gel to x-ray film.

OCR for page 25
Page 46 example, Olson and colleagues (1986) used this approach to construct a physical map of the yeast Saccharomyces cerevisiae with a genome of 13 million base pairs. · Content of sequence tagged sites. For very large genomes such as the human genome with 3 ´ 109 base pairs, it is necessary to work with large subclones of length > 100,000 base pairs. For such large subclones, a different fingerprinting strategy has gained favor in recent years. The method is based on sequence tagged sites (STSs), which are very short unique sequences taken from the genome which can be easily assayed by the polymerase chain reaction (PCR). The fingerprint of a clone is the list of STSs contained within it; the data form an incidence matrix of clones by STSs (Figure 2.9). Clones containing even a single unique STS in common should overlap. As an aside, the determination of which clones contain a given STS is typically made using a combinatorial pool scheme that avoids having to test each STS against each clone (Green and Olson, 1990). Using this approach, Foote et al. (1992) and Chumakov et al. (1992) constructed the first complete maps of human chromosomes (Y and 21, respectively). Regardless of the experimental details of the fingerprinting scheme, there are two key mathematical issues pertinent to the construction of a physical map: 1. Algorithms for map assembly. Given the fingerprinting data, what algorithm should be used for constructing a physical map? This question is closely related to graph theory: given information about adjacency among clones inferred from their fingerprints, one must reconstruct the underlying geometry of the physical map. 2. Statistics of coverage. How many clones must be studied to yield a map covering virtually the entire genome? This question belongs to probability theory: assuming that subclones are distributed randomly across the genome, one needs to know the distribution of gaps—uncovered regions or undetected overlaps—in the map.

OCR for page 25
Page 47 Figure 2.9 Schematic diagram illustrating the principle of STS content mapping. Various unique points in the genome, called STSs, are tested against a collection of random large-insert clones, such as YACs, to determine which STSs are contained in which YACs. Based on the resulting adjacency matrix, one  attempts to reconstruct the order of the STSs in the genome. "Contigs," consisting of groups of STSs connected by YACs, are assembled based on the adjacency data. In the figure, the STSs can be grouped into two contigs. Mathematical analysis is thus essential to the design and execution of physical mapping projects (Arratia et al., 1991; Lander and Waterman, 1988). This is illustrated in a discussion below of the considerations involved in making a physical map of the entire human genome. Excursion: Designing a Strategy to Map the Human Genome Under the auspices of the Human Genome Project, our laboratory is engaged in constructing complete physical maps of the mouse and human genomes, each about 3 x109 base pairs in length. The task is daunting, requiring analysis of tens of thousands of clones, each carrying extremely large DNA fragments. Before undertaking such a project, it was crucial to perform careful analysis to identify the best strategy. Currently, the best clones for making a human physical map are yeast artificial chromosomes (YACs). A good YAC library might contain inserts of about 1 million base pairs in length. Even with such large inserts, it would take 3,000 YACs to cover the human genome if they were laid end-to-end. Of course, clones taken from an actual library will be arrayed randomly, and so considerably more clones are required to ensure coverage. As noted above, the best fingerprint for studying YACs is STS content mapping. Each STS is screened simultaneously against the entire YAC library to identify the clones that contain it. Because STSs are screened in

OCR for page 25
Page 48 parallel, it is most efficient to work with a fixed YAC library and to test STSs sequentially. For mathematical analysis of physical mapping, the YACs and STSs can be abstracted to a set I of intervals (which may vary in size) and a set P of points distributed randomly along a line segment. An interval is said to be anchored if it contains at least one point p Î P. Two anchored intervals I1 and I2 are said to be connected if there is a point p Î P contained in their intersection. Note that two intervals may overlap but fail to be connected. If we take the transitive closure of the connectivity relation, the resulting equivalence classes of anchored intervals are called anchored "contigs." (For the purpose of the exposition, a definition is used that differs slightly from that in Arratia et al. (1991), in which contigs refer only to equivalence classes containing at least two intervals.) The key question is: How many intervals and how many points should be analyzed to construct a reasonably complete physical map—that is, one in which the vast majority of the genome is contained in a modest number of large contigs? We define the following notation: G, the length of the genome in base pairs; L, the length of a random clone in base pairs, a random variable; L, the expected length of a random clone, L = E(L); N, the number of clones to be used; M, the number of STSs to be used; a = LN/G, the expected number of clones covering a random STS; and b = LM/G, the expected number of STSs contained in a random clone. Clone lengths L will be assumed to be independent, identically distributed random variables, with the probability density function of the normalized length l = L/L denoted by f(l) and the inverse cumulative distribution function (also called the survival function) denoted F(I) = P(l/L > x). It is also useful to define the auxiliary function

OCR for page 25
Page 49 which can be interpreted as the probability that two points separated by distance x are not covered by a common clone. The problem belongs to the area of coverage problems, which treat processes of covering a space with random sets of a given sort. Often, mathematical authors focus on the goal of attaining complete coverage. Such results are not really appropriate from a biological standpoint—because they depend sensitively on the distribution of covering sets being absolutely random, an assumption that is biologically implausible. Instead, it is more sensible to focus on central behavior—that is, the goal of covering most of the space. STS content mapping poses a slightly unusual coverage problem, because the definition of coverage involves joining together random intervals with random points. It is nonetheless possible to analyze many features of the stochastic process in order to derive many prescriptive results. Arratia and colleagues (1991) proved the following result, which describes the basic coverage properties: Proposition: With the notation as above, (1) the expected number of anchored contigs is Np1, where (2) the expected length of an anchored contig is l E(L), where (3) the expected proportion r0 of the genome not covered by anchored contigs is

OCR for page 25
Page 50 Figure 2.10, taken from Arratia et al. (1991), plots these functions for the case of clones of constant size. From these graphs, experimentalists can plan their experimental approach. For our own physical mapping project in the human genome, the typical clone size is about 1 ´ 106 base pairs. Based on the trade-offs between screening more YACs and using more STSs, we selected a = 6 and b = 3—corresponding to about 18,000 YACs and about 9,000 STSs. This selection should ensure that about r0 » 99 percent of the genome is covered, with about 850 anchored contigs having average length of about 3.5 megabases. Having explored the question of experimental design, it is worth briefly discussing the issues involved in data analysis. The process of STS content mapping may consume several person-years of laboratory work, but the final result will simply consist of a large (18,000 ´ 9,000) adjacency matrix A = (aij), with aij = 1 or 0 in position i,,j according to whether YACi contains STSj. Based on this information, how do we determine the correct order of the STSs in the genome? In principle, a proposed order of the STSs is consistent with the observed data if and only if permuting the columns of the adjacency matrix A according to this order causes A to have the consecutive ones property-that is, in each row, the ones occur in a single consecutive block. This property follows from the fact that each YAC should consist of a single connected interval taken from the genome (see Figure 2.9). The consecutive ones property has been extensively studied in computer science. Booth and Leuker (1975) devised an elegant linear-time algorithm for solving the problem in a very strong sense: Given a (0,1)-matrix A with n rows and m nonzero entries, the algorithm needs a running time of only O(m + n) to determine whether there is any column permutation causing the matrix to have the consecutive ones property and, if so, to produce a simple representation of all such column permutations. In practice, there is a serious problem with this approach: it assumes that the data are absolutely error-free. However, laboratory work is never flawless and certainly not when the task involves filling in 162 million entries in an adjacency matrix. If even a few errors are present, the Booth-Leuker algorithm is almost certain to report that there is no consistent order! In fact, there are likely to be many errors, including

OCR for page 25
Page 51 · False negatives: one may fail to identify some proportion of the YACs containing an STS; · False positives: some proportion of the YACs detected as containing an STS may not actually do so; and · Chimeric YACs: some proportion of the YACs may not represent a single contiguous region, but two unrelated regions that have been joined together in a single clone. Moreover, the occurrence of false negatives and positives may not be random but systematic (owing to deletions of clones or contamination of samples). In short, algorithms must be robust to errors in the data. Producing such algorithms is an interesting challenge that draws on methods from graph theory, operations research, and statistics. As of this writing, the best approach has not yet been determined. Conclusion Genetic and physical mapping are key tools for describing the function and structure of chromosomes. Only in the simplest cases is such mapping completely devoid of mathematical issues. In the case of human genetics, mathematics plays a crucial role. In essence, mapping problems—like many problems in computational biology—involve indirect inference of the structure of a biological entity, such as a chromosome, based on whatever data can be effectively gathered in the laboratory. It is not surprising that mapping problems draw on statistics, probability, and combinatorics. Although the field of mapping dates nearly to the beginning of the 20th century, the area remains rich with new challenges—because new laboratory methods constantly push back the frontiers of the maps and features that can be mapped in DNA.

OCR for page 25
Page 52

OCR for page 25
Page 53 Figure 2.10 Expected coverage properties for STS content mapping, as a function of the coverage  a in YACs and b  in STSs. Calculations assume YACs of constant length  L  and a genome of length  G. The graphs show (A) the expected proportion of the genome covered by anchored ''contigs"; (C) the expected number of anchored contigs, and (D) the expected length of an anchored contig. Graphs show the situation for  a = 1,2,. . .,10 (only the cases a = 1,5,10 are explicitly marked). Results are expressed in units of  G/L.  Table B lists the value of  G/L  for certain representative genomes and cloning vectors, including two different sizes of YACs. Reprinted, by permission, from Arratia et al. (1991). Copyright © 1991 by  Genomics.

OCR for page 25
Page 54 References Arratia, R., E.S. Lander, S. Tavaré, and M.S. Waterman, 1991, "Genomic mapping by anchoring random clones: A mathematical analysis," Genomics 11, 806-827. Bodmer, W.F., C.J. Bailey, J. Bodmer, H.J.R. Bussey, A. Ellis, P. Gorman, F.C. Lucibello, V.A. Murday, S.H. Rider, P. Scambler, D. Sheer, E. Solomon, and N.K. Spurr, 1987, "Localization of the gene for familial adenomatous polyposis on chromosome 5," Nature 328, 614-616. Booth, K.S., and G.S. Leuker, 1975, "Testing the consecutive ones property, interval graphs, and graph planarity using PQ-tree algorithms," Journal of Computational Systems Science 13, 335-379. Botstein, D., R.L. White, M. Skolnick, and R.W. Davis, 1980, "Construction of a genetic linkage map in man using restriction fragment length polymorphisms," American Journal of Human Genetics 32, 314-331. Chumakov, I., P. Rigault, S. Guillou, P. Ougen, A. Billaut, G. Guasconi, P. Gervy, I. Le Gall, P. Soularue, and L. Grinas, 1992, "Continuum of overlapping clones spanning the entire human chromosome 21q," Nature 359, 380-387. Dietrich, W., H. Katz, S.E. Lincoln, H.-S. Shin, J. Friedman, N.C. Dracopoli, and E.S. Lander, 1992, "A genetic map of the mouse suitable for typing intraspecific crosses," Genetics 131, 423-447. Dietrich, W.F., E.S. Lander, J.S. Smith, A.R. Moser, K.A. Gould, C. Luongo, N. Borenstein, and W. Dove, 1993, "Genetic identification of Mom-1, a major modifier locus affecting Min-induced intestinal neoplasia in the mouse," Cell 75, 631-639. Elston, R.C., and J. Stewart, 1971, "A general model for the analysis of pedigree data," Human Heredity 21, 523-542. Feingold, E., P.O. Brown, and D. Siegmund, 1993, "Gaussian models for genetic linkage analysis using complete high resolution maps of identity by descent," American Journal of Human Genetics 53, 234-251. Foote, S., D. Vollrath, A. Hilton, and D.C. Page, 1992, "The human Y chromosome: Overlapping DNA clones spanning the euchromatic region," Science 258, 60-66. Green, E., and M.V. Olson, 1990, "Systematic screening of yeast artificial chromosome libraries using the polymerase chain reaction," Proceedings of the National Academy of Sciences USA 87, 1213-1217. Groden, J., A. Thliveris, W. Samowitz, M. Carlson, L. Gelbert, H. Albertsen, G. Joslyn, J. Stevens, L. Spirio, M. Robertson, L. Sergeant, K. Krapcho, E. Wolff, R. Burt, J.P. Hughes, J. Warrington, J. McPherson, J. Wasmuth, D. Le Paslier, H. Abderrahim, D. Cohen, M. Leppert, and R. White, 1991, "Identification and characterization of the familial adenomatous polyposis coli gene," Cell 66, 589-600. Kinzler, K.W., M.C. Nilbert, L.K. Su, B. Vogelstein, T.M. Bryan, D.B. Levy, K.J. Smith, A.C. Preisinger, P. Hedge, D. McKechnie, R. Finniear, A. Markham, J. Groffen, M.S. Boguski, S.F. Altschul, A. Horii, H. Ando, Y. Miyoshi, Y. Miki, I. Nishisho, and Y. Nakamura, 1991, "Identification of FAP locus genes from chromosome 5q21," Science 253, 661-665.

OCR for page 25
Page 55 Kohara, Y., A. Akiyama, and K. Isono, 1987, "The physical map of the whole E. coli chromosome: Applications of a new strategy for rapid analysis and sorting of a large genomic library," Cell 50, 495-508. Kong, A., 1991, "Efficient methods for computing linkage of recessive diseases in inbred pedigrees," Genetics and Epidemiology 8, 81-103. Kong, A., M. Frigge, N. Cox, and W.H. Wong, 1992a, "Linkage analysis with adjustments for covariates: A method combining peeling with Gibbs sampling," Cytogenetics and Cell Genetics 59, 208-210. Kong, A., M. Frigge, M. Irwin, and N. Cox, 1992b, "Importance sampling. I. Computing multimodel p values in linkage analysis," American Journal of Human Genetics 51, 1413-1429. Kong, A., N. Cox, M. Frigge, and M. Irwin, 1993, "Sequential imputation for multipoint linkage analysis," Genetics and Epidemiology 10, 483488. Lander, E.S., and D. Botstein, 1989, "Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps," Genetics 121, 185-199. Lander, E.S., and P. Green, 1987, "Construction of multilocus genetic linkage maps in humans," Proceedings of the National Academy of Sciences USA 84, 2363-2367. Lander, E.S., and M.S. Waterman, 1988, "Genomic mapping by fingerprinting random clones: A mathematical analysis," Genomics 2, 231-239. Leppert, M., M. Dobbs, P. Scambler, P. O'Connell, Y. Nakamura, D. Stauffer, S. Woodward, R. Burt, J. Hughes, E. Gardner, M. Lathrop, J. Wasmuth, J.M. Lalouel, and R. White, 1987, "The gene for familial polyposis coli maps to the long arm of chromosome 5," Science 238, 1411-1413. Moser, A.R., H.C. Pitot, and W.F. Dove, 1990, "A dominant mutation that predisposes to multiple intestinal neoplasia in the mouse," Science 247, 322-324. Nishisho, I., Y. Nakamura, Y. Miyoshi, Y. Miki, H. Ando, A. Horii, K. Koyama, J. Utsunomiya, S. Baba, P. Hedge, A. Markham, A.J. Krush, G. Peterson, S.R. Hamilton, M.C. Nilbert, D.B. Levy, T.M. Bryan, A.C. Preisinger, K.J. Smith, L.K. Su, K.W. Kinzler, and B. Vogelstein, 1991, "Mutations of chromosome 5q21 genes in FAP and colorectal cancer patients," Science 253, 665-669. Olson, M.V., J.E. Dutchik, M.Y. Graham, G.M. Brodeur, C. Helms, M. Frank, M. MacCollin, R. Scheinman, and T. Frank, 1986, "Random-clone strategy for genomic restriction mapping in yeast," Proceedings of the National Academy of Sciences USA 83, 7826-7830. Ott, J., 1991, Analysis of Human Genetic Linkage, Baltimore, Md.: Johns Hopkins University Press. Su, L.K., K.W. Kinzler, B. Vogelstein, A.C. Preisinger, A.R. Moser, C. Luongo, K.A. Gould, and W.F. Dove, 1992, "A germline mutation of the murine homolog of the APC gene causes multiple intestinal neoplasia," Science 256, 668-670. Thompson, E., and E. Wijsman, 1990, The Gibbs Sampler on Extended Pedigrees: Monte Carlo Methods for the Genetic Analysis of Complex Traits, Technical Report 193, Department of Statistics, University of Washington, Seattle.