National Academies Press: OpenBook

Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology (1995)

Chapter: Excursion: Designing a Strategy to Map the Human Genome

Suggested Citation: "Excursion: Designing a Strategy to Map the Human Genome." National Research Council. 1995. Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology. Washington, DC: The National Academies Press. doi: 10.17226/2121.


MAPPING HEREDITY: USING PROBABILISTIC MODELS AND ALGORITHMS TO MAP GENES AND GENOMES

Figure 2.9 Schematic diagram illustrating the principle of STS content mapping. Various unique points in the genome, called STSs, are tested against a collection of random large-insert clones, such as YACs, to determine which STSs are contained in which YACs. Based on the resulting adjacency matrix, one attempts to reconstruct the order of the STSs in the genome. "Contigs," consisting of groups of STSs connected by YACs, are assembled based on the adjacency data. In the figure, the STSs can be grouped into two contigs.

Mathematical analysis is thus essential to the design and execution of physical mapping projects (Arratia et al., 1991; Lander and Waterman, 1988). This is illustrated in a discussion below of the considerations involved in making a physical map of the entire human genome.

Excursion: Designing a Strategy to Map the Human Genome

Under the auspices of the Human Genome Project, our laboratory is engaged in constructing complete physical maps of the mouse and human genomes, each about $3 \times 10^9$ base pairs in length. The task is daunting, requiring analysis of tens of thousands of clones, each carrying extremely large DNA fragments. Before undertaking such a project, it was crucial to perform careful analysis to identify the best strategy.

Currently, the best clones for making a human physical map are yeast artificial chromosomes (YACs). A good YAC library might contain inserts of about 1 million base pairs in length. Even with such large inserts, it would take 3,000 YACs to cover the human genome if they were laid end-to-end. Of course, clones taken from an actual library will be arrayed randomly, and so considerably more clones are required to ensure coverage.

As noted above, the best fingerprint for studying YACs is STS content mapping. Each STS is screened simultaneously against the entire YAC library to identify the clones that contain it.
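The remark above that randomly arrayed clones require considerably more than the end-to-end minimum of 3,000 YACs can be made concrete with the standard Poisson approximation for random placement: if the expected number of clones covering a given base is a, that base is missed by every clone with probability about e^(-a). The sketch below assumes clones of constant length placed uniformly at random and ignores edge effects; the function names and default values are illustrative and do not come from the text.

```python
import math

def fraction_covered(num_clones, clone_len=1e6, genome_len=3e9):
    """Expected fraction of the genome covered by at least one clone,
    assuming constant-length clones placed uniformly at random
    (Poisson approximation, edge effects ignored)."""
    coverage = num_clones * clone_len / genome_len  # expected clones per base
    return 1.0 - math.exp(-coverage)

def clones_needed(target_fraction, clone_len=1e6, genome_len=3e9):
    """Smallest number of clones whose expected covered fraction
    reaches target_fraction under the same assumptions."""
    coverage = -math.log(1.0 - target_fraction)
    return math.ceil(coverage * genome_len / clone_len)

if __name__ == "__main__":
    for n in (3000, 9000, 18000):
        print(n, round(fraction_covered(n), 4))
    print(clones_needed(0.99))
```

Under these assumptions, 3,000 random clones would be expected to cover only about 63 percent of the genome, which is why a several-fold excess of clones is needed.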
Because STSs are screened in parallel, it is most efficient to work with a fixed YAC library and to test STSs sequentially.

For mathematical analysis of physical mapping, the YACs and STSs can be abstracted to a set $I$ of intervals (which may vary in size) and a set $P$ of points distributed randomly along a line segment. An interval is said to be anchored if it contains at least one point $p \in P$. Two anchored intervals $I_1$ and $I_2$ are said to be connected if there is a point $p \in P$ contained in their intersection. Note that two intervals may overlap but fail to be connected. If we take the transitive closure of the connectivity relation, the resulting equivalence classes of anchored intervals are called anchored "contigs." (For the purpose of the exposition, a definition is used that differs slightly from that in Arratia et al. (1991), in which contigs refer only to equivalence classes containing at least two intervals.)

The key question is: How many intervals and how many points should be analyzed to construct a reasonably complete physical map, that is, one in which the vast majority of the genome is contained in a modest number of large contigs?

We define the following notation:

$G$, the length of the genome in base pairs;
$L$, the length of a random clone in base pairs, a random variable;
$\bar{L}$, the expected length of a random clone, $\bar{L} = E(L)$;
$N$, the number of clones to be used;
$M$, the number of STSs to be used;
$a = \bar{L}N/G$, the expected number of clones covering a random STS; and
$b = \bar{L}M/G$, the expected number of STSs contained in a random clone.

Clone lengths $L$ will be assumed to be independent, identically distributed random variables, with the probability density function of the normalized length $l = L/\bar{L}$ denoted by $f(l)$ and the complementary cumulative distribution function (also called the survival function) denoted $\mathcal{F}(x) = P(L/\bar{L} > x)$. It is also useful to define the auxiliary function

$$J(x) = \exp\left\{-a \int_x^\infty \mathcal{F}(l)\, dl\right\},$$

which can be interpreted as the probability that two points separated by distance $x$ (measured in units of $\bar{L}$) are not covered by a common clone.

The problem belongs to the area of coverage problems, which treat processes of covering a space with random sets of a given sort. Often, mathematical authors focus on the goal of attaining complete coverage. Such results are not really appropriate from a biological standpoint, because they depend sensitively on the distribution of covering sets being absolutely random, an assumption that is biologically implausible. Instead, it is more sensible to focus on central behavior, that is, the goal of covering most of the space.

STS content mapping poses a slightly unusual coverage problem, because the definition of coverage involves joining together random intervals with random points. It is nonetheless possible to analyze many features of the stochastic process and to derive prescriptive results. Arratia and colleagues (1991) proved the following result, which describes the basic coverage properties:

Proposition: With the notation as above,

(1) the expected number of anchored contigs is $N p_1$, where [equation missing from this text];
(2) the expected length of an anchored contig is $\lambda E(L)$, where [equation missing from this text]; and
(3) the expected proportion $r_0$ of the genome not covered by anchored contigs is [equation missing from this text].
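For clones of constant length, part (1) of the proposition can be explored numerically. The sketch below is a reconstruction under stated assumptions, not a formula quoted from the source. It assumes that, for constant-length clones, two points at normalized distance x <= 1 share no clone with probability exp(-a(1 - x)), and that a clone is the rightmost member of an anchored contig exactly when it contains an STS and no other clone covers both its rightmost STS and its right end. Integrating over the distance x from the right end to that STS, which has density b exp(-bx), gives a candidate value for p1. All function names are hypothetical.

```python
import math

def p1_closed_form(a, b):
    """Assumed closed form for constant-length clones (requires a != b):
    p1 = b * (exp(-b) - exp(-a)) / (a - b)."""
    return b * (math.exp(-b) - math.exp(-a)) / (a - b)

def p1_numeric(a, b, steps=100_000):
    """Midpoint-rule integral over 0 <= x <= 1 of
    b*exp(-b*x) * exp(-a*(1 - x)), the assumed integrand for p1."""
    h = 1.0 / steps
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) * h
        total += b * math.exp(-b * x) * math.exp(-a * (1.0 - x))
    return total * h

def expected_contigs(num_clones, a, b):
    """Expected number of anchored contigs, N * p1, under the same assumptions."""
    return num_clones * p1_closed_form(a, b)

if __name__ == "__main__":
    print(p1_closed_form(6, 3), p1_numeric(6, 3))
    print(expected_contigs(18000, 6, 3))
```

With a = 6, b = 3, and 18,000 clones this sketch yields a little over 850 contigs, which agrees with the figure of about 850 anchored contigs quoted in the text for those parameters.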

Figure 2.10, taken from Arratia et al. (1991), plots these functions for the case of clones of constant size. From these graphs, experimentalists can plan their experimental approach. For our own physical mapping project in the human genome, the typical clone size is about $1 \times 10^6$ base pairs. Based on the trade-offs between screening more YACs and using more STSs, we selected a = 6 and b = 3, corresponding to about 18,000 YACs and about 9,000 STSs. This selection should ensure that about 99 percent of the genome is covered (that is, $r_0 \approx 0.01$), with about 850 anchored contigs having average length of about 3.5 megabases.

Having explored the question of experimental design, it is worth briefly discussing the issues involved in data analysis. The process of STS content mapping may consume several person-years of laboratory work, but the final result will simply consist of a large (18,000 × 9,000) adjacency matrix $A = (a_{ij})$, with $a_{ij} = 1$ or $0$ in position $(i, j)$ according to whether YAC $i$ contains STS $j$. Based on this information, how do we determine the correct order of the STSs in the genome?

In principle, a proposed order of the STSs is consistent with the observed data if and only if permuting the columns of the adjacency matrix A according to this order causes A to have the consecutive ones property: in each row, the ones occur in a single consecutive block. This property follows from the fact that each YAC should consist of a single connected interval taken from the genome (see Figure 2.9). The consecutive ones property has been extensively studied in computer science.
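The consistency criterion just described is easy to state in code. The sketch below checks whether a given STS order gives every row of a small 0/1 matrix a single block of ones, and then finds all consistent orders of a toy matrix by brute force. It is meant only to illustrate the definition: the exhaustive search is exponential in the number of STSs and is not an efficient algorithm, and the function names and the toy matrix are invented for the example.

```python
from itertools import permutations

def has_consecutive_ones(matrix, column_order):
    """Return True if, with columns arranged in column_order, the ones in
    every row form one contiguous block (rows without ones pass trivially)."""
    for row in matrix:
        positions = [k for k, col in enumerate(column_order) if row[col] == 1]
        if positions and positions[-1] - positions[0] + 1 != len(positions):
            return False
    return True

def consistent_orders(matrix):
    """Brute-force list of all column orders with the consecutive ones
    property. Feasible only for a handful of columns."""
    n_cols = len(matrix[0])
    return [order for order in permutations(range(n_cols))
            if has_consecutive_ones(matrix, order)]

if __name__ == "__main__":
    # Rows are YACs, columns are STSs 0..3 (toy data).
    toy = [
        [1, 0, 1, 0],  # YAC 0 contains STS 0 and STS 2
        [0, 1, 1, 0],  # YAC 1 contains STS 1 and STS 2
        [0, 1, 0, 1],  # YAC 2 contains STS 1 and STS 3
    ]
    print(consistent_orders(toy))
    # One spurious row linking STS 0 and STS 3 removes every consistent order.
    print(consistent_orders(toy + [[1, 0, 0, 1]]))
```

For the toy matrix the only consistent orders are 0, 2, 1, 3 and its reversal, as expected, since a physical map is determined only up to orientation.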
Booth and Lueker (1975) devised an elegant linear-time algorithm for solving the problem in a very strong sense: Given a (0,1)-matrix A with n rows and m nonzero entries, the algorithm needs a running time of only O(m + n) to determine whether there is any column permutation causing the matrix to have the consecutive ones property and, if so, to produce a simple representation of all such column permutations.

In practice, there is a serious problem with this approach: it assumes that the data are absolutely error-free. However, laboratory work is never flawless and certainly not when the task involves filling in 162 million entries in an adjacency matrix. If even a few errors are present, the Booth-Lueker algorithm is almost certain to report that there is no consistent order! In fact, there are likely to be many errors, including


As researchers have pursued biology's secrets to the molecular level, mathematical and computer sciences have played an increasingly important role—in genome mapping, population genetics, and even the controversial search for "Eve," hypothetical mother of the human race.

In this first-ever survey of the partnership between the two fields, leading experts look at how mathematical research and methods have made possible important discoveries in biology.

The volume explores how differential geometry, topology, and differential mechanics have allowed researchers to "wind" and "unwind" DNA's double helix to understand the phenomenon of supercoiling. It explains how mathematical tools are revealing the workings of enzymes and proteins. And it describes how mathematicians are detecting echoes from the origin of life by applying stochastic and statistical theory to the study of DNA sequences.

This informative and motivational book will be of interest to researchers, research administrators, and educators and students in mathematics, computer sciences, and biology.
