MAPPING AND SEQUENCING THE HUMAN GENOME

Sequencing

The nucleotide sequence of a genome is its physical map at the highest level of resolution. It provides all the information that goes into making up an individual's genetic complement, and no two individuals (except identical twins) share the same genome sequence. Rather, every human contains a duplicate copy of every chromosome, with about a 1.0 percent difference between the sequence of each of his or her two homologous chromosomes (that is, the average person in a population is heterozygous for about 1.0 percent of the nucleotide pairs, or approximately 30 million pairs). These differences have arisen by mutations accumulated over the course of evolutionary time, and most of them do not affect the normal functions of the individual.

Any sequence derived from the human genome will be a prototype: a blueprint that will lay out the basic organization and sequence of the genes on the chromosomes. This prototype may be derived by forming a composite of regional sequences from many individuals; it need not represent the complete sequence of any one person. The nature of individual variation will become apparent when regions of interest are compared among individuals.

In 1971, the first nucleotide sequence was obtained directly from DNA with the determination of the 12-nucleotide-long cohesive ends of bacteriophage λ (Wu and Taylor, 1971). Since that time, with the advent of rapid techniques, about 15 million nucleotides of DNA sequence have accumulated in the GenBank database (see Chapter 6), of which over 2 million are from human DNA (Howard Bilofsky of Bolt Beranek and Newman, Inc., personal communication, 1987). This figure represents approximately 0.07 percent of the human
genome. Thus, although human genomic sequencing has already begun, unless a special effort is initiated, the entire sequence will not be available for many decades, if ever.

WHY SEQUENCE THE ENTIRE HUMAN GENOME?

There is general agreement in the biological sciences community that a physical map of the human genome, as represented in a set of cloned overlapping DNA fragments, is a worthwhile goal. There is much less consensus on the advisability of embarking on the determination of major amounts of its nucleotide sequence. Three kinds of reservations are often voiced:

· Since the amount of useful protein-coding information in the genome is estimated to be 5 percent or less, a great deal of effort would be expended in determining the order of nucleotides of no apparent significance. If massive amounts of sequencing are to be done, why not sequence only large libraries of cDNAs instead?
· Even if only cDNAs were sequenced, we would lack the ability to utilize the vast amount of sequence information generated. The problem will be even worse with a complete genome sequence. Therefore, the limited amount of usable knowledge gained is not worth the anticipated cost.
· Even if the project is worthwhile, the intensive effort required will divert funds from other research aimed at understanding the structure and function of genes in all organisms, and, therefore, there will be a net loss rather than a net gain of important biological information.

To address the first point, one must consider whether it would be less difficult to identify, before sequencing, the 5 percent of the genome that actually encodes proteins than to sequence the entire genome. Given the present state of sequencing technology, this is certainly the case; for this reason, we would anticipate that most human genome sequencing in the immediate future will be carried out on cDNA clones, which represent the expressed DNA sequence.
However, it seems fair to assume that by the time sequencing begins on a massive scale, the technology will have matured so far that inserting a preliminary step that discriminates among genes, intergenic regions, and introns (which will presumably involve sorting out all the repeated isolates of the same DNA clones) will be less efficient than sequencing large regions from ordered genomic DNA clone libraries without reference to their contents. This, of course, assumes
major technological advances in sequencing, as will be described subsequently.

Another reason to sequence more than just cDNAs is that sequencing the entire genome is certain to reveal unsuspected sequences having important functions. For example, one of the great challenges of a genome sequencing project is to identify potentially important functional domains involved in gene regulation and chromosome organization. The identities of such sequences will be elicited by multiple analytical approaches and will require sequence comparisons between the analogous intergenic regions in multiple species (including human versus mouse) and the recognition of unusual patterns of sequence within a single organism. As one example, the comparative sequencing of 3,500 nucleotides of a regulatory region from the engrailed gene of two different Drosophila species has revealed the presence of more than 50 short blocks of evolutionarily conserved sequences, most of which are suspected to represent the binding sites for different gene regulatory proteins (J. Kassis and P. O'Farrell, University of California, San Francisco, personal communication, 1987). Determining the function of each of the sequences will require experimental testing based on the sequence analysis, which can pinpoint even short sequences that deserve serious investigation by virtue of their conservation during evolution.

Sequence comparisons among different species will pick up genes readily as evolutionarily conserved sequences in the genomes. However, such comparisons are rarely necessary for picking out coding sequences since existing analytical tools are adequate to identify them within a single DNA sequence. The standard procedure is to use computer programs to identify open reading frames, which are regions of nucleotide sequence lacking the stop codons that terminate a protein sequence.
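As a rough illustration of the open-reading-frame search just described, the following Python sketch scans the three forward reading frames of a sequence and reports stop-free runs of codons. The function name `open_reading_frames`, the `min_codons` threshold, and the restriction to a single strand are assumptions made for this illustration; they are not taken from any program cited in the text.

```python
# Stop codons of the standard genetic code, written as DNA triplets.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def open_reading_frames(seq, min_codons=30):
    """Return (frame, start, end) for each stop-free run of codons.

    Only the three forward frames of the given strand are scanned
    (a simplification); `end` is exclusive, and runs shorter than
    `min_codons` codons are ignored.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        run_start = frame
        pos = frame
        while pos + 3 <= len(seq):
            if seq[pos:pos + 3] in STOP_CODONS:
                # A stop codon closes the current stop-free run.
                if (pos - run_start) // 3 >= min_codons:
                    orfs.append((frame, run_start, pos))
                run_start = pos + 3
            pos += 3
        # A run may also extend to the end of the sequence.
        if (pos - run_start) // 3 >= min_codons:
            orfs.append((frame, run_start, pos))
    return orfs
```

A stop-free run is only a candidate coding region; this sketch does not apply codon usage patterns or any other evidence.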
Practical experience shows that this information, when combined with codon usage patterns and other characteristics, allows one to identify virtually all genes in a nucleotide sequence, even though short exons will occasionally be missed (Staden and McLachlan, 1982). Accurate specification of the coding sequences can then be obtained by a standard experimental analysis of the corresponding clones in a collection of cDNAs. Some of the genes discovered will have immediate significance to the biomedical community because they are associated with a disease. Many others will be analyzed and found to contain homologies to existing gene families, an immediate clue to their possible function. As more gene sequences are determined, such relations among genes will be found with increasing frequency (as has recently happened, for example, among genes that encode cell surface proteins that bind specific protein
molecules that are involved in cell signaling), and entirely new gene families will be identified as well (Doolittle et al., 1986).

Critics will rightly point out that a complete human genome sequence will make such a huge number of genes (perhaps as many as 100,000) directly accessible that the function of the vast majority of them will remain unknown for many decades after the genome has been completely sequenced. Why then should one devote extra resources to speeding up the completion of the sequencing effort? The committee feels that much is to be gained from having a complete catalog of human gene sequences that does not require knowing the function of most of the individual genes. For example, scientists interested in the signaling actions of cyclic nucleotides will immediately be able to recognize a large group of genes that are likely to produce proteins that bind cyclic nucleotides. Specific antibodies can be prepared to each of these proteins and used to test for the role of each of them in any signaling pathway of interest. Whole families of proteins that are likely to mediate the signaling effect of calcium ions can be identified in a similar way. Likewise, a large group of candidate human genes will be immediately available as potential analogs of any newly discovered yeast, nematode, or Drosophila protein, for example. Other novel uses of the genome sequence data, unforeseen at present, will be developed by individual scientists, just as many of the most important current uses of recombinant-DNA technology were not foreseen by its early developers. In short, we anticipate that the genome sequence will serve as a basic "dictionary" that catalyzes striking advances in our understanding of cells and organisms.
In response to the third criticism, the committee specifically recommends that the sequence of the human genome be determined in parallel with analogous sequencing of the genomes of the other organisms needed to interpret the human data. Thus, the basic research on these model organisms should be closely integrated with data on humans. In addition, the project must have independent funding so that it does not divert funds from ongoing basic studies, particularly those trying to understand the function of genes in all organisms, because it is ultimately such research that will make the information on the human genome interpretable.

Finally, a concerted sequencing effort will benefit a wide range of biological investigations. By pushing the development of sequencing technology and establishing sequencing centers, the project will make inexpensive sequencing available to anyone who has a legitimate need for it. In this way the envisioned project will free individual laboratories from the routine and currently labor-intensive effort of sequencing their few genes of interest, a necessary prelude to studies of gene
function and regulated expression. It is extremely inefficient for each laboratory to set up the facilities needed to sequence the 100,000 to 1 million nucleotides that it finds of interest. Rather, the recommended project grows out of the recognition that elucidating nucleotide sequences (as distinct from sequence analysis) is ideally an exercise of production, not of research.

Accumulating large amounts of DNA sequence data will have an impact on the biological community in other ways as well. The information contained within the genome sequence will allow full investigation into the nature and extent of polymorphism, or diversity (see Chapter 4), in the genes in the human population. Once genes with widespread diversity (such as the major histocompatibility antigens and T-cell receptor genes) are identified, comparative sequencing of a single gene or gene family in many individuals will naturally follow.

Finally, the availability of structural information on a variety of genes will stimulate efforts to correlate protein coding domains, or exons, with protein folding domains. It has been proposed that the segments of proteins encoded by individual exons arose during evolution as small protein units capable of independent folding and that they have assembled into multifunctional proteins as independent domains (Gilbert, 1978, 1985). By studying these correlations, one may learn much about the rules that govern the secondary and tertiary structure of proteins. Such spin-offs will be of great value to the biological community and are meant to augment its activities, not to detract from them.

CURRENT TECHNOLOGY IN DNA SEQUENCING: CHEMICAL AND ENZYMATIC METHODS

Any project to sequence a large genome with many repeated sequences would not start with short, randomly selected genome fragments, even though this is the easiest way to obtain a large amount of sequence information quickly.
Most of the sequences obtained in this way would be short (perhaps 200 to 600 nucleotides), and millions of gaps would be left between them. Because most genes in humans extend for many thousands of nucleotides (Table 2-1), little information of biological value can be obtained from a collection of such short sequences. For this reason, sequencing would normally begin with a large cloned segment of DNA that would be sequenced completely. Such a DNA segment must first be subcloned into smaller, more manageable fragments. This can be done by one of three methods:

· Generate a detailed restriction map, and determine from the map the identity of each subclone and its relation to the whole.
· Beginning at one end of the large segment, generate a series of successively smaller DNA fragments by a limited removal of nucleotides from the end with exonucleases (enzymes that hydrolyze the phosphodiester bonds that join nucleotides together starting at a chain end); clone the remaining DNA to produce a series of clones of known origin.
· Generate a totally random series of overlapping subclones, whose relationship to one another is revealed only after their sequencing.

Large sequencing projects often mix all three strategies. One sometimes begins by randomly sequencing fragments and follows with directed sequencing of specific subclones as the gaps are located. All sequencing strategies require some redundancy in the form of overlapping sequences in order to merge the results of several determinations from different subclones and to provide a check on the accuracy of the sequence, which requires the sequencing of both DNA strands as a cross-check on systematic errors. The subcloning method will determine to a large degree the amount of redundancy in the data. Although time-consuming during the subcloning process, the first and second subcloning methods ultimately require that any single segment be sequenced only about three times. The third method, because one is sequencing subclones at random, generally requires that each segment be sequenced approximately 10 times; however, methods are available to specifically select missing clones, after a three-fold coverage, which reduce the amount of redundant sequencing (Sanger et al., 1982).

The ability to sequence large stretches of DNA became a reality in the middle to late 1970s with the independent development of two techniques. One of these, developed by Sanger and his colleagues at the Medical Research Council in Cambridge, England, is a method called enzymatic sequencing (Sanger et al., 1977).
The unknown sequence is subcloned into a single-stranded DNA virus, and DNA synthesis is initiated from a primer sequence adjacent to the unknown sequence. This method utilizes the principle that when appropriately designed chain-terminating analogs of the four DNA nucleotides (A, G, C, and T) are incorporated into DNA by DNA polymerase, synthesis of the growing DNA chain is terminated. For example, if the synthesis of DNA molecules begins at a fixed point on a template in the presence of a low concentration of the A analog, the analog will infrequently be incorporated instead of the normal A nucleotide at any one position. However, when incorporation occurs, the synthesis of the chain stops. Thus a nested set of DNA fragments that terminate at every A nucleotide in the unknown sequence is generated.
By correlating the length of the terminated chains with the identity of the base analog that was present in the reaction, one can determine the order of the nested DNA fragments and, hence, the corresponding nucleotide sequences (Figure 5-1). At present, this method dominates DNA sequencing applications primarily because once the subclones are generated the procedure involves only a few simple steps.

The second technique, which is referred to as chemical sequencing, was developed by Maxam and Gilbert at Harvard University (Maxam and Gilbert, 1977). It uses chemicals that break the DNA chain at specific nucleotides. The DNA molecule is labeled at one end with a radioactive tag. It is then cleaved with each chemical separately in such a way as to generate breaks infrequently at any given nucleotide. As in the enzymatic sequencing technique, the DNA fragments are separated according to size, and the sizes are correlated with the nucleotide that is cleaved (Figure 5-2). This method is generally more time-consuming than the enzymatic sequencing method, but it often produces fewer ambiguities in the interpretation of the data.

Both methods generate mixtures of specific DNA fragments that are separated by polyacrylamide gel electrophoresis, a technique that can resolve fragments that differ in size by a single nucleotide. When radioactively labeled DNA fragments are used, they are detected by exposing the gel to an x-ray film. That film, which has imprinted upon it a ladder of bands distributed over four parallel lanes representing the four nucleotides of DNA, must be interpreted or read by an experienced person and the data must be entered into a computer. Machines have been developed to expedite this process through the use of a stylus attached to a computer that points to each band on the x-ray film. The computer then registers the position and translates it into one of the four nucleotides of DNA.
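The logic of reading a four-lane ladder can be sketched in a few lines of Python. This is an idealized illustration, not a description of any actual gel-reading instrument: it assumes every band is resolved perfectly, and the function names `terminated_lengths` and `read_ladder` are hypothetical. It uses the ten-nucleotide example strand GCAGATACGC from the chapter's figures.

```python
def terminated_lengths(strand, base):
    """Lengths of the labeled chains that end at each occurrence of `base`.

    In the enzymatic method these are the chains stopped by the
    chain-terminating analog of `base`; each one forms a band in that
    base's lane.
    """
    return [i + 1 for i, b in enumerate(strand) if b == base]


def read_ladder(lanes):
    """Reconstruct a sequence from four lanes of fragment lengths.

    `lanes` maps each nucleotide to the fragment lengths seen in its lane.
    Reading the bands from shortest to longest fragment, and noting which
    lane each band lies in, spells out the sequence.
    """
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


strand = "GCAGATACGC"
lanes = {base: terminated_lengths(strand, base) for base in "ACGT"}
# lanes["A"] is [3, 5, 7]: chains ending at the three A residues.
recovered = read_ladder(lanes)
```

On a real gel, bands can merge or run anomalously, which is why the reading step described above still requires an experienced person.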
Attempts are now under way to develop x-ray film scanners capable of reading such films directly. In addition, automatic methods that use fluorescent labels have been introduced. It is critical that other strategies for reducing the human labor and error involved in this process be developed if the human genome is to be sequenced in a timely manner.

THE DIFFICULTY OF DETERMINING THE SEQUENCE OF THE HUMAN GENOME WITH CURRENT TECHNOLOGY

What constrains efforts to embark immediately on a large-scale human genome sequencing project? The cost and inefficiency of current DNA sequencing technologies are too great to make it feasible to contemplate determining the 3 billion nucleotides of the DNA sequence in the human genome within a reasonable time. The largest

[Figure 5-1 diagram not reproduced. Steps labeled in the diagram: the DNA fragment is denatured to separate its strands; a short end-labeled oligonucleotide is annealed to one strand; DNA synthesis primed by the oligonucleotide is carried out in the presence of a small amount of the indicated chain-terminating dideoxyribonucleoside triphosphate, so that all of the labeled strands in each tube end with the corresponding nucleotide; parallel gel electrophoresis and autoradiography separate the labeled fragments of differing length. Sequence of end-labeled strand: (5')GCAGATACGC(3').]

FIGURE 5-1 DNA sequencing by the enzymatic method. The key to this method is the use of a dideoxyribonucleoside triphosphate that blocks the addition of the next nucleotide after its incorporation into the growing chain. The primed in vitro synthesis of DNA molecules in the presence of a minor proportion of a single type of such a chain-terminating nucleotide generates a family of DNA fragments each of which ends in the particular chain-terminating nucleotide (see also Figure 5-3). Here a radioactive DNA primer is used to initiate the synthesis of such DNA fragments and four different synthesis reactions, each with a different chain-terminating nucleotide, are analyzed by electrophoresis in four parallel lanes of a gel. The DNA sequence is then determined from the electrophoresis pattern.

[Figure 5-2 diagram not reproduced. Steps labeled in the diagram: the ends of the DNA fragment are labeled; the fragment is cut with a restriction enzyme and the pieces are separated; the end-labeled strand is isolated; each sample is exposed to a different chemical reaction that breaks DNA after the indicated nucleotide (lanes A, G, C&T, and C), with the reaction proceeding long enough to produce an average of one break per strand, so that the random breaks generate fragments representing all positions of the indicated nucleotide; parallel gel electrophoresis and autoradiography follow. Sequence of end-labeled strand: (5')GCAGATACGC(3').]

FIGURE 5-2 DNA sequencing by the chemical method. A DNA fragment that is radioactive only at its 5' end is the material to be sequenced. A different chemical reaction in each of four samples breaks the DNA fragment only (or mainly) at A, G, both C and T, and C residues, respectively. The labeled DNA subfragments created by these reactions all have the label at one end and the cleavage point at the other. Electrophoresis of each sample through a polyacrylamide gel then allows each DNA subfragment to be separated according to its size. After autoradiography of the gel, the four sets of labeled subfragments (one set per gel lane) together yield a radioactive band for each nucleotide in the original DNA fragment. Adapted, with permission, from Darnell et al. (1986).

contiguous segment of human DNA determined to date is the 150,000 nucleotides encoding the human growth-hormone gene. This is 0.005 percent of the total genome.

Some other numbers are informative in this context. Currently, a skilled laboratory worker in a well-equipped facility can produce from about 50,000 nucleotides of finished DNA sequence per year (B. Barrell, Medical Research Council, Cambridge, personal communication, 1987) to about 100,000 nucleotides of finished sequence per year (E. Chen, Genentech, personal communication, 1987). The cost of this sequence ranges from $1 to $2 per nucleotide, an estimate based on the assumption that one worker costs a laboratory approximately $100,000 per year, including salary, supplies, and overhead. Even at the upper estimate of 100,000 nucleotides sequenced per person per year (which has not yet been achieved in a sustained effort), determining the human genome sequence would require 30,000 person-years of work at a cost of $3 billion. Since the sequencing of the genomes of other species is essential for an understanding of the human genome, the actual amount of sequencing would approach 6 billion nucleotides, at a current cost of $6 billion.

This high cost reflects the fact that the endeavor is still highly labor intensive; the estimate does not account for unforeseeable technical problems or technical improvements. Most of the time spent in a sequencing project is occupied with obtaining the original DNA clones containing the gene of interest and subcloning and handling the DNA prior to performing the actual sequencing reactions, steps that have not yet been streamlined or automated. In addition, the entire process from subcloning to interpreting gels requires careful supervision of personnel; a ratio of no more than three technicians for each doctoral-level scientist is generally accepted as optimal.
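The person-year and cost estimates given earlier in this section follow from simple arithmetic, reproduced in the Python sketch below. The figures are those quoted in the text (3 billion nucleotides, 50,000 to 100,000 nucleotides per worker per year, $100,000 per worker per year); the function name `sequencing_effort` is a hypothetical label for this illustration.

```python
GENOME_SIZE = 3_000_000_000      # nucleotides in the human genome
COST_PER_WORKER_YEAR = 100_000   # dollars: salary, supplies, and overhead


def sequencing_effort(nucleotides, rate_per_worker_year):
    """Return (person_years, total_cost_dollars, cost_per_nucleotide)."""
    person_years = nucleotides / rate_per_worker_year
    total_cost = person_years * COST_PER_WORKER_YEAR
    return person_years, total_cost, total_cost / nucleotides


if __name__ == "__main__":
    for rate in (50_000, 100_000):
        years, cost, per_nt = sequencing_effort(GENOME_SIZE, rate)
        print(f"{rate:>7,} nt/yr: {years:,.0f} person-years, "
              f"${cost:,.0f} total, ${per_nt:.2f} per nucleotide")
```

At the upper rate this gives 30,000 person-years and $3 billion, or $1 per nucleotide; at the lower rate the cost per nucleotide doubles to $2, matching the range quoted above.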
The rate of DNA sequence determination is also limited by the fact that all techniques currently use polyacrylamide gels that resolve no more than 250 to 500 nucleotides at a time. At this level of resolution, hundreds of millions of individual sequence determinations would be required to complete the human genome, given the estimate that each sequence will need to be determined three times over. By increasing the length of the average contiguous sequence that can be determined on a single gel, considerable time and effort would be saved.

THE ACCURACY OF DNA SEQUENCING

Unless the human genome sequence is determined accurately, it will be of little use. Errors in DNA sequence determination occur at
several levels. The most common is caused by insufficient resolution of adjacent DNA fragments in gel electrophoresis because of compression in their migration (neighboring bands merge into one). These effects are especially prevalent in regions containing large numbers of G and C nucleotides. Aberrations in the sequencing reactions can also occur in stretches of unusual sequence. These problems are compounded by human error, such as when researchers attempt to guess the sequence in ambiguous regions and when sequence gels are read past the point of accurate resolution. Another common source of human error occurs in transcribing the data into the computer. One potential source of error that will become more common as large-scale sequencing is attempted resides in the presence of short, highly repetitive sequences in human DNA, which can be confused when they occur in multiple clones. Furthermore, the cloning process itself may introduce a few errors.

The accuracy of DNA sequencing has not yet been firmly established. A careful and experienced laboratory probably achieves an accuracy of about one error in every 5,000 nucleotides (0.02 percent error rate) in the finished DNA sequence, but this degree of precision requires careful attention to virtually every nucleotide in the sequence (E. Chen, Genentech, personal communication, 1987). Such attention inevitably slows the sequencing rate. It will be difficult to hold the error rate to this level in a large-scale nucleotide sequencing project. Although a few investigators have achieved a 0.02 percent error rate, most careful workers can only achieve an error rate of 0.1 percent.

It is important to consider the impact of this error rate in the sequence of the human genome. Although it might seem large, the committee believes it is tolerable. The estimated level of DNA sequence heterozygosity among individuals is about 1.0 percent.
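To put these rates side by side, the short Python sketch below converts each one into an expected count over a 3-billion-nucleotide genome. It is only a back-of-the-envelope calculation using the rates quoted above; the name `expected_count` is a hypothetical label for this illustration, and errors are assumed to be spread evenly.

```python
GENOME_SIZE = 3_000_000_000  # nucleotides, the figure used in this chapter


def expected_count(nucleotides, one_in):
    """Expected number of events when one occurs per `one_in` nucleotides."""
    return nucleotides // one_in


errors_at_best_rate = expected_count(GENOME_SIZE, 5_000)     # 0.02 percent
errors_at_typical_rate = expected_count(GENOME_SIZE, 1_000)  # 0.1 percent
heterozygous_sites = expected_count(GENOME_SIZE, 100)        # 1.0 percent
```

Under these assumptions a 0.1 percent error rate corresponds to about 3 million errors, one-tenth of the roughly 30 million positions at which an individual's two chromosome copies already differ.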
The errors in the DNA sequence will be randomly placed, and hence most will occur outside coding sequences. Those errors in coding regions of genes that are either insertions or deletions of nucleotides (as most sequencing errors are) will have profound effects in that they will cause the reading frame to shift. This could lead to a failure to identify an exon as coding for part of a protein. If we consider that the average coding region (exon) is approximately 200 nucleotides long, one can anticipate that, at an error rate of 0.1 percent, an error will occur on average in one of every five exons. The detection of some of these errors in exons may be facilitated by computer programs that predict coding regions on the basis of the use of particular sets of three nucleotides (codons) that code for each amino acid in humans. However, the errors will eventually be identified with certainty only by those interested in that region of the genome. This analysis puts into perspective the need to
aim for approximately 0.1 percent as the maximum acceptable error rate in the initial sequence produced.

EMERGING AND FUTURE TECHNOLOGY

The obvious mismatch between the efficiency of current DNA sequencing technology and the genetic complexity of genomes in even the simplest cells has given rise to several research projects aimed at developing more efficient sequencing methods. We seem to be on the threshold of a new generation of sequencing methods that should make large-scale sequencing projects more practical. Given the emergent state of these technologies, however, it is not surprising that expert opinion is widely divided on several key questions.

· Which of several next-generation strategies will prove most effective?
· Will the best next-generation strategy represent a quantum jump in sequencing capability or an incremental improvement that largely decreases the tedium of sequencing and shifts costs from skilled labor to instruments?
· In looking ahead to the need for a series of cumulative 5- to 10-fold increases in sequencing capability, is the future likely to lie in scaling up automated techniques that are already at the prototype stage, or does it lie in revolutionary new methods?

These questions will remain unanswered until future large-scale projects have been completed. Particularly crucial will be a determination of the steps in a sequencing project that become rate-limiting as the goals of sequencing are increased. No foreseeable technology will be able to automate DNA sequencing comprehensively. DNA sequencing involves a complex series of experimental steps with very different prospects for automation. For this reason, a given incremental increase in efficiency at any one step will rarely result in a comparable increase in overall efficiency.

Several current research projects aimed at automating various steps of sequencing are at different stages of development, and they illustrate the range of approaches being tested.
They not only call for different technical strategies, but to various degrees they also reflect different perceptions of the steps in DNA sequencing most in need of greater efficiency.

Several groups [California Institute of Technology, DuPont, and the European Molecular Biology Laboratory (EMBL)] are adapting the basic enzymatic sequencing methodology to a more automated operation at the level of reading the sequencing gels. Others are developing automated film readers, which are less expensive and not
limited by the slow rate of electrophoresis (Elder et al., 1985). Radioactive labeling of the fragments has been replaced by the use of fluorescent tags, which can be detected in the gel by characteristic light emissions evoked by laser illumination. The Cal Tech and DuPont methods allow more efficient use of the polyacrylamide gel, since the four reaction mixtures representing the four DNA nucleotide precursors can be labeled with different tags and then mixed together before being fractionated on a single gel lane (Figure 5-3) (Smith et al., 1986). Alternatively, the EMBL method uses a single fluorescent tag for all four nucleotides and runs four gel lanes, which are monitored simultaneously with radioactive tags. In both cases, fragments are detected as they migrate past the point of laser illumination at the bottom of the gel, which eliminates the need to expose, develop, and interpret x-ray film. In each case, multiple sequences can be simultaneously analyzed on a single gel. The immediate goal of these projects is to develop a commercial instrument capable of sequencing 15,000 nucleotides per day, starting from the appropriate reaction mixtures.

A second approach is being developed in Japan, with assistance from the government and an industrial consortium that includes the Hitachi, Fuji, and Seiko corporations. This attempt to improve DNA sequencing emphasizes robotics and automated processing of samples. The automated steps begin with subcloned DNA fragments and carry them through the sequencing reactions. One such prototype performs more than 30 steps in the complex set of reactions that are required in the chemical sequencing strategy. Each step is controlled by a microcomputer. The maximum daily output of a single instrument is a sequence of 5,000 nucleotides. Current work emphasizes the organization of the overall sequencing experiment into a production line.
The goal of this approach is to establish an automated facility able to sequence 1 million nucleotides per day at a cost of approximately $0.20 per nucleotide (Wade, 1987). This cost does not include the substantial cost of preparing the DNA fragments to be sequenced. The production-line approach would feature both automated and manual steps, with those operations most amenable to mechanization automated.

A third approach, called multiplex sequencing, depends less on automation and more on increasing the amount of sequence data that can be obtained from one set of chemical sequencing reactions fractionated on a single sequencing gel. Each sample analyzed contains a mixture of 40 or more DNA samples, each of which has been marked with a unique short nucleotide sequence (an oligonucleotide sequence). After the normal chemical sequencing reactions have been completed, the unlabeled samples are fractionated on a standard sequencing
gel, and the separated DNA fragments are transferred to a membrane on which the spatial patterns of the fragments formed during electrophoresis are preserved. The sequencing ladder for each individual sample can then be successively visualized by a series of DNA-DNA hybridization assays, each using a different radioactive oligonucleotide as a DNA probe that is specific for the reference end of one particular subclone in the mixture. In principle, if a dozen sets of 40 mixed samples each are subjected to this analysis on a single gel and each sample can generate 250 nucleotides of DNA sequence, then a sequence of 120,000 nucleotides can be derived from one set of chemical sequencing reactions by using sequential hybridization with the membrane produced by this method (G. Church, Harvard University, personal communication, 1987).

All these methods utilize the chemistry or enzymology of current sequencing procedures. An intriguing question is whether fundamentally more powerful technologies are likely to arise in the foreseeable future. Little research is being directed toward this problem. The most obvious possibilities for future sequencing techniques would be the use of sensitive physical methods such as mass or fluorescence spectroscopy, magnetic resonance detection, and electron microscopy. These might be used in combination with each other or with more conventional biochemical fractionation methods. The disparity between the capabilities of the current technology and the magnitude of the work required to sequence the human genome suggests that fundamentally different technologies deserve serious exploration.

OPTIONS AND RECOMMENDATIONS

The committee considered three options regarding the initiation of human genome DNA sequencing. The first is to begin a large-scale initiative immediately in one or a few large centers devoted to DNA sequencing with current technology.
This option might be expected to include the establishment of an independent institute whose goal would be the mapping and sequencing of the genome as quickly as possible. The second option is to make a strong commitment to develop better DNA cloning, sequencing, and data analysis technologies by supporting smaller-scale pilot projects (e.g., sequencing 1 million nucleotides in 1 year), while allowing investigators to gain practical experience with larger-scale sequencing. These improvements in current technology should be designed to reduce the cost and increase the efficiency of sequencing techniques. The third option is to make no special effort to sequence the human genome, but to
depend on the normal process of science to generate the sequence, knowing that the complete sequence would not be available within the next 20 years, if ever.

As explained in Chapter 3, knowledge of the sequence of the human genome and the genomes of the necessary reference organisms will provide a crucial medical and basic research tool that will be used by the biological and biomedical research community long into the future. The committee concluded that without a special effort to achieve this goal, the desired DNA sequences are not likely to be obtained in the time optimal for future medical and scientific advances, if ever. On the basis of this argument, the committee rejected option 3. In deciding between options 1 and 2, the committee concluded that the high cost and slow rate of sequencing with current technology precluded the initiation of a large-scale sequencing effort at the present time. Therefore, the committee made the following recommendations.

The Project Should Begin with Two Kinds of Studies

Initially, improvements in existing technology and the development of new technology directed toward the long-range goal of a complete human genome sequence should be vigorously encouraged. This effort would include applications of automation and robotics at all steps in cloning and sequencing. It is particularly important to automate the steps of DNA cloning. In this context it is useful to think in terms of trying to achieve 5- to 10-fold incremental improvements in the cost, efficiency, or human labor required for these tasks. Several such incremental improvements are needed to make the sequencing of many important genomes practicable. A reasonable baseline sequence from which to measure initial progress is 150,000 nucleotides, the size of the largest human sequencing project to date.
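One rough way to read "several such incremental improvements" is to count how many successive 5- or 10-fold gains would be needed to move from the 150,000-nucleotide baseline to a whole genome. The sketch below does that arithmetic. It rests on assumptions that are not the committee's own: a genome size of about 3 billion nucleotide pairs, and the simplification that each improvement translates directly into a proportionally larger feasible project. The names `GENOME_SIZE_NT`, `BASELINE_NT`, `PILOT_NT`, and `steps_needed` are hypothetical and chosen only for illustration.

```python
# Illustrative arithmetic only; not a model of actual project planning.
GENOME_SIZE_NT = 3_000_000_000  # assumption: roughly 3 billion nucleotide pairs
BASELINE_NT = 150_000           # baseline project size quoted in the text
PILOT_NT = 1_000_000            # pilot-project size quoted in the text


def steps_needed(start, target, fold):
    """Smallest number of successive `fold`-fold gains taking start to >= target."""
    steps = 0
    value = start
    while value < target:
        value *= fold  # integer arithmetic avoids floating-point rounding
        steps += 1
    return steps


for fold in (5, 10):
    n = steps_needed(BASELINE_NT, GENOME_SIZE_NT, fold)
    print(f"{fold}-fold gains needed from baseline to genome scale: {n}")

# How the pilot target compares with the baseline.
pilot_ratio = PILOT_NT / BASELINE_NT
print(f"Pilot target is about {pilot_ratio:.1f} times the baseline")
```

Under these assumptions the count comes out at a handful of steps in either case, and a 1-million-nucleotide pilot corresponds to a single improvement in the 5- to 10-fold range.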
These technological projects will assist in identifying the rate-limiting step in large-scale sequencing, which at present is believed to be the subcloning step, the one step that has not been automated. However, further improvements in all stages of the procedure from subcloning to the interpretation of sequence data will be required. The awarding of competitive grants to individuals and to larger groups organized into cooperative, multidisciplinary centers is viewed by the committee as the most effective way to achieve these goals.

A second type of pilot study that should be initiated immediately would define as its goal sequencing approximately 1 million nucleotides of continuous sequence (approximately 5 to 10 times what has been achieved to date). Such projects will be important in providing an opportunity for the direct implementation and testing of improvements in existing technology as they arise and the provision of a practical
impetus for the development of new technology. They will also reveal where problems in analysis are likely to arise. For example, will repetitive sequences complicate the assembly of a unique, contiguous sequence? Are some sequences unclonable? Will new genes be identified correctly?

As in the past, human gene sequencing by individual research groups interested in specific genes should be strongly supported by standard research grants. This directed sequencing will provide valuable information about genes that have been identified as important in biology and medicine and should also lead to advances in sequencing technology. However, as the physical map develops and as the cost and efficiency of DNA sequencing improve, ever-larger sequencing efforts taken on by groups interested primarily in the sequence of the genome as a goal in and of itself will evolve.

To Derive the Full Benefit of the Human Genome Sequence Will Require Many New Tools, Including a Comprehensive Database of DNA Sequences from Other Organisms

Comparative sequence analysis has proven a powerful technique for distinguishing those elements of a gene sequence that are highly constrained functionally from those that are not. As explained previously, such analysis can provide insights into conserved regulatory and structural sequences. The availability of extensive sequence data from other organisms will also maximize the likelihood that the counterparts of important human genes will be identified in other organisms where their functions will generally be easier to study. The corollary will also hold: Genes that have been identified as important to other organisms will be found rapidly in the human DNA sequence. Therefore, a project of this type must not be restricted to determining the human genome sequence, but should include genome sequence determination of selected other species as well.
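The idea behind comparative sequence analysis can be sketched in a few lines of Python: given sequences from several organisms that have already been aligned, score each position by how many sequences agree, then report runs of fully conserved positions as candidates for functionally constrained elements. This is a minimal illustration, not a method described in this report. The three sequences are invented toy data, the function names `column_conservation` and `conserved_blocks` are hypothetical, and the sketch assumes the alignment has been produced beforehand (gaps, written as `-`, are simply counted as disagreements).

```python
from collections import Counter


def column_conservation(aligned):
    """Per-column fraction of sequences that share the most common base."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("sequences must be pre-aligned to the same length")
    scores = []
    for column in zip(*aligned):
        bases = [base for base in column if base != "-"]  # gaps never agree
        if not bases:
            scores.append(0.0)
            continue
        top_count = Counter(bases).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


def conserved_blocks(scores, threshold=1.0, min_len=4):
    """Half-open (start, end) runs of at least min_len columns scoring >= threshold."""
    blocks = []
    start = None
    for i, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                blocks.append((start, i))
            start = None
    if start is not None and len(scores) - start >= min_len:
        blocks.append((start, len(scores)))
    return blocks


# Invented toy alignment; the labels are placeholders, not real gene sequences.
toy_alignment = [
    "ATGCGTACCTGAAGT",  # "species 1"
    "ATGCGTTCCTGAAGA",  # "species 2"
    "ATGCGAACGTGAAGC",  # "species 3"
]
scores = column_conservation(toy_alignment)
blocks = conserved_blocks(scores)
print(blocks)
```

With this toy input the sketch reports two fully conserved runs separated by a variable stretch. A realistic analysis would need a proper alignment step and a statistical model of expected divergence, both of which are outside the scope of this illustration.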
DNA Sequence Determinations Require Quality Control

A mechanism of quality control must be developed to monitor the groups that are contributing extensive DNA sequence information. One might consider an external group that functions as the Bureau of Standards does to provide independent quality control. Quality control is critically important to the initiative, and it poses unique technical challenges. The optimum methods of checking DNA sequences are likely to differ from the optimum methods of collecting data; indeed, the sequence-checking method should ideally be experimentally independent of the sequencing method. For example, the presence of
the many restriction enzyme cleavage sites predicted from the DNA sequence could be tested by cleavage of the DNA followed by gel electrophoresis.

To succeed, this project will require careful oversight and coordination among the groups involved in mapping, sequencing, collecting and analyzing data, and a system for distributing samples.

REFERENCES

Alberts, B., D. Bray, J. Lewis, M. Raff, K. Roberts, and J. D. Watson. 1989. Molecular Biology of the Cell, 2nd ed. Garland, New York. In press.
Botstein, D., R. L. White, M. Skolnick, and R. W. Davis. 1980. Construction of a genetic linkage map in man using restriction fragment length polymorphisms. Am. J. Hum. Genet. 32:314-331.
Darnell, J., H. Lodish, and D. Baltimore. 1986. Molecular Cell Biology. Scientific American Books, New York. 1160 pp.
Doolittle, R. F., D. F. Feng, M. S. Johnson, and M. A. McClure. 1986. Relationships of human protein sequences to those of other organisms. Cold Spring Harbor Symp. Quant. Biol. 51:447-455.
Elder, J. K., D. K. Green, and E. M. Southern. 1986. Automatic reading of DNA sequencing gel autoradiographs using a large format digital scanner. Nucleic Acids Res. 14:417-424.
Gilbert, W. 1978. Why genes in pieces? Nature 271:501.
Gilbert, W. 1985. Genes-in-pieces revisited. Science 228:823-824.
Gusella, J. F., R. E. Tanzi, M. A. Anderson, W. Hobbs, K. Gibbons, R. Raschtchian, T. C. Gilliam, M. R. Wallace, N. S. Wexler, and P. M. Conneally. 1984. DNA markers for nervous system diseases. Science 225:1320-1326.
Maxam, A. M., and W. Gilbert. 1977. A new method for sequencing DNA. Proc. Natl. Acad. Sci. U.S.A. 74:560-564.
Sanger, F., S. Nicklen, and A. R. Coulson. 1977. DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci. U.S.A. 74:5463-5467.
Sanger, F., A. R. Coulson, G. F. Hong, D. F. Hill, and G. B. Petersen. 1982. Nucleotide sequence of bacteriophage λ DNA. J. Mol. Biol. 162:729-773.
Smith, L. M., J. Z. Sanders, R. J. Kaiser, P. Hughes, C. Dodd, C. R. Connell, C. Heiner, S. B. H. Kent, and L. E. Hood. 1986. Fluorescence detection in automated DNA sequence analysis. Nature 321:674-679.
Staden, R., and A. D. McLachlan. 1982. Codon preference and its use in identifying protein coding regions in long DNA sequences. Nucleic Acids Res. 10:141-156.
Wada, A. 1987. Automated high-speed DNA sequencing. Nature 325:771-772.
Wu, R., and E. Taylor. 1971. Nucleotide sequence analysis of DNA. II. Complete nucleotide sequence of the cohesive ends of bacteriophage λ DNA. J. Mol. Biol. 57:491-511.