MAPPING AND SEQUENCING THE HUMAN GENOME

6 The Collection, Analysis, and Distribution of Information and Materials

The mapping and sequencing effort will generate more data than any other single project in the history of biology. For example, just to record the 3 billion nucleotides that make up the haploid human genome will require nearly 1 million pages of printed text. Variation between the two chromosomes of each individual (heterozygosity) and among the many human beings (polymorphism) further increases the body of information to be stored, collated, and analyzed. In the conception and planning of any human genome project, close attention must be paid to how the data and experimental material are collected and distributed.

The full set of information to be gained from mapping and sequencing the human genome is potentially more useful than its component parts taken separately. For example, although in principle one can use computer searches to pick out coding sequences that are parts of genes, finding the true beginning and end of a gene and all of its coding and noncoding components may require reference to other data, such as the similar nucleotide sequence from a related, but evolutionarily separated, species such as the mouse. To extract the maximum information from the human sequence, it will also be necessary to search for amino acid homologies with the entire set of all known proteins, regardless of their origin. In addition, extensive searches for regions of similarity of short nucleotide sequences between human genes and their mouse counterparts will be necessary to detect regulatory DNA sequences and other conserved sequences for which functions can then be sought. Finally, the correlation of sequence data with the large amounts of information derived from human genetic linkage and disease studies is needed to derive the molecular basis for human phenotypes. As more DNA sequence information is obtained, our sophistication in interpretation should increase to the point at which a computer search will reveal a fascinating wealth of correlative data concerning almost any new DNA sequence obtained.

The human genome project will differ from traditional biological research in its greater requirement for sharing materials among laboratories. For example, many laboratories will contribute DNA clones to an ordered DNA clone collection. These clones must be centrally indexed. Free access to the collected clones will be necessary for other laboratories engaged in mapping efforts and will help to prevent a needless duplication of effort. Such clones will also provide a source of DNA to be sequenced as well as many DNA probes for researchers seeking human disease genes.

Two different types of centralized facilities will be needed: centers to collect and distribute materials such as DNA clones and human cell lines, and centers to collect and distribute mapping and sequencing data.

The magnitude of the required data storage and distribution effort can be understood by comparing the existing content of facilities that collect and store mapping and sequence data with the anticipated capacity required if the human and other complex genomes are sequenced. For example, the DNA data bank maintained by GenBank and the European Molecular Biology Laboratory (EMBL) contains 15 million nucleotides of sequence data from the entire biological world (viruses, procaryotes, plants, and animals) and includes about 2 million nucleotides of human sequence. The human genome, containing 3 billion nucleotide pairs, is 200 times as large as the sum of these DNA sequences collected from all organisms.
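The scale comparison above can be checked with a few lines of arithmetic. The sketch below uses only figures quoted in the text (3 billion nucleotides, about 1 million printed pages, 15 million nucleotides in GenBank/EMBL, of which about 2 million are human). The constant and function names are illustrative, and the nucleotides-per-page figure is derived from those numbers, not stated in the text.

```python
# Figures quoted in the text (1987-era estimates).
HUMAN_GENOME_NT = 3_000_000_000   # haploid human genome, nucleotides
PRINTED_PAGES = 1_000_000         # "nearly 1 million pages" of printed text
DATABANK_NT = 15_000_000          # all GenBank/EMBL holdings, every organism
DATABANK_HUMAN_NT = 2_000_000     # human portion of those holdings


def scale_summary():
    """Return the ratios implied by the quoted figures."""
    return {
        # Derived value: nucleotides that must fit on one printed page.
        "nt_per_page": HUMAN_GENOME_NT // PRINTED_PAGES,
        # Genome size relative to everything in the data bank.
        "genome_vs_databank": HUMAN_GENOME_NT // DATABANK_NT,
        # Genome size relative to the human sequence already recorded.
        "genome_vs_human_holdings": HUMAN_GENOME_NT // DATABANK_HUMAN_NT,
    }


if __name__ == "__main__":
    for name, value in scale_summary().items():
        print(f"{name}: {value}")
```

The second ratio reproduces the factor of 200 given in the text. The third shows that the human sequence then on file was a far smaller fraction of the genome than the data bank as a whole.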
Moreover, only a few hundred restriction fragment length polymorphisms (RFLPs) have been mapped on the human genome, whereas the target of the genome project is several thousand mapped RFLPs, an increase of more than an order of magnitude. The efficient cataloging, management, and distribution of mapping and sequencing data at levels from one to two orders of magnitude greater than today's must be achieved in pace with data acquisition and are essential for the success of the project.

Fortunately, several prototypic operations are already in place. These include GenBank/EMBL, Mendelian Inheritance in Man, Human Gene Mapping Library, and Centre d'Etude du Polymorphisme Humain, each of which is briefly reviewed below to provide a background for discussion of the much larger efforts that will be needed in the future. There are also repositories of cell lines and cultures, such as the American Type Culture Collection and the Cell Bank in Camden, New Jersey, that have had extensive experience in handling and distributing biological materials. Although these operations are not reviewed here, they should be considered in the development of any materials-handling center.

PRESENT INFORMATION-HANDLING ORGANIZATIONS

GenBank/EMBL

The GenBank/EMBL data bank stores and distributes DNA sequence information. GenBank in the United States and the EMBL data bank in the Federal Republic of Germany share the task of recording, annotating, and distributing all the DNA sequence data published, regardless of the species of origin. Each bank is responsible for monitoring approximately half of the relevant published literature, and once they have entered and annotated the files, they exchange information so that each has a complete holding. The current holdings are about 15 million nucleotides, and the rate of acquisition is currently 7 million nucleotides per year and increasing rapidly (H. Bilofsky, personal communication, 1987). Both banks are finding it more and more difficult to keep up to date.

The backlogs in the entry of published sequence data, a source of frustration within the user community, have two main causes. First, nonelectronic data entry (entry from printed DNA sequences) still accounts for more than half of all data entered, a consequence of policy and organization rather than technology. Authors have yet to be educated about the need to send data either electronically or in magnetic form to the data bases, in part because coordination between scientific journals and the data bases has, until very recently, been nonexistent. The reentry of data from a printed copy of a sequence into a data base is a slow, error-prone process, but in the absence of pressure from journals to authors to provide all sequence data in magnetic form, it has been absolutely necessary.
Second, GenBank/EMBL have not had sufficient support to keep abreast of the gene sequence data being generated by present biomedical research. However, the lessons that have been learned from their experience should be invaluable in setting up and managing a new facility dedicated to the collection of DNA sequence data, which will be an essential component of a human genome project.

Mendelian Inheritance in Man

The Mendelian Inheritance in Man (MIM) project stores and classifies information about human disease phenotypes. MIM is an encyclopedia of gene loci based on human phenotypes, most of them disease phenotypes. It has been maintained at The Johns Hopkins University by Victor McKusick since the early 1960s and has been computerized since 1964. Seven hard-copy editions, all generated from the computer, appeared between 1966 and 1986, and the number of entries has grown during this time from 1,500 to more than 4,100. An attempt is made to assign only one entry per genetic locus; i.e., various phenotypes produced by alleles at one and the same locus (e.g., the beta-globin locus) are allowed only one entry. Inevitably, however, more than one entry has been assigned to many allelic disorders because of the incomplete status of our knowledge; in other cases a disorder assigned one entry subsequently proved to be produced by any one of two or more loci.

Entries have also been created in the catalog for loci for which no Mendelian variation has yet been identified. Most of these are structural genes that have been characterized and mapped by a combination of somatic cell and molecular genetic methods. Collaborative research in the management of this knowledge base at the Lister Hill National Center for Biomedical Communications of the National Library of Medicine has produced OMIM, an on-line version of MIM that is being tested in the clinic and laboratory. OMIM is designed to permit free movement between the text of MIM, gene map information, and a molecular defects list.

Human Gene Mapping Library

The Human Gene Mapping Library (HGML) at Yale University positions genes and DNA landmarks on chromosomes (publication of Howard Hughes Medical Institute, 1986). The HGML consists of a number of separate but interrelated data bases. One of them, the "Map" data base, keeps track of the chromosomal positions of all mapped genes (currently more than 1,200).
This is a dynamic data base: New genes are being entered at an accelerating rate, refinements of previous assignments are continually made, and the relations between gene map positions frequently change. The management of this data base requires constant attention to data input, editorial checking on the validity of the data, and data distribution. The data are maintained with an advanced data-base management system that is operated by a high-speed, large-volume computer. User-friendly menus have been prepared to facilitate access to the data by the inexperienced.
Other data bases within HGML include "Lit," which contains a list of all pertinent literature citations; "RFLP," which contains data on RFLPs; "Probe," which contains data on DNA probes used for mapping; and "Source," which contains information regarding the laboratories from which certain DNA probes or cell lines may be obtained. The HGML data base, together with the scientific community it serves, also strives to maintain a uniform and orderly nomenclature for all mapped genes. It will be important to extend this nomenclature (or another that is agreeable to the scientific community) to other species, such as the laboratory mouse, so that direct comparisons between homologous genes in different species can be made readily.

The HGML also assigns accession numbers to all DNA probes that might be useful as genetic markers. Upon request, researchers active in this field are given unique DNA probe identification numbers (D numbers), so that these probes can be described unambiguously. More than 2,000 probes have been numbered, and rapid growth to more than 100,000 is expected in the years ahead. An extension of this type of system could serve as a logical means of keeping track of the overlapping DNA clones produced by a human genome project.

Centre d'Etude du Polymorphisme Humain

The Centre d'Etude du Polymorphisme Humain (CEPH) coordinates an international RFLP mapping effort using data from standard families (Marx, 1985; Dausset, 1986). CEPH, created by Jean Dausset in Paris, differs from MIM and HGML in being a collaborative research effort that both generates and stores human mapping data. As discussed in Chapter 4, CEPH maintains lymphoblastoid cell lines and sends samples of DNA from these cells to collaborators in Europe and North America.
In return, the recipients agree to test their RFLP probes on all the so-called "informative" families for each probe (the families in which two alleles of the particular RFLP are present). Collaborating members of CEPH are required to submit to Paris all of their RFLP probe mapping data in a prescribed, uniform format. CEPH then maintains a common data base to which members of the project have rapid access, which thereby allows members to place their own RFLP probes on a common linkage map. Through this collaborative project, the work of several laboratories on different continents is coordinated toward a common goal, which can be achieved much more rapidly than it could be in any one laboratory alone.
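The notion of an "informative" family can be made concrete with a small sketch. It assumes a hypothetical record layout (each family maps "father" and "mother" to a genotype, and each genotype is a pair of allele labels at one RFLP) and takes the text's definition literally: a family is informative for a probe when two different alleles of that RFLP are present. Applying the check to the parents only, like the function names and the made-up genotypes, is an assumption for illustration and not a description of CEPH's actual procedure.

```python
from typing import Dict, List, Tuple

Genotype = Tuple[str, str]      # two alleles at one RFLP locus
Family = Dict[str, Genotype]    # hypothetical layout: {"father": ..., "mother": ...}


def is_informative(parents: Family) -> bool:
    """True when at least two distinct alleles occur among the parents."""
    alleles = {allele for genotype in parents.values() for allele in genotype}
    return len(alleles) >= 2


def informative_families(panel: Dict[str, Family]) -> List[str]:
    """Return the sorted IDs of families worth typing with this probe."""
    return sorted(fid for fid, parents in panel.items() if is_informative(parents))


# Made-up genotypes for one probe; "A1" and "A2" are placeholder allele labels.
example_panel = {
    "F01": {"father": ("A1", "A2"), "mother": ("A1", "A1")},
    "F02": {"father": ("A1", "A1"), "mother": ("A1", "A1")},
    "F03": {"father": ("A2", "A2"), "mother": ("A1", "A1")},
}

if __name__ == "__main__":
    print(informative_families(example_panel))
```

In linkage work a stricter filter is commonly applied, requiring at least one heterozygous parent; under that rule a family like F03, in which both parents are homozygous for different alleles, would be excluded. The sketch keeps the looser reading because that is how the text words the definition.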
MAPPING DATA BASES REQUIRED FOR A HUMAN GENOME PROJECT

The Collaborative Facilities Needed To Generate an RFLP Map Must Be Expanded

One of the early goals foreseen for the human genome project is an RFLP map in which the average separation of markers is 1 cM. CEPH provides an example of how international collaboration, involving both the exchange of materials (DNA samples) and data (each group's probe-mapping results), can be organized for the production of an RFLP map held in common. However, to achieve a 1-cM RFLP map in a timely fashion (5 to 10 years), either CEPH must be expanded substantially or a new organization must be created and modeled along its lines, with the following objectives:

· A significant increase in the number and diversity of origin of families.
· Identification of several thousand new RFLP probes and their use to screen the set of DNAs obtained from these families (requiring either more laboratories or the enlargement of existing ones).
· A major increase in DNA production facilities because of the increased number of families and RFLP probes to be used with these DNAs. At present, CEPH maintains stable lymphoblastoid cell lines derived from each of its 600 participants. It grows batches of the cells, extracts the DNA, and distributes it. More than one center may have to be established to grow the cells and to produce and distribute DNA.

At present, the laboratories collaborating with CEPH do not have to make available to the project their RFLP probe DNAs; they need only provide the data obtained with them. This helps to make the CEPH collaboration successful by avoiding constraints that might otherwise restrain participation. In the future, however, rules concerning the general availability of RFLP probes will have to be decided within the context of a human genome project.
If RFLP mapping is done under contract by commercial enterprises, some of which already have considerable experience in the field, the contract should stipulate that there be open access to all the probes that are developed.

All Human Map Data Should Be Accessible from a Single Data Base

In the major mapping data base associated with the human genome project, it will be necessary to keep track of the map positions, literature references, and material distribution sources for all identified landmarks in the human genome, including the DNA clones in the ordered clone collection. This can best be accomplished by having a single centralized data base that is easily accessible to the entire scientific community.

A large data facility will be needed to manage this information. Initially, this facility would be responsible for integrating all RFLP mapping and DNA clone data, which would include all the information now in the MIM and HGML data bases. Once a human genome project begins generating large amounts of data, the annual management costs of mapping data bases are likely to rise from the total of $800,000 currently spent by MIM and HGML to perhaps $5 million. Whether the mapping data bases that are unified by a single management organization should also be housed under one roof is an open question. During the first stages of the project, and as long as MIM and HGML are electronically linked, it may be more practical to leave them in different institutions.

A Material Collection and Distribution Facility for Ordered Sets of Cloned DNA Fragments Will Be an Important First Stage in Any Sequencing Project

The representation of the physical map in a DNA clone collection is immediately useful in that DNA segments of unknown origin can be located on it either by hybridization or by fingerprinting methods. Ultimately, the best physical map is the complete set of all such materials along with the information data bases described above. A separate dedicated facility will be required if these materials are to be made readily accessible to the entire user community.

Maintaining a facility that collects, organizes, and distributes all the available DNA clones generated by mapping efforts will be a major task. Further study will be needed to determine exactly how this facility should operate.
At one extreme, one could imagine that such a facility would merely store DNA clones received from all participating laboratories (as DNA, as bacterial viruses, or as yeast cells carrying artificial chromosomes), index them according to some reasonable plan, and then redistribute them for a standard fee in response to specific requests from scientists. Because of the very large number of clones expected in the collection (at least several hundred thousand versus the 42,000 items now at the American Type Culture Collection), this aspect of the task will require major organizational efforts like those of a large mail-order company. In addition, stocks will have to be replenished at intervals to keep the collection adequately supplied with materials. Because of possible clone instabilities, both these regrown stocks and each new stock received will require checking (by restriction enzyme fingerprinting or some other high-resolution method) as a standard quality-control procedure.

An additional possible routine role for the central facility includes converting large human DNA fragments cloned as artificial chromosomes into more readily accessible bacterial virus or cosmid DNA clone collections. The facility could also take all the DNA clones that have been mapped elsewhere by a variety of different procedures and fingerprint them by a single method to provide a standard indexing procedure.

One can also envision a central facility that would actually help with the mapping effort; this type of facility could establish a single standard protocol for characterizing each DNA clone (for example, a standard restriction enzyme fingerprinting method) and collect and analyze the data provided by each of the participating laboratories to search for new overlaps. At present, mapping methods are in a state of flux, and many competing approaches are being tried in different laboratories. Any mapping role for a central facility should therefore be delayed until a reasonable consensus can be reached on the best way to proceed.

The cost of constructing and operating a storage and distribution facility will be high. Once the full range of clones has been generated, costs of as much as $250 million for 30 years of operation have been estimated (Stevenson, 1987).

A DNA SEQUENCE DATA BANK DEDICATED TO A HUMAN GENOME PROJECT

A Concerted Initiative Aimed at Determining the Sequence of the Human Genome Will Generate Large Amounts of DNA Sequence Data

Not only will there eventually be many billion nucleotides of human DNA sequences, but also there will be large tracts of sequence from the mouse genome that can be used for comparisons between the two species.
In early stages of the sequencing portion of the project, it is likely that the genomes of experimental model organisms such as E. coli, yeast, the nematode, and Drosophila will be completely sequenced. If the project is to succeed, all data on large amounts of contiguous DNA sequence should be collected and distributed by a dedicated DNA sequence bank.

Fortunately, the amount of data associated with a human genome project is well within present disk storage and computer hardware capacity. Many government agencies, as well as the business world, are storing and handling significantly larger volumes of data. The difficulties will be encountered in the entry and classification of the data and even more so in their analysis and distribution to the international scientific community. An important goal of the entire endeavor should be to make available the information in a form that will benefit a very large portion of the biomedical and basic research community as quickly as possible.

All Data Must Be Entered Electronically or Magnetically

From the outset, all sequence data must be entered into the DNA sequence bank by electronic or magnetic means. Moreover, the human genome project can circumvent many of the problems experienced by GenBank/EMBL by establishing a standard features format implemented at the point of data collection with the intention of expediting data entry. For example, all submitted data blocks could be packaged with references by the sender to data source, DNA clone number, chromosome region, and other factors. Since most data will probably be sent from fewer than two dozen research laboratories, the chances of entering spurious data from inexperienced investigators will be low. Nonetheless, there must be standards that set a minimum length of contiguous sequence suitable for submission and ensure quality control with regard to the frequency of errors in the accepted sequence.

An Initial Analysis Should Be Performed by a Central Facility

Not unexpectedly, many different points of view exist about how, where, and when the large amounts of data in the genome sequence ought to be processed. New computers are constantly appearing, and new strategies for using them are always evolving. The most important analyses will no doubt be done by people interested in specific types of proteins, regulatory sequences, evolutionary processes, and so forth.
However, some analysis should also be performed at the central facility to help in classifying the data for future research. Exactly how the data are to be analyzed might be tied to the number of centers or laboratories collecting the data, the kinds of staffing provided at a central facility, and the scope of the immediate data dispersal, i.e., whether it is national or international.

An Example of an Initial Sequence Analysis

The strategy of data analysis will have to evolve as data accumulate, but the primary question will always be whether a particular sequence is an important island of information or just part of a surrounding ocean of chaos. Accordingly, the incoming data might be screened for repeated sequences. Even the most interesting parts of the human genome, the 50,000 to 100,000 genes, are going to be redundant, inasmuch as there will be many large families of closely related genes. Genes expressed in the central nervous system, for example, which may account for 40 percent of all genes, are almost certainly going to include many such families. Some type of screening can help catalog the incoming data from the start and determine where and how the data should be stored.

To encourage the timely submission of data, all data submitted by the sequencing centers should be rapidly returned to them in a processed form for inspection and verification; each center should also be kept aware of progress at the others.

Establishing an Efficient Computer Network

Many possible computer arrangements would suffice for the jobs described. It seems reasonable, however, to begin the operation on a modest level, with the intention of scaling up over several years. For example, the operation could easily be initiated with local computers that are connected with the National Supercomputer Network. In this model, the data collected at a sequencing center would be fed into a local computer, checked and entered into a features table, and then transmitted over the Supercomputer Network (which is especially good for high-speed transmission of large amounts of data) to the central DNA sequence bank. There, an analytical facility, which would probably use parallel computers at some future date, could handle the early stages of data analysis. The screened data would then be transmitted back to the various collection centers for verification. Once verified, the data would become available to the scientific community, moving through the Supercomputer Network to various local distribution points.
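The data flow described in this section (local checking, entry into a features table, a central screen for repeated sequences, then return to the submitting center for verification) can be sketched as follows. Everything named here is hypothetical. The `Submission` fields come from the items the text suggests packaging with each data block, while the minimum length, the word size, and the repeat threshold are placeholder values, since the text calls for such standards without fixing them. Counting repeated short words (k-mers) is used as one simple stand-in for a repeat screen, not as the method the committee had in mind.

```python
from collections import Counter
from dataclasses import dataclass

MIN_CONTIG_LENGTH = 20   # placeholder; the text sets no number
KMER_SIZE = 4            # placeholder word length for the repeat screen
REPEAT_THRESHOLD = 3     # a k-mer seen at least this often is flagged


@dataclass
class Submission:
    """One data block plus the sender-supplied features the text lists."""
    data_source: str
    clone_number: str
    chromosome_region: str
    sequence: str


def local_check(sub: Submission) -> list:
    """Checks run at the sequencing center before transmission."""
    problems = []
    seq = sub.sequence.upper()
    if len(seq) < MIN_CONTIG_LENGTH:
        problems.append("sequence shorter than minimum contiguous length")
    if set(seq) - set("ACGTN"):
        problems.append("sequence contains characters other than A, C, G, T, N")
    for name in ("data_source", "clone_number", "chromosome_region"):
        if not getattr(sub, name).strip():
            problems.append(f"missing feature: {name}")
    return problems


def repeat_screen(sequence: str, k: int = KMER_SIZE,
                  threshold: int = REPEAT_THRESHOLD) -> dict:
    """Central-facility screen: k-mers occurring at least `threshold` times."""
    seq = sequence.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {kmer: n for kmer, n in counts.items() if n >= threshold}


def process(sub: Submission) -> dict:
    """Build the report that would be sent back to the center for verification."""
    problems = local_check(sub)
    if problems:
        return {"status": "rejected", "problems": problems}
    return {"status": "awaiting_verification",
            "repeats": repeat_screen(sub.sequence)}


if __name__ == "__main__":
    # Made-up example record; the labels carry no real meaning.
    demo = Submission("Center A", "clone-0001", "7q31",
                      "ACGTACGTACGTTTGACCAGGTA")
    print(process(demo))
```

Keeping the local check separate from the central screen mirrors the division of labor in the text: the submitting center catches formatting and completeness problems before transmission, and the central facility adds only the analysis that benefits from seeing all incoming data.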
The Need for Research on Data Analysis

We are only at the beginning of learning how to use computers to interpret DNA sequence information. New ways of searching DNA sequences will need to be designed as we learn more about such subjects as the binding sites for gene regulatory proteins, the rules that regulate RNA splicing, RNA secondary structure, and the effects of specific amino acid replacements on protein folding. In the future, we can expect to learn a great deal more about genes from their sequences than is possible today. A human genome project should therefore encourage the activities of those who combine skills in computer programming and biology; these individuals will be needed to generate the DNA sequence search routines of greatest utility to the biological community.

The Estimated Cost

Although it is difficult to predict the cost exactly, approximately $5 million per year might be set aside for the sequence information facility. The largest part of the costs related to information handling will undoubtedly be devoted to professional staffing. Beyond that, funds will also have to be made available to develop software and to provide education and training to ensure further innovations in computer use in biology.

It is essential to keep the conventional data bases, including GenBank/EMBL, fully operational for the next several years, in particular to ensure comprehensive collection of sequence data from nonhuman sources. However, the time will come when sequence data from all sources will have to be melded into a single large, efficient facility.

CONCLUSIONS

More than any other part of the human genome initiative, the handling of information and material will require organization and standardization. A single unified policy must prevail if the information is to be accurately acquired, stored, analyzed, and distributed. There must be a central facility for tracking and distributing the experimental materials, and there must be a dedicated computer center for storing, checking, screening, and searching the sequence and mapping data. The establishment of these facilities will be critical and will require careful advance planning. The committee recommends a competition in which all interested groups submit detailed applications or pilot program trials.

REFERENCES

Dausset, J. 1986. Le centre d'etude du polymorphisme humain. Presse Med. 15:1801-1802.
Marx, J. L. 1985. Putting the human genome on the map. Science 229:150-151.
Regional Localizations of Genes and Genetic Markers to Chromosomes and Subregions of Chromosomes. 1986, Number 1, HGM8. Howard Hughes Medical Institute Human Gene Mapping Library, New Haven, Conn.
Stevenson, R. E. Cited by L. Roberts, 1987. Human genome: Question of cost. Science 237:1411-1412.