Evaluating Human Genetic Diversity

4
Sample Collection and Data Management

Among the most important benefits to be derived from an international effort to study the extent of human genetic variation is the establishment of repositories of specimens and collections of data relevant to them. These repositories can fulfill a potentially important role in human genetic and anthropologic research only if the collection and management of the samples and data are well standardized and appropriate to the needs of a large body of investigators. Ideally, the specimens would be so collected and stored as to accommodate the future developments in molecular biology that can be anticipated today. It would, indeed, be tragic if after a few years it were found that the specimens that had been stored were no longer suitable for emerging research needs. This chapter reflects the committee's thoughts on how those ends can be best served generally, but detailed guidance was not part of the committee's charge. Accordingly, given the importance of this potential resource, the committee recommends that a panel be convened to provide detailed guidance before a major sampling effort and specimen collection are begun.

SOURCES OF DNA TO BE SAMPLED

Peripheral Blood

DNA can be prepared from almost all human cells or tissues. Peripheral blood can be easily and relatively painlessly obtained from an arm, a leg, or even an earlobe. Samples can be stored at room temperature for up to a week without loss of quality of the DNA. Extraction is routine; various protocols are common to most molecular-genetics laboratories. The amount of DNA that can be obtained from a 10-milliliter (10-mL) blood sample is about 500 micrograms (µg, a millionth of a gram). One genotyping assay with the polymerase chain reaction (PCR) requires 10-50 nanograms (ng, a thousandth of a microgram) of DNA, so each standard blood sample should permit 10,000-50,000 genotypings, potentially a coverage of about 1 marker per 60-300 kilobase (kb) pairs of the human genome. This marker density should be more than sufficient for most genetic analysis.

The number of possible assays can be substantially increased if care is given to the technical details of the PCR assays. For example, 10 times as many assays are possible if multiple rounds of PCR with more than a single primer pair are carried out (called hemi-nested or full-nested PCR). Such methods are routinely used for analyzing DNA samples of less than 1 ng (see, for example, Leeflang and others 1995). If 1-5 ng were used in each assay, 100,000-500,000 assays could be carried out on each 500-µg DNA sample. A 1-ng DNA sample contains about 300 copies of each single-copy gene, so both alleles at a heterozygous locus would be sufficiently represented to allow accurate genotyping of the sample (Navidi and others 1992).

Amplification of more than one marker at a time in a sample (multiplexing) is readily achieved and is routine in many laboratories. Depending on the effort devoted to working out the required conditions, the number of loci examined in one assay could increase by a factor of 5-30. In summary, careful design of the PCR amplification strategy by using procedures that are available today could allow 500,000 assays to be carried out on 500 µg of DNA isolated from 10 mL of blood (5 loci multiplexed on each 5 ng of DNA).

Other Tissues

When blood sampling is not feasible because of cultural or technical difficulties, alternative sources of DNA can be considered.
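The yield figures given in the Peripheral Blood subsection reduce to simple arithmetic that applies to any DNA source, which makes it easy to compare sources by the number of assays they support. The following Python sketch reproduces those figures. The function names are illustrative, and the genome size of 3 billion base pairs is an assumption chosen to be consistent with the 60-300 kb spacing quoted above; neither comes from the committee's text.

```python
# Assumed approximate haploid human genome size (not stated in the chapter).
GENOME_BP = 3_000_000_000


def assays_per_sample(dna_ug, ng_per_assay, loci_per_assay=1):
    """Number of locus genotypings supported by a DNA sample.

    dna_ug         -- total DNA in micrograms
    ng_per_assay   -- DNA consumed by one PCR assay, in nanograms
    loci_per_assay -- loci typed per assay (multiplexing factor)
    """
    total_ng = dna_ug * 1000  # 1 microgram = 1000 nanograms
    return int(total_ng // ng_per_assay) * loci_per_assay


def marker_spacing_kb(n_markers, genome_bp=GENOME_BP):
    """Average spacing, in kilobases, if n_markers were spread evenly."""
    return genome_bp / n_markers / 1000


# Figures for 500 micrograms of DNA from a 10-mL blood sample.
low = assays_per_sample(500, 50)             # 50 ng per assay
high = assays_per_sample(500, 10)            # 10 ng per assay
multiplexed = assays_per_sample(500, 5, 5)   # 5 ng per assay, 5 loci each

print(low, high, multiplexed)
print(marker_spacing_kb(low), marker_spacing_kb(high))
```

Substituting a smaller DNA mass for the first argument shows how quickly the assay budget shrinks for a source that yields less DNA than a 10-mL blood draw.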
Buccal cells have been successfully used for DNA analysis in many different applications. Surface epithelial cells are collected from the side of the oral cavity with a sterilized scraper, and up to several micrograms of DNA can be extracted. DNA can also be prepared from hair follicles, but the amount of DNA recoverable is less than that with buccal sampling.

Transformed Cell Lines

Peripheral blood can also be used as a source of white blood cells for the establishment of transformed cell lines. A subset of white blood cells (known as B cells) can be transformed by Epstein-Barr virus (EBV) infection into permanent cell cultures in the laboratory. For EBV transformation of B cells, the white blood cells are separated from the red cells by a simple sedimentation procedure and mixed with an inoculum of virus preparation. The culture is then incubated at 37°C without disturbance for 2-3 weeks. During this time, a subpopulation of the cells begins to proliferate, and in about 1-2 months a permanent culture is established.

The transformation process is not always successful, however. Besides possible variability in the handling of blood samples and in cell-culture techniques, there is intrinsic variation in the number of B cells among people and among samples from a given person. The overall transformation success rate in a highly experienced laboratory is about 95% with 10 mL of blood. Transformed cells can be stored in liquid nitrogen and returned to culture later, rarely with any difficulty.

It is already possible to immortalize other cell types, such as T cells (a particular kind of white cell) from blood and fibroblasts from skin. Development of methods for establishing cell lines from buccal samples or hair roots would be of great value.

The DNA extraction protocols for transformed cells are essentially the same as those for blood samples. These cells can be readily transferred from laboratory to laboratory. DNA can then be extracted from the expanded cell cultures to provide each laboratory an essentially unlimited supply of DNA for genetic analysis.

The advantage of having cell lines is not restricted to providing an unlimited source of DNA for marker analysis. Some questions related to biomedical applications could not be addressed simply through DNA marker analyses. For example, cell lines are needed to produce large intact DNA fragments for long-range physical mapping studies and genome analyses at the chromosomal level.

Transformed cell lines are preferred for collecting samples for human genetic-variation research.
That is particularly true for populations that are small and hard to sample but that might yield interesting information about human prehistory. Either because these populations are in danger of disappearing or being extensively admixed with their neighbors or because sampling them repeatedly might be both intrusive and impractical, it is important to have a sample that can support extensive work without a need for resampling. However, establishing transformed cell lines routinely can impose a substantial financial burden on the study of human genetic variation and is not recommended except in extenuating circumstances, such as those just cited. The long-term cost for cell-line storage and maintenance is substantial. There is an urgent need for technologies for cheaper and more-reliable cell immortalization. Collections will be made in many remote areas, so methods that allow successful transformation after long storage under nonideal conditions would be particularly important. Transformation efficiency decreases greatly with time between blood collection and arrival in the laboratory. Finally, biosafety guidelines have to be strictly enforced to protect laboratory personnel, especially when they handle tissues or body fluids collected from different parts of the world.

The use of transformed cell lines, however, poses some difficulties for variation research. Using transformed cells in some cases could affect the usefulness of short-tandem-repeat (STR) analysis. STR mutation frequencies have been shown to be higher in transformed lines than in cells taken directly from the subject (Weber and Wong 1993). Transformation might also cause large DNA rearrangements in some regions of the genome. The transformation process can also select for particular subpopulations of cells that have undergone specific sequence alterations. Moreover, B-cell cultures are not suitable for answering some fundamental biomedical questions about gene expression. Gene-expression profiles can differ among different transformed cell lines, so caution is in order if they are used to study phenotypic variations at the cellular level.

Some of the advantages of the cell-line technology become less important when new DNA-amplification, cloning, and marker-analysis methods are developed. The amount of genetic information that could be obtained with one 10-mL blood sample could increase several orders of magnitude beyond the 500,000 assays that the current technology would allow. For example, the recent development of a modified method (Cheung and Nelson 1996) of whole-genome amplification might eventually be applied to the original 500-µg DNA sample, generating perhaps as much as the equivalent of 200 mg of total DNA.

CHARACTERIZATION OF GENETIC VARIATION

Early studies of human variation in the 1960s and 1970s established the importance of choosing random loci for assessing population affinities. Including only known polymorphic-marker loci in variation studies leads to difficulties. The results are biased by the failure to account for the fact that not all genomic segments are polymorphic. The rates of change are exaggerated, and the known polymorphisms (first detected in northern Europeans) might not yield unbiased estimates of variation.
Using a large number of known polymorphic markers might be appropriate for making relative comparisons of variation, but it is not appropriate for making absolute comparisons, which are needed for accurate assessment of what has been called the tempo and mode of evolution. Although not now practically feasible (given the large number of polymorphic markers that are available), it will be possible in a few years to assess variation in a random collection of genomic sites about whose polymorphic nature nothing is now known. This type of study is free of much of the bias that afflicts many current studies of human variation. The emergence of nucleotide sequencing as a powerful tool for genetic analysis and the development of technology that allows rapid, accurate, and inexpensive sequencing will likely be a boon to this approach. Note that if random genomic segments are selected from genes and subregions in nuclear genes, repeated DNA, mitochondrial, Y-linked, and X-linked segments, the extent and nature of genetic variation might be related to the various biologic (genetic) differences among them.
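The bias described above can be illustrated with a small calculation. The sketch below is a toy example with hypothetical allele frequencies, not data from any study. It uses the standard expected heterozygosity 2p(1 - p) for a biallelic site and compares the average over randomly chosen sites with the average over only those sites already known to be polymorphic.

```python
def heterozygosity(p):
    """Expected heterozygosity at a biallelic site with allele frequency p."""
    return 2 * p * (1 - p)


def mean_heterozygosity(freqs):
    """Average expected heterozygosity over a list of sites."""
    return sum(heterozygosity(p) for p in freqs) / len(freqs)


# Hypothetical frequencies at ten randomly chosen genomic sites.
# Most sites are monomorphic (p = 0); only three vary.
random_sites = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.3, 0.5]

# A study restricted to known polymorphic markers sees only these sites.
known_polymorphic = [p for p in random_sites if 0 < p < 1]

all_sites_h = mean_heterozygosity(random_sites)
poly_only_h = mean_heterozygosity(known_polymorphic)

print(f"random sites:          {all_sites_h:.3f}")
print(f"known polymorphic only: {poly_only_h:.3f}")
```

In this toy data set the polymorphic-only average is more than three times the average over random sites, which is why known markers can support relative comparisons but not absolute estimates of variation.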

CLASSES OF DNA MARKERS

Recently developed laboratory techniques enable systematic surveys of genetic variation in a wide variety of genomic segments, including coding sequences, noncoding sequences, 5' and 3' untranslated regions, various classes of repeated DNA, and extrachromosomal mitochondrial DNA. They have led to the discovery of various classes of polymorphic markers with different mutational properties. Scientists now have a unique opportunity to design studies, rather than analyze patterns of data gathered for other purposes, that can clarify crucial features of human variation, the relationships between groups, and human micro-evolution.

It is commonly assumed that closely related populations will have similar numbers, types, and frequencies of alleles at any polymorphic locus, the differences increasing with the evolutionary distance between the populations. Thus, loci with a larger (rather than smaller) number of alleles and a uniform (rather than clumped) frequency distribution should be preferred, in that at such loci the probability of chance identity between populations is lower. Moreover, a large number of loci need to be investigated to reduce further the incidence of chance identity.

The polymorphic alleles at a locus arise by mutation, so loci with high heterozygosity are expected, on the average, to have higher mutation rates than loci with lower heterozygosity. Because mutation will have the effect of erasing some of the historical evidence carried by polymorphic loci, loci with varied heterozygosities should be studied; different loci will illuminate different aspects (periods) of human evolution.

Molecular-genetic analyses of human DNA have shown that polymorphic variation in humans arises from all the possible mutational mechanisms (base substitution, insertion-deletion, and localized duplication-deletion), and all of them are useful for measuring variation.
Which specific type of marker is used will depend on the evolutionary questions being asked. Recent studies of human variation have shifted to the use of molecular-DNA markers because they are highly abundant. The Human Genome Project has completed its first goal of the discovery and mapping of human polymorphisms. Over 10,000 polymorphic human loci are known and mapped, and it is expected that variation studies will largely use these markers. In addition, a number of efforts at developing new types of markers are under way. Given their numbers, one can choose sets of markers with defined characteristics, and genotypes can be assayed with one method. Allelic variation includes base substitution and simple insertion-deletion, as well as variation arising from varying numbers of a simple-DNA sequence motif.

Historically, DNA polymorphisms were detected as restriction-fragment length polymorphisms (RFLPs) by using cloned DNA probes, either specific for genes or at anonymous genomic segments. RFLPs are due to base substitution or small insertion-deletion differences that lead to the creation or loss of a restriction-enzyme recognition site. RFLPs usually consist of 2 alleles with an average heterozygosity of 25% and were discovered primarily in northern Europeans in the process of constructing a human genetic-linkage map. They generally have known map locations in the human genome and low mutation rates.

In attempting to search for yet more polymorphic markers, it was recognized that loci at which alleles differed in the number of repeated (tandem) copies of a core DNA sequence (16-72 base pairs) were common in the human genome and highly polymorphic. Several hundred of these variable-number tandem-repeat (VNTR) loci have been discovered and mapped in the human genome; they tend to be in telomeric chromosomal segments. VNTR loci generally have multiple alleles with a heterozygosity exceeding 70% but also a high mutation rate.

Classical RFLP and VNTR loci are assayed with the Southern blotting method, which is tedious, is time-consuming, and requires 5 µg or more of DNA per assay. Although they have been extensively used for gene-mapping studies and in forensic applications, they have seen little use in human variation studies. It is unlikely that RFLPs and VNTRs will be used as in the past, because the assay requires a greater degree of technical skill, access to a cloned probe, and larger quantities of DNA than other contemporary methods.

Current experience from mapping the human genome suggests that the markers with the most-desirable properties are microsatellites, also called simple sequence-length polymorphisms (SSLPs) or short tandem repeats (STRs). Allelic variation at these loci arises from differing numbers of copies of a small tandem repeat, usually dinucleotides, trinucleotides, or tetranucleotides. These markers are very abundant; they occur about once every 30 kb in the human genome, and over 8,000 have been genetically mapped. STRs have multiple alleles with an average heterozygosity of 70% and can be assayed with PCR and very small quantities (nanograms) of DNA.
Moreover, the DNA-sequence information required to synthesize the oligonucleotide primers necessary to analyze a locus is easily obtained electronically from international genome databases. The primers themselves can be inexpensively synthesized de novo or purchased commercially.

Because SSLPs are highly informative, they are useful for a variety of human evolutionary studies, but they have a higher mutation rate than single-nucleotide substitutions. Thus, although useful for some studies, they might not be desirable for all variation studies. As a consequence, there has been renewed interest in developing a large set of human biallelic markers, including RFLPs, that can be assayed with PCR. These polymorphisms, which are expected to occur in the human genome every few kilobases, are abundant and are thought to have low mutation rates. Given their expected development over the next few years as a part of the Human Genome Project, the PCR-based biallelic markers will be polymorphisms of choice. However, it must be noted that there is an inevitable tradeoff between the selection of loci that are highly informative and heterozygous (accompanied by a high mutation rate) and loci that have lower heterozygosities (accompanied by a lower mutation rate). The specific choice will depend on the nature of a given study.
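The tradeoff among marker classes can be made concrete with the standard expected-heterozygosity formula, H = 1 - Σ p_i², where p_i is the frequency of allele i at a locus. The sketch below is illustrative only: the function name and the allele frequencies are assumptions chosen to land near the average heterozygosities quoted above for RFLPs and STRs, not measured values.

```python
import math


def expected_heterozygosity(allele_freqs):
    """Expected heterozygosity H = 1 - sum(p_i^2) at one locus.

    allele_freqs -- allele frequencies at the locus; they must sum to 1.
    """
    if not math.isclose(sum(allele_freqs), 1.0, abs_tol=1e-9):
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in allele_freqs)


# Hypothetical biallelic marker with an uneven frequency split.
biallelic_uneven = expected_heterozygosity([0.85, 0.15])

# Best case for any biallelic marker: two equally frequent alleles.
biallelic_even = expected_heterozygosity([0.5, 0.5])

# Hypothetical multiallelic STR-like locus with five alleles.
multiallelic = expected_heterozygosity([0.5, 0.2, 0.1, 0.1, 0.1])

print(f"{biallelic_uneven:.3f} {biallelic_even:.3f} {multiallelic:.3f}")
```

Because a two-allele locus can never exceed a heterozygosity of 0.5, biallelic markers carry less information per locus than multiallelic STRs, and more of them must be typed to reach comparable power.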

Genetic variation in human populations can be studied with respect to specific genes or anonymous segments of DNA. A considerable body of literature shows that variation associated with specific genes is less than that associated with anonymous segments; but even within genes, variation can depend on whether coding sequences (and whether the first, second, or third position of a codon) or noncoding regions (5', 3' untranslated regions or introns) are studied. The extant variation within and between human populations is the outcome of both natural genetic processes in the genome and population-genetic factors that maintain them. Thus, whether genetic variation is evaluated in genes or in anonymous segments ultimately will depend on the specific questions being asked. In general, both types of variation will be studied to assess patterns of variation in gene versus nongene regions and can be used to answer questions about the importance of natural selection in the shaping of human genomic variation.

In addition to collecting, storing, and distributing DNA samples, it might be appropriate to analyze systematically all or a portion of the DNA samples at a specified set of loci. Such an analysis could provide a balanced data set appropriate for making inferences about the historical relationships of human populations. Existing variation studies have established the importance of studying mitochondrial, Y-linked, X-linked, and nuclear variation. Each has a unique set of genetic characteristics, including mode of inheritance, mutation rate in male and female germ lines, and occurrence and rate of recombination. Each will provide a different view of human variation because they will illuminate the role of different genetic processes in the generation and maintenance of variation. Several kinds of markers might be surveyed, and these are considered below.
DNA polymorphisms that are based on variation in the number of tandem repeats at a locus are detected with electrophoretic methods. These polymorphisms include microsatellites (SSLPs) and VNTRs. Many of the techniques used for studying them are used in the Human Genome Project. Oligonucleotide primers that flank a particular polymorphic region are used in PCR. The PCR products can be labeled with radioactive or fluorescent tags. The sizes of the PCR products (and therefore the specific alleles) are usually determined by measuring their mobility in acrylamide gels. DNA-sequencing machines provide an automated approach to the size analysis of fluorescence-labeled PCR products. Newer technologies, such as electrophoresis in arrays of microcapillary tubes, might increase the speed of electrophoretic analysis.

DNA POLYMORPHISMS BASED ON SINGLE NUCLEOTIDE SUBSTITUTIONS

DNA polymorphisms based on simple nucleotide substitutions and insertions-deletions are easily analyzed with PCR. Once the polymorphic region is amplified, several current methods can identify the particular alleles in the product. If the alleles vary in the presence or absence of a particular restriction-enzyme site, allele detection can consist of enzyme digestion followed by electrophoresis. The alleles could also be identified without electrophoresis by using radioactivity- or fluorescence-labeled oligonucleotide hybridization probes that are specific for each allele ("dot blots").

An especially promising hybridization protocol in active development involves the so-called "DNA chip" technology. Small silicon chips containing hundreds to tens of thousands of oligonucleotide probes at known locations are constructed. Pools of PCR product from many different loci are annealed to the chip. The alleles at each locus in the sample are identified by determining the exact locations on the chip where the different products have annealed. Alleles can be identified rapidly at many loci simultaneously. Other potentially automated methods of allele typing include oligonucleotide ligation, microsequencing, and real-time quantitative allele-specific PCR (a technique known as TaqMan) if high throughput at fewer loci is desired. Allele identification by direct sequencing of PCR product has the advantage of detecting every kind of genetic variation, both previously known and unknown.

Short-cut methods of studying regions previously uncharacterized for polymorphisms might also prove of value. PCR products can be analyzed with several electrophoretic methods that are capable of detecting simple nucleotide substitutions and insertion-deletion differences between different samples, for example, single-strand conformation polymorphism, denaturing-gradient gel electrophoresis, and chemical cleavage. It should be noted that unless all measures of variation are at the level of a specific nucleotide sequence, there is always a chance that some allele-typing data will be compromised.
Thus, 2 allele-specific hybridization probes for a biallelic polymorphism would incorrectly determine the genotype of a sample if it contained 1 known allele and 1 unknown allele at the same nucleotide or near it, because neither of the probes designed to hybridize to the known alleles would hybridize to the new allele.

SHOULD THERE BE A CORE SET OF DNA MARKERS THAT WILL BE SCORED FOR ALL SAMPLES IN THE REPOSITORY?

It has been proposed that a common "core" set of genetic markers be genotyped in each sample accepted into the repositories, either as a repository activity or by individual scientists. The main reason for supporting this core marker genotyping effort would be the uniformity of assay conditions and data interpretation. An important consequence of core genotyping would be that a balanced and well-designed data set would be made available for statistical analysis of hypotheses concerning human evolutionary history, population genetics, and genetic epidemiology. Although that is a desirable goal, a number of difficulties surround the core genotyping concept. The types of markers (STRs and single-nucleotide polymorphisms, or SNPs), their chromosomal origin (X, Y, autosomal, and mitochondrial), the numbers of markers used, and the number of samples analyzed will depend on the specific questions asked and will lead to a long and ever-increasing number of lists of possible core genotyping experiments. These will be difficult to identify in advance and in constant need of revision as the questions and technologies evolve.

Recent technologic advances, however, ensure that a large number of common markers will be genotyped in many people in multiple populations. Genotyping DNA samples took considerable effort in the recent past. An investigator had to make a major commitment to generating these data, and this required substantial funds. It would have been difficult to persuade any scientist not specifically interested in the genotyping results to carry out additional genotyping experiments. A variety of new high-throughput, multiplexed, inexpensive, and robust technologies are under development that could serve this additional genotyping. The use of one type of DNA "chip" in DNA typing has already been established (Ansari-Lari and others 1997; Chee and others 1996; Kozal and others 1996). Recent experiments suggest that it is possible to genotype 250 biallelic polymorphic-marker loci on a single DNA chip; it is likely that within a year chips with 2,000 such markers will be available (Wang and others 1996). The availability of such chips will make it very likely that most investigators will use them rather than conventional methods with marker sets that are specific to each investigator and vary widely among them. Another advantage is that in a single assay both anonymous markers and genes can be genotyped simultaneously, with large savings in labor and thus costs. As a consequence, all populations of any biologic or anthropologic interest can be genotyped with the same set of markers.

We conclude that investigator-initiated efforts will naturally result in comprehensive screening of human genome variation. This is a better and more-flexible strategy than establishing a core set of markers in the very structure of the project.

RESEARCH-MATERIALS MANAGEMENT

The specimens acquired in the course of a coordinated human genetic-variation research effort will be its most valuable and enduring resource. It can reasonably be expected that dramatic advances will occur in laboratory-analysis technologies because of current research investments in robotics, automation, and high-speed DNA-marker detection and DNA-sequencing methods. It is plausible to envision that rapid, automated determination of large portions of a person's genome might become a routine laboratory procedure early in the next century and that such future capabilities will heighten the scientific value of stored sample collections representing populations throughout the world. Special attention must therefore be given to the management of the specimens acquired as part of an organized genetic-variation program.

Three general models exist for the acquisition, processing, storage, and dissemination of biologic specimens: fully decentralized, centralized, and regional. Each has its inherent advantages and disadvantages.

The fully decentralized model represents only a minimal change from the current status of human genetic-variation research. Specimens are acquired and stored locally by independent investigators, who perform different marker assays relevant to their own research interests on the samples and might or might not allow others to have access to their research materials. Such a scheme maximizes investigator autonomy and control and has the strength that investigators might have personal knowledge about the population studied that helps to inform the design of new and related research projects. The existence of many repositories and the involvement of many persons increase the likelihood of scientific innovation, and the degree of independence and autonomy afforded to investigators is an incentive to participate in a large-scale coordinated effort.

However, the fully decentralized model has inherent disadvantages. It is relatively inefficient in that it encompasses potentially large numbers of replicated laboratory and storage facilities. Quality-control procedures are difficult and expensive to implement, and availability of samples can be subject to loss of key personnel or other unpredictable events because some participating laboratories are small. The selection of samples available for sharing can also be subject to cultural biases, representing a single investigator's personal scientific agenda. In the absence of coordination between sites, a request to acquire samples with a particular set of characteristics might have to be made to many sites and investigators simultaneously.

At the opposite end of the spectrum, a global genetic-variation research program might be built around a single coordinating site with a single laboratory that received specimens submitted from around the world and that applied a standardized set of genetic tests to all samples.
This centralized model would be expected to take advantage of economies of scale in specimen-handling and would have a lower relative cost than replicating analysis and storage facilities in many sites. In addition, quality control of analyses and control of access to specimens would be confined to a single site, and the auditing of numbers of specimens sent out and to whom they are sent would be simplified.

The natural efficiencies of a centralized model are offset by many disadvantages. Primary among these would be the perception that the project was the province of a single country or a single group of investigators and the real or imagined exclusivity that might result. The institutionalization of the effort, with involvement of fewer people, might stifle innovation. A single repository would also potentially represent a single point of catastrophic failure: a natural disaster or other calamity might result in the loss of all acquired research materials.

The most versatile model for a coordinated effort funded by US agencies would involve the establishment or designation of a relatively small number of regional centers in different parts of the United States, similar to the arrangement for distributing reagents used in the Human Genome Project. Because the scope of human genome variation research is global, it would be optimal for the US effort to cooperate and consult at the international level with respect to establishing centers in other countries. Multiple centers would have the advantage of providing backup in the storage and processing of specimens and would serve as foci for development of special expertise. By their location in diverse regions, such centers could more easily maintain awareness of local cultural, legal, and political issues while promoting the sharing of resources and technology. Moreover, the developing technologies (for example, chips) should make it possible to distribute cheaply and widely DNAs amplified from samples, thus also easing both political issues and resource requirements. At a minimum, close consultation between the United States and other countries will be necessary to ensure that common goals are achieved.

A key disadvantage of multiple centers with backup storage of specimens is that the ownership of any individual sample will be transferred to 2 or more sites; ownership of specimens has been identified as an important concern of some groups of potential research participants. Formal procedures for exchange of specimens and frequent updating of collections at multiple sites to keep them in synchrony with one another will make the regional-centers model more expensive than the centralized model.

Regardless of the model chosen for management of specimens, a number of issues will need to be addressed in the planning and execution of a globally coordinated research effort if it includes the determination of a "core" set of markers. Among these are development of standardized protocols for sample preparation, analysis, and storage and of quality-control mechanisms to assess compliance with them. The raw data on allele typing will be the basis for all the scientific conclusions that follow from the human genome variation project, so it is critical that the allele-typing data be accurate.
Because different laboratories might type different populations with the same genetic markers, accurate comparisons between populations depend on the accuracy of allele identification in each. Ultimately, the nomenclature for all polymorphisms will be based on the position of variable sites in the context of the complete DNA sequence of the human genome. Meanwhile, a major effort must be made to provide working definitions, and consultation between participants must be initiated. Markers that are based on differences in electrophoretic mobility might prove especially difficult, and a common set of electrophoresis molecular-weight markers should be used by all participants in the project. A set of standard control DNAs should also be used to establish the ability of any laboratory to identify specific allele sizes accurately, by analogy with proficiency testing in forensic uses of DNA. Standard control samples should also be used by all participants in the identification of simple nucleotide polymorphisms with the methods described above. Allele nomenclature must also be carefully considered.

Other important considerations, regardless of whether core markers are a component of the effort, are the following:

A resource-allocation mechanism to monitor and adjudicate requests for both renewable and nonrenewable research materials.

A review mechanism for determining the scientific and ethical merit of requests for specimens (analogous to an institutional review board).

A mechanism to detect and respond to unauthorized reuse of specimens for research not agreed to by subject populations.

If individually identifiable specimens are collected, a procedure the committee does not advocate, a mechanism for recontact with and reconsent of participating groups and persons if currently unforeseen uses of specimens arise that are beyond the scope of the original informed consent.

Enforcement of ethical protocols, especially the right of persons to withdraw their samples if the samples are personally identifiable.

DATA MANAGEMENT

The data produced by a coordinated human genome variation research effort, regardless of its scope, must be both accurate and internationally accessible to justify the investment in such an effort. The necessary data-management technologies and methods are relatively mature and economical, but the potentially sensitive nature of genetic information on persons and groups and the prospect that the data will be transported via public data networks, such as the Internet, might add requirements for information systems that support human genetic variation data management beyond the functions normally associated with collections of biologic data.

The critical issue for data management is whether a data repository will contain information that can be used to link genetic data to specific individuals. If so, such a repository becomes, in essence, a medical record and must be subject to the standards that are applied to electronic patient records, particularly standards that concern information security and privacy.
Unlike conventional medical records, a person's DNA is more than a component of current health status; it can also convey information about health risks (Annas 1993a), which might affect a person's employability, insurability, and standing in the community.

A coordinated human genetic variation project would share and benefit from technologies developed for the Human Genome Project, including information technologies. Common to the 2 efforts are laboratory methods for data generation, such data items as representations of polymorphisms at particular genetic loci, DNA sequences of various lengths, and such statistical data as allele frequencies. Both are conceived as multi-investigator, geographically dispersed projects in which specimens and data are generated at many sites worldwide. Both need to acquire, store, and communicate primary observations and secondary observations (computed or inferred annotations and conclusions) while maintaining explicit labeling and separation of the 2 types of data. Those similarities will simplify many aspects of the management of both physical and information resources, inasmuch as a coordinated human genetic variation research effort can build on the successes of existing international programs.

Some forms of data errors, such as DNA-sequence errors in nonexpressed regions, are somewhat more likely to be interpreted as useful signals (that is, as genomic variation), because there are few automated error-checking rules that can be applied. As a result, standards for representing the confidence level of data items (for example, unverified, verified by independent assay, and reviewed by human experts) will need to be a component of the data-management design, in a manner analogous to existing molecular-biology and genome-related databases.

Because of the sensitive nature of personally identifiable genetic data, the security of data systems and networks will need to be addressed. Genetic-variation research will occur in an environment of evolving laboratory technologies that, with increasing efficiency and speed, might be capable of uniquely identifying at least some subjects without their consent, on the basis of unique patterns in their genomes combined with ethnodemographic and detailed phenotypic correlates. Procedures to minimize such breaches should be developed.

Similarly, the correlation of ethnodemographic and other anthropologic data with molecular data will require formal and reproducible methods for representing and defining names of population groups, locales, languages, and other anthropologically important entities. In contrast with other genomic databases, geographic coordinates (latitude and longitude) will also be useful components of core human genetic-variation data.

To maximize scientific return on investment, a coordinated human genetic variation research effort will need to make progress on several unresolved issues.
These include requirements related to naming systems, acceptable security, and intellectual property rights.

There are no widely accepted representation standards for machine-interpretable sociodemographic, ethnohistoric, and other anthropologic data. An internally consistent, understandable, and maintainable system for naming the peoples of the world and their self-reported social groups and languages will be at the core of scientific questions of human evolution, migration, and population structure. In this regard, a data repository will need to adopt the convention of a data dictionary that contains the explicit definitions used by the project for named groups; shorthand labels with implicit definitions will not suffice. The project should apply lessons learned in biologic naming systems; specifically, it should expect that the meaning of group names will change and that audit trails for relating prior definitions to current ones will need to be part of the data-repository design.

To the extent that a human genetic variation data repository includes information linkable to specific persons and adopts security measures to safeguard it, the project will be undertaking a social experiment in determining the acceptability of strong security measures, access, and audit controls in a scientific community not attuned to these issues. Most, if not all, scientists consider themselves to be ethical professionals capable of doing the right thing without outside interference and oversight. The imposition and enforcement of security measures common to electronic medical-records systems need not be burdensome, but clearly they will introduce elements of complexity and cost for system designers, administrators, and users alike.

Intellectual-property rights are an additional unresolved issue. In a manner analogous to licensing agreements for patent rights on products derived from human genetic variation (see chapter 5), a decision will need to be made as to whether to hold copyright on the human genetic variation database. Copyright is commonly used to protect the economic value of a published work, but it can also serve as a basis to prosecute claims against misuses of or unauthorized changes in the data. If human genetic variation data are copyrighted by the organization that produces them, the question of allowing investigators or population groups to hold copyright on the use of the data provided by them will also need to be addressed.

FUNCTIONAL REQUIREMENTS FOR HUMAN GENETIC VARIATION DATA

Size of Proposed Data Sets

The scope of any proposed project has implications for the ease with which data can be stored and communicated. For example, if an initial human genetic variation database comprised the aggregated data from 500 people in each of 100 populations, there would be 50,000 unique sample records. If each of those unit records comprised basic ethnodemographic data and 100 genetic markers totaling 2,000 bytes (characters), the resulting data collection would total about 50 megabytes. Databases of that size are well within the data-storage capacity of desktop microcomputers and commonly available laptop or notebook computers.
An expansion of the data by another factor of 10, to 500 megabytes, would still easily fit on a single compact disk (CD), and such media might be an attractive and inexpensive means of distributing the composite data worldwide. Dramatic advances in the speed and economy of DNA sequencing or marker identification could, however, cause data-management concerns to become the rate-limiting step in the progress of the project. If, for example, it became feasible to generate tens or hundreds of megabytes of genetic data per person economically, the submission and distribution of the data worldwide would tax the current capacity of both physical media and public data networks. However, given recent advances in storage capacity and bandwidth, it seems unlikely that these potential limitations would be more than transitory if the same tempo of progress is maintained, or even if progress in data management were to slow somewhat.
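
The sizing estimate in this section can be reproduced with a short calculation. The Python sketch below is illustrative only: the function name and the 650-megabyte CD capacity are assumptions not taken from the text. Record size is left as a parameter because 50,000 records of 2,000 bytes each multiply out to 100 megabytes, whereas the quoted figure of about 50 megabytes corresponds to roughly 1,000 bytes per record.

```python
# Illustrative sizing sketch; the names and the CD capacity are assumptions.

MEGABYTE = 1_000_000      # decimal megabyte, in bytes
CD_CAPACITY_MB = 650      # assumed capacity of one compact disk


def collection_size_mb(people_per_population, populations, bytes_per_record):
    """Return the size in megabytes of a collection of unit sample records."""
    records = people_per_population * populations
    return records * bytes_per_record / MEGABYTE


for record_bytes in (1_000, 2_000):
    base = collection_size_mb(500, 100, record_bytes)
    expanded = base * 10  # the tenfold expansion considered in the text
    fits = expanded <= CD_CAPACITY_MB
    print(f"{record_bytes} bytes/record: {base:.0f} MB, "
          f"x10 = {expanded:.0f} MB, fits on one CD: {fits}")
```

Under either record size the base collection remains small by the standards of desktop storage; only the conclusion about the tenfold expansion fitting on a single CD is sensitive to the record size assumed.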

DATA ACQUISITION METHODS

In general, there are 3 alternative data-management designs for a coordinated human genetic variation research effort, analogous to those for the management of specimens. The first is a global star network, in which participating investigators send samples and data to a single coordinating site. The second involves the establishment of regional hierarchies; multiple coordinating sites would use formalized procedures for sharing samples and maintaining synchronized copies of data. The third is a fully decentralized specimen- and data-management scheme; data and specimens would be maintained locally and be available on demand to qualified investigators. As previously noted, of those designs, the star network might be most efficient but suffers from the risk that the project would be perceived as the province of a single country or a single group of scientists. The fully decentralized model maximizes autonomy but is inherently inefficient and places unpredictable burdens on participating investigators.

The most reasonable scheme for data acquisition involves the recording of core identifying data for each sample locally and submission of both a specimen and accompanying information to one of several participating local, regional, or international coordination sites. The committee strongly recommends the creation of electronic records at the point of sample acquisition. We recognize, however, that this might not always be possible, particularly in the developing nations; in these instances, well-designed, standardized paper forms should be used for initial data capture. Transcription errors would likely be minimized by creation of an electronic record at the point of sample acquisition.
Obviously, the decreasing cost and increasing presence of microcomputers and digital-data networks argue for the creation of alternative pathways of data submission to a shared resource, including magnetic or optical media (floppy disks, recordable compact disks, or digital tape), Internet file transfer protocol (ftp), and submission via interactive forms, such as those available via the World Wide Web.

QUALITY CONTROL AND ANNOTATION

Experience with multi-investigator gene sequencing and mapping projects has shown that multiple investigators working in multiple laboratories will inevitably submit data that need additional format-checking and error-checking for missing or invalid values. Maintaining consistency of naming and adding annotations that depict features of interest generally require a designated facility and a group of persons trained to serve as scientific curators or editors. Some errors might be evident on inspection by knowledgeable reviewers or indicated by rule-based consistency checking with computers, but it will also be necessary to verify the accuracy of submitted data by repeating the laboratory analysis on a fraction of submitted samples. A number of statistical-sampling methods have been developed for quality control of laboratories; they generally rely on independent assay, by 2 or more laboratories, of a fraction of submitted samples or the periodic determination of a reference sample unknown to all participating laboratories. The specific approach chosen is less important than the commitment of the participating investigators and laboratories to adhere to systematically and consistently applied methods of quality control.

Because all data collections contain errors and the minimization of those errors requires resources, acceptable error rates for specific types of data in a repository will need to be defined by participating investigators and biostatisticians. Those error rates will probably vary according to classes of hypotheses and the statistical power needed to make inferences about them. The success and credibility of the work done by multiple participating sites, investigators, and laboratories will depend on proof that reasonable quality-control standards are in place and are implemented.

COMMUNICATION VIA PUBLIC NETWORKS

A coordinated human genetic variation research effort will benefit from the recent emergence and rapid growth of the global Internet, a network of networks that provides a communication path among computers that is widely accessible at academic institutions, businesses, and residences around the world. Although the Internet is rapidly becoming ubiquitous, it is built on technical standards designed to facilitate information exchange and sharing, and it is not optimized for secure transport and exchange of data. For genetic data that cannot be linked to a person, the openness and accessibility of the Internet are desirable characteristics. For genetic data that can be linked to a person, the transport of data over public networks adds security risks that must be recognized and minimized (as discussed below).
Relatively inexpensive technologies for secure communication via the Internet are being developed and tested, and they will be available for data management in a coordinated human genetic variation research initiative if person-linkable data are included as part of the project.

ARCHIVAL STORAGE

It may be reasonably expected that there will be multiple complete copies of the collected human genome variation data in laboratories and data centers around the world. That will confer a resistance to loss of data due to a catastrophe at any site, but it means that the data resource must be designed from the start to accommodate the updating of multiple archival sites in various ways. Prominent among these will be automated, network-based update transactions, received and broadcast by data centers in a manner similar to that used by GenBank and other international genome databases. An alternative approach would be to have a single gold-standard database copy with real-time access by sites in various nations (for example, through a World Wide Web forms interface) for updating and editing from multiple sites. However, the uneven availability of Internet access worldwide and the sometimes-unreliable nature of international telecommunication suggest that such a mechanism would need to be supplemented by a means of data exchange that uses physical media. As noted above, it is not thought that special computer hardware or unusually large data-storage capacity will be needed for the archiving of data from human genetic variation research.

DISTRIBUTION AND ACCESS

Existing genome-related and molecular-biology databases provide a model for distribution of and access to human genome variation data. Periodic distribution of physical media, such as CD-ROMs containing part or all of the data resource, will be an economical dissemination mechanism. Internet utilities, such as ftp, could be used for bulk transfers and database updating among collaborating sites. Online access for individual queries via World Wide Web forms and specialized query programs that provide similar searching and selection of records based on various patterns or attributes of specific marker loci will be valuable to the scientific community. Resources should be provided in the project for continuing software development to ensure that the investment in data acquisition and maintenance is complemented by a corresponding investment in software tools that make access to the data easy for qualified investigators. Enforcement of rigorous access controls will be an issue, as noted above, to the extent that data in a human genetic variation data repository can be linked to specific people.

SECURITY ISSUES RELATED TO HUMAN GENETIC VARIATION DATA

In a human genetic variation data resource in which even submitting investigators cannot identify the samples and records that they have submitted for purposes of linking to individually identifiable information, security risks are minimal.
Where such links can be constructed or discovered, however, security takes on a much more prominent role in human genetic variation information systems design.

The Institute of Medicine has described 3 levels of data about people in its report on electronic medical records (IOM 1991). The first is "nonprivileged," which is least sensitive, not necessarily confidential, and can be accessed by or released to anyone without a subject's informed consent. A human genetic variation data repository, if it does not contain information identifiable with specific individuals, would be in this category. The second is "privileged," which includes illness-related data and, in the context of genetic variation, some types of phenotypic information. In general, in industrialized nations, privileged data are subject to government restrictions on release and informed consent and are distributed on a need-to-know basis. The highest level of restriction is on "deniable" data, which are extremely sensitive and virtually always confidential, such as records of substance abuse, mental health, HIV and AIDS, sexually transmitted diseases, and genetic characteristics with social consequences (for example, employability and insurability). Disclosure of this type of information could result in substantial harm to a person. One conundrum of the attempt to classify information as deniable for international data repositories is that what could result in substantial harm to a person might depend on local cultures and norms.

To the extent that any human genetic variation data resource contains information linkable to identifiable individuals, its design will have to anticipate security threats and types of security risks. Security threats exist for each of the 3 states of electronic information: transmission, processing, and storage. Security risks associated with each of the possible sources of threat (outsiders, negligent authorized users, and malicious authorized users) would need to be evaluated for each of those states. There are 5 basic types of security risks for personally identifiable data (Ford 1994):

Disclosure: loss of confidentiality or privacy
Modification: loss of integrity
Fabrication: loss of authenticity
Repudiation: loss of attribution
Interruption: loss of availability

A credible model for human genetic variation data management will need to address each of those types of risk if the repository contains data that can be linked to specific persons.

SUMMARY AND CONCLUSIONS

The committee believes that at a stage when genotyping technology is evolving rapidly, it would be scientifically inappropriate and premature to designate a common core set of markers that is to be genotyped in all samples.
Given advances in technology, a natural outcome will be that individual investigators will perform large-scale surveys of a large number of markers to generate balanced data sets. In spite of differences among individual investigators in sampling designs due to the different hypotheses being tested, many will use common technologies that can provide uniformity in the types and numbers of markers analyzed.

With currently available laboratory and information technologies, the material-management and data-management aspects of a coordinated human genome variation research effort do not appear to constitute a serious barrier to implementation of the project. There are multiple feasible models for specimen and data management and numerous instances of international cooperation in the creation of shared repositories of biologic tissue and data. The specimens and data to be captured, analyzed, and disseminated by the project have unique aspects, which will require attention and resources, but none of them is intractable. The most important decision about project design will be whether it will acquire specimens and data that can be linked to identifiable persons and thereby need to meet a "clinical" standard for specimen and data security and access control.

Recommendation 4.1: Blood samples collected from human populations should be converted primarily into purified DNA. Standard protocols would allow 10,000-50,000 assays to be carried out on DNA from a single 10-mL blood sample. The number of assays could be increased by a factor of 50-300 by multiplexing and using existing technologies designed for analysis of very small DNA samples. Transformed cell lines provide an essentially inexhaustible supply of DNA and are mandatory for some kinds of studies, but they require much more funding for their creation and maintenance.
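
The arithmetic behind the assay estimates in Recommendation 4.1 can be made explicit. The Python sketch below is a minimal illustration: the function name is hypothetical, and the only inputs are the ranges stated in the recommendation.

```python
# Illustrative arithmetic for Recommendation 4.1; the function name is hypothetical.

def multiplexed_assay_range(base_range, factor_range):
    """Scale a (low, high) assay range by a (low, high) multiplexing factor."""
    base_low, base_high = base_range
    factor_low, factor_high = factor_range
    return base_low * factor_low, base_high * factor_high


standard = (10_000, 50_000)  # assays per 10-mL blood sample, standard protocols
factor = (50, 300)           # stated gain from multiplexing and small-sample methods

low, high = multiplexed_assay_range(standard, factor)
print(f"{low:,} to {high:,} assays per 10-mL sample")
```

Taking the stated ranges at face value, a single 10-mL sample would support between 500,000 and 15,000,000 assays once multiplexing and small-sample technologies are applied.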