Suggested Citation:"3 SAMPLING ISSUES." National Research Council. 1997. Evaluating Human Genetic Diversity. Washington, DC: The National Academies Press. doi: 10.17226/5955.

3
Sampling Issues

The creation of repositories of human DNA samples and corresponding databases on human genetic variation is associated with the following technical issues:

Should sampling be population based?
If so, what population-based sampling strategy should be used?
What human populations should be sampled?
How many of those populations should be sampled, how many people should be sampled from each, and how should the samples be chosen?

Regardless of sampling strategy, the introduction of any sample into a repository raises the following questions:

In addition to the samples themselves, what information should be collected from each person in the survey?
What types of samples should be collected (for example, blood, buccal cells, and hair roots)?
Should transformed cell lines be created?
How should the repositories of DNA or tissue samples be managed?

Once the samples are available for typing, still other questions arise, such as

Which loci should be analyzed, and what types of markers should be examined?
How should databases containing information about the samples be constructed and managed?

Finally, a central question about the use of the samples is:

Should there be a core set of DNA markers that will be scored for all samples in the repository?

This chapter addresses the first 5 of those questions. The remainder are considered in chapter 4.

BASIC SAMPLING STRATEGIES

In most research projects, a scientist will determine the optimal sampling design from a specific narrowly defined set of hypotheses to be tested. That design will determine which people or populations are to be sampled and how large the samples should be. The main difficulty in designing a process to collect samples and develop corresponding databases for general use in research on human genome variation is that the samples and information are intended to serve as a resource for many researchers who are investigating different sets of hypotheses. The ideal sampling design would permit different researchers to extract appropriate subsamples relevant to their specific hypotheses. Although scientists have already identified some hypotheses to be tested with such resources, the samples would also be intended to be used to test others that have yet to be developed.

In general, as the number of populations or people in a sample increases, the ability to test a particular hypothesis (that is, the statistical power) is enhanced. Often, more hypotheses can be tested with larger samples than with smaller ones. For example, a sparse sample from a geographic area often cannot differentiate a single population with gene flow that is restricted by distance from fragmented populations that are separated by barriers that prevent gene flow (Templeton and Georgiadis 1996). As more populations are sampled in an area, it becomes feasible to test such hypotheses (Templeton and others 1995).

Nevertheless, it is impractical to devise a single sampling scheme that would be amenable to testing all possible hypotheses. For example, many sampling designs in medical genetics require case-control sampling (in which a subset of the total sample contains subjects who have a specific disease state and the rest of the subjects are matched to the diseased persons with respect to sex, age, or other variables), multigenerational pedigree data, or full-sib pairs. The specific requirements of such sampling cannot be anticipated until the hypothesis is well defined. Furthermore, use of a general-purpose sample-collection scheme will itself reveal gaps or weaknesses that will need to be corrected by the gathering of more samples. Hence, a sampling scheme must allow many potential hypotheses to be tested, but any recommended sampling scheme will necessarily restrict the types of hypotheses that can be tested with common sample repositories and databases.

Table 1 presents 5 sampling schemes that will be discussed in the following sections. The ones that are most restrictive—that is, permit the fewest hypotheses to be tested—are presented first. All but the first 2 (the most-restrictive alternatives) involve population-based sampling. After the sampling schemes are described, their strengths and weaknesses are compared to determine which one would be most appropriate for a coordinated global effort.

STRATEGIES THAT ARE NOT POPULATION BASED

Strategy I is the simplest sampling scheme because its sole requirement is that the sample be representative of the human species. Thus, the sample should not be derived from a single restricted group of human beings. This scheme would yield a sample that cannot be linked to specific persons, geographic areas, or populations. Each sample is identified simply as being from a human being, and no other information is obtained. This is the least-expensive type of sample to acquire, and its collection minimizes many ethical issues at both the personal and population levels (see chapter 5). Despite its simplicity, it could be used to test some important hypotheses discussed in chapter 2. In particular, it could be used to test hypotheses about human genome evolution, patterns of genetic variation relative to functional types of DNA and location in the genome, and the total amount of genetic variation found in the human species.

The other classes of hypotheses given in chapter 2 would not be amenable to testing with a sample that does not identify individuals, geographic areas, or populations, so this sampling scheme has a narrow breadth of applicability.

Strategy II differs from Strategy I in that it records the geographic location of each sampling point, but the sample cannot be linked to particular persons or populations. The geographic points to be sampled could be chosen either by using a grid method or by sampling geographic areas in proportion to the density of populations in them. All the hypotheses related to genome evolution and patterns of variation that were testable with strategy I are also testable with strategy II, but in addition it is possible to test hypotheses related to the patterns of spatial variation and some hypotheses about the geographic subdivision of humans and patterns of gene flow or migration. Hence, the utility of the sampling design for testing hypotheses has been enhanced.

STRATEGIES THAT ARE POPULATION BASED

Strategy III is the first of the population-based sampling designs given in table 1. It records not only the geographic location of a sample, but also information provided about self-reported ethnicity, primary language, sex, age, and parental birthplaces. All the hypotheses that are testable with strategies I and II can be tested with strategy III, but this strategy broadens the universe of testable hypotheses to those related to population-level relationships and differences measured primarily with data on the frequencies of alleles (alternative forms of genes at the same locus) or haplotypes (particular states of a region of DNA; if the DNA region is a coding region, haplotypes correspond to alleles), and associated and derived statistics.

TABLE 1 Sampling Strategies

	Non-Population-Based Sampling		Population-Based Sampling
	I: Totally Anonymous	II: Geographic Location	III: II + Group Identification Data	IV: III + Individual Phenotypic Data	V: IV + Pedigrees
Testable Hypotheses	Genome evolution; Patterns of variation in the genome; Overall genetic variation in humans	Same as I plus: Description and determination of spatial variation (such as variation of loci in space, migration)	Same as II plus: Patterns of migration, gene flow, and population subdivision; Hypotheses from anthropology, archaeology, history, and linguistics that should affect patterns of interpopulation variation; Preliminary studies on medically relevant loci; Population-level medical associations	Same as III plus: Identify specific loci for possible biomedical applications; Genotype interactions; Within-group variation on medical and phenotypic data; Associations between genes and phenotype at individual level	Same as IV plus: Detailed studies on disease-associated genes
Relative Costs	$	$$	$$$	$$$$	$$$$$

With strategy III, no individual identification or other phenotypic data would be gathered, and hence the sample could not be linked to specific individuals. This restriction is compatible with testing hypotheses about the evolution of the human genome, patterns of variation in the genome, the evolution of the human species, the geographic distribution of human variation, evidence of major migrations, and patterns of gene flow and subdivision. This scheme would also allow the testing of hypotheses arising from anthropology, archaeology, history, and linguistics, such as the timing of migrations or the spread of customs influencing patterns of reproduction, that should have detectable effects on patterns of human population-genetic variation. Also included are hypotheses in genetic epidemiology that can be tested with allele-frequency or haplotype-frequency data on populations—for example, studying HLA population variation to aid in transplant matching (a benefit primarily to minority groups in developed nations), testing for systemic and infectious disease and other phenotypic associations at the population level, measuring linkage-disequilibrium patterns (nonrandom associations between allele or haplotype frequencies in a population at 2 loci or positions in the genome) as an aid to positional cloning, and determining the genomic and geographic distribution of integrated viral sequences.
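The linkage-disequilibrium measure mentioned above can be computed directly from population frequencies. The following is a minimal sketch using the standard coefficient D = p_AB - p_A * p_B and its squared-correlation form r^2; the function and parameter names are illustrative and do not come from the report.

```python
def linkage_disequilibrium(p_ab, p_a, p_b):
    """Return (D, r_squared) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1 and B at locus 2
    p_a:  frequency of allele A at locus 1
    p_b:  frequency of allele B at locus 2
    (Names are hypothetical, chosen for this sketch.)
    """
    for name, value in (("p_ab", p_ab), ("p_a", p_a), ("p_b", p_b)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be a frequency between 0 and 1")
    # D is the excess of the observed haplotype frequency over the value
    # expected if the two loci were statistically independent.
    d = p_ab - p_a * p_b
    denom = p_a * (1.0 - p_a) * p_b * (1.0 - p_b)
    # r^2 is undefined when either locus is monomorphic; report 0.0 in that case.
    r_squared = (d * d) / denom if denom > 0 else 0.0
    return d, r_squared


if __name__ == "__main__":
    print(linkage_disequilibrium(0.25, 0.5, 0.5))  # independent loci
    print(linkage_disequilibrium(0.50, 0.5, 0.5))  # complete association
```

A value of D near zero indicates random association between the two positions; larger absolute values indicate the nonrandom association that positional cloning exploits.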

The hypotheses testable with strategy III could have value for both basic and clinical research. Work on recent human evolution and on patterns of past migrations and gene flow has yielded intriguing glimpses into the origins of anatomically modern humans (Goldstein and others 1995) and the spread of agriculture (Barbujani and others 1995; Weng and Sokal 1995). Such studies have also had a major effect on related fields of research, such as paleoanthropology (Frayer and others 1993), archaeology (Ammerman and Cavalli-Sforza 1984), and linguistics (Chen and others 1995). The recent studies on human genetic variation have generated controversy, largely because the available samples from human populations are geographically inadequate to test the hypotheses at issue (Templeton 1993). Thus, there remains a great need for the type of population-level samples that would result from coordinated global sampling of human genetic variation.

The research mentioned above addresses purely scientific questions, but it has important social and political implications. The identification of a person as a member of a particular ethnic or other social group has proved to be a major factor in the behavior of that person and others: our perception of "who we are" influences how we treat others and react to them. There is still controversy about how good the available genetic data on humans are for studying recent human evolution. However, it is accepted either that the major human "races" originated very recently and have undergone minimal genetic differentiation (Takahata and others 1995) or that they have exchanged genes repeatedly throughout all recent human evolutionary history, so there is only a single evolutionary lineage of humanity (Templeton 1994, 1996). Further sampling will help to determine which of those two alternatives is correct.

Other hypotheses testable with strategy III have biomedical implications at the population level. One subject of clinical relevance is the association between specific genes and human diseases, including multigenic disorders. A classic example is the population-level genetic correlation of malaria with the beta-thalassemia alleles and with the sickle-cell allele at the beta hemoglobin locus. More-recent studies have found associations of resistance to malaria with polymorphisms in the human leukocyte antigen (HLA) system and several other genes. In fact, studies elucidating the genetic factors that lead to resistance or susceptibility to infectious diseases are becoming common, as exemplified by recent work on tuberculosis and schistosomiasis.

A recent example is more enigmatic (Ansari-Lari and others 1997). An uncommon deletion of a portion of the chemokine receptor 5 gene (CCR5) is strongly associated with resistance to HIV-1 infection. In preliminary studies, this variant appears to be peculiar to populations of European ancestry and not to occur in people with ancestors in Africa, where HIV is thought to have originated. Samples from populations around the world would provide a more-definitive explanation of this finding that might have relevance to the treatment of HIV or improve the understanding of the origin and spread of this infection. Another example of the association of a marker with a systemic disease is allelic variation at the apolipoprotein E (ApoE) locus and Alzheimer's dementia. Also, inter-population allele-frequency differences at the ApoE locus are predictive of differences in the incidence of coronary arterial disease in those populations (Lehtimaki and others 1990; Tunstall-Pedoe and others 1994).

In all such research, identifying an association or a linkage disequilibrium between a disease variant and closely linked polymorphic alleles (that is, alleles at a locus that is close to the disease locus on the same chromosome) has been crucial for identifying candidate genes for the more-common disorders that are expected to have susceptible and resistant genotypes (Brice and others 1995; Ghosh 1995). Indeed, it has recently been suggested that polymorphisms in all human genes, once recognized, can be a powerful tool for mapping the genetic components of complex diseases or traits. Studies of polymorphisms can be especially informative if conducted in recently admixed populations (that is, populations that had been genetically isolated but have recently been interbreeding) or in isolated populations, where the recent history of admixture not only induces linkage disequilibrium over large genomic segments but might also increase the incidence of disease (Stephens and others 1994).

The utility of such isolated populations is well illustrated by research on a relatively isolated population at Lake Maracaibo in Venezuela (Gusella and others 1983) that has a high incidence of Huntington disease. Studies of this population played a critical role in the ultimate cloning and identification of the gene responsible for the disease.

As those examples illustrate, population-level data on human genetic variation can have biomedical applications. However, to bring the applications to fruition often requires more-extensive sampling. For example, to identify the gene responsible for Huntington disease, it was necessary to sample the Venezuelan population and assemble extensive pedigree data based on that sampling (Gusella and others 1983). Similarly, to confirm the role of ApoE in coronary arterial disease, longitudinal studies on people were needed (Stengard and others 1995). Such detailed and specific biomedical hypotheses could not be addressed with strategy III. From a biomedical standpoint, a large-scale, coordinated effort with strategy III would provide a resource for initial screening to identify populations that would be most promising for detailed follow-up studies. It would then be the responsibility of individual investigators to design the sampling scheme needed to test their specific hypotheses and to resample the relevant populations.

Strategy IV, the second of the population-based strategies given in table 1, includes biomedically relevant information on individually identifiable phenotypes, particularly disease phenotypes. All the hypotheses mentioned in connection with strategy III could be tested with this scheme, but in addition one could look for genotype-disease associations instead of the much-weaker population-disease associations possible with strategy III. However, even such an enhanced data set would still be limited to disease-association studies and could not address disease causation directly. Hidden or unknown heterogeneity in the populations sampled could easily lead to false conclusions, and additional sampling (often the gathering of pedigree data) would be needed to confirm the results obtained with this strategy.

Those limitations can be avoided by going to a third level of population sampling, strategy V, the sampling of families or pedigreed persons in a population instead of persons of unknown relationship. When pedigree data are gathered with population and phenotypic data, more-definitive phenotypic studies are possible and they have enhanced power to detect markers close enough to disease loci to produce a within-family association even in the absence of a population-level association, as in the case of Huntington chorea in the Lake Maracaibo area of Venezuela (Gusella and others 1983). Moreover, when many closely linked marker loci exhibit heterozygosity, family data often allow the construction of haplotypes with more certainty. Once haplotype data exist, additional and more-powerful techniques for looking at genotype-phenotype associations can be used (Templeton and others 1987). Therefore, this form of sampling would greatly increase the biomedical utility of a human genome sample collection.

WHICH SAMPLING STRATEGY SHOULD BE USED?

The above considerations indicate that repositories of population-based samples would be much more useful than repositories of non-population-based samples for addressing major scientific and medical questions related to human genome variation. Non-population-based sampling has other weaknesses.

Sampling strategy II does not avoid ethical and legal issues. It would evoke many of the ethical considerations (discussed in chapter 5) related to population identification that apply to population-based sampling strategies. That is because many human populations have strong geographic affinities, so in practice the population source of the sample could often be inferred with little effort. In some geographic areas where self-identified human populations of diverse origins are intermixed, such as major metropolitan areas in the United States, the sample could not be linked to a specific population or populations. Thus, one would end up with a mixed sample set, with some samples having easily identified population affinities and others not. That means that the hypotheses discussed in chapter 2 that require population identification would not be consistently amenable to testing with strategy II, even though many of the ethical issues of population identification would be incurred.

Sampling strategy I, being totally unlinkable to specific persons, geographic areas, or populations, would avoid most ethical and legal issues. Existing sample collections are sufficient to test hypotheses for which strategy I would be useful—no new collections would be needed. There have been sufficient studies to show that overall levels and patterns of human genomic diversity show little between-population variation in a quantitative sense (Barbujani and others 1997), so the hypotheses to be tested under sampling strategy I do not require sampling of any new or additional populations. Moreover, similar hypotheses have been tested with Drosophila (Aquadro 1992) using samples that are smaller than some human samples already gathered. Therefore, existing sample collections are large enough to test hypotheses under sampling strategy I; no new collection would be needed.

There are advantages and disadvantages to all 3 population-based sampling strategies. Although strategy III has the least utility for biomedical hypotheses (but great utility as a tool for generating hypotheses for follow-up studies), it has the advantage of circumventing a major source of potential controversy: the inadvertent identification of a participant (see chapter 5). There is no need for individual identification when only population-level hypotheses are being tested. If databases on human genome variation do not contain such information, the possibility of revealing a specific person's identity, either deliberately or through error, will be very small. With strategies IV and V, the data being collected would constitute medical records of the persons sampled and data-management requirements for security and confidentiality would increase substantially. Moreover, there would be a risk of future revelations and adverse effects on persons and groups.

The collection of phenotypic data, as in strategies IV and V, could substantially increase the cost and time required to obtain samples, as well as the cost of data management and quality assurance. Collection of phenotypic data in the field could also greatly increase the complexity of sample collection, thereby reducing the participation of investigators who have limited resources in a coordinated sampling project. For a given amount of money and in a given period, the number of people or populations sampled would probably have to be much lower than what is possible with strategy III. Thus, although there would be a greater ability to test some biomedical hypotheses, the overall value and power of the collection for testing a wide range of hypotheses would decline because samples would be so much smaller.

Another difficulty with adding phenotypic data is the problem of deciding which phenotypes to include. The possible phenotypic measurements of biomedical relevance are virtually limitless, but if the sampling protocol is to be practical in the field, only a small number of phenotypic measurements per person could be made. Limiting the phenotypic collection to a small number of traits obviously would be useful to few researchers. Some measurements could be made on blood samples taken in the field, but many of these require either large amounts of blood or fresh, unfrozen blood, so they would be impractical for a large-scale sampling effort. Given that only a few phenotypes could be measured, there is no logical or fair method of choosing a standard set of phenotypes. Any attempt to choose such a set is likely to incur substantial time, expense, and logistic complexity for the project as a whole and to aid only a few research programs.

When pedigree data are added to the phenotypic data (strategy V), all the issues of cost, time, complexity, quality control, ethics, and so forth are exacerbated. Moreover, as will be discussed later, the best sampling scheme for testing population-level hypotheses is to avoid sampling of close biologic relatives; the gathering of pedigree data would inevitably increase the cost per sample, thereby almost certainly decreasing the overall sample size. That would substantially reduce the utility of the sample and the resulting data resource for addressing many of the questions discussed in chapter 2. Again, it would aid few research programs—only those addressing a very narrow range of genotypic and phenotype issues.

Sampling strategy III offers the best balance of breadth of testable hypotheses, expense, and ethical complications.

Recommendation 3.1: A coordinated global sampling effort to develop a common resource for research on human genome variation should use a population-based sampling design in which the geographic location of the sample and self-reported ethnicity, primary language, sex, age, and parental birthplaces are recorded. The committee notes that the inclusion of parental birthplaces with the other information identified above could, in some instances, inadvertently reveal a particular person's identity.

HOW TO SELECT WHICH HUMAN POPULATIONS TO SAMPLE

Many of the populations to be sampled are likely to be included as a spinoff from a project proposed for other reasons. However, it is important to keep in mind that the sampling will be open-ended and cumulative and that any large-scale project should make it possible to identify populations for future sampling subject to the periodic reassessment of existing samples. In choosing populations to be sampled, it should also be kept in mind that the primary purpose is to collect a sample that reveals the extent of genetic variation in the human species as a whole. Accordingly, everyone in the world should have a finite probability of being sampled; no population or group should be excluded in advance. The sampling scheme must therefore include not only linguistically unique populations, geographically peripheral populations, and so forth, but also human populations that are large, geographically widespread, and ethnically diverse. These large, widespread populations are also critical for testing population-level hypotheses. For example, hypotheses of interest might concern the effect of technologic changes on population structure and gene-flow patterns, HLA frequencies in recently admixed populations, or genetic associations with systemic diseases that affect primarily populations in developed countries. Under-representing such populations would yield a biased sample for studying overall human evolution and restrict the overall biomedical utility of the sample.

CONSIDERATIONS IN CHOOSING SUBJECT POPULATIONS

Within-population sampling strategies for a coordinated, global sampling effort will have to take into account the unique features of the populations to be sampled; no universal within-population scheme is possible. Nevertheless, a few guiding principles should be followed. Except for the pedigree sampling scheme (strategy V), population-level hypotheses in which the primary data type is allele or haplotype frequency are most efficiently tested when the persons sampled constitute, as far as possible, a random sample from the population that they are intended to represent. For most populations, sampling should be done so as to avoid first- and second-degree relatives unless pedigree data are to be obtained. The strategy for achieving that objective would vary from population to population. For large populations in developed nations, the sampling scheme should be stratified on the basis of geography and ethnicity; larger overall samples might be required in such populations because of the stratification. In other cases, one might wish to sample by village, clan, or other entity. For relatively uniformly distributed populations, a grid approach would be appropriate. In all cases, the guiding principle should be to obtain an adequate representation of the total population and to have the persons sampled be as unrelated (biologically) to one another as possible.
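One way to picture the stratified scheme described above is the following rough sketch. It assumes each candidate is described by a record with hypothetical "region", "ethnicity", and "family" fields, and it approximates "avoid close relatives" by taking at most one person per family identifier. Neither the field names nor that simplification comes from the report, and a real protocol would need an explicit rule for first- and second-degree relatives.

```python
import random
from collections import defaultdict


def stratified_unrelated_sample(people, per_stratum, seed=None):
    """Draw up to `per_stratum` people at random from each (region, ethnicity)
    stratum, never taking two people with the same family identifier.

    `people` is an iterable of dicts with the hypothetical keys
    'region', 'ethnicity', and 'family'.
    """
    rng = random.Random(seed)

    # Group candidates into strata defined by geography and self-reported ethnicity.
    strata = defaultdict(list)
    for person in people:
        strata[(person["region"], person["ethnicity"])].append(person)

    used_families = set()
    sample = []
    for key in sorted(strata):
        candidates = list(strata[key])
        rng.shuffle(candidates)  # random order within the stratum
        picked = 0
        for person in candidates:
            if picked == per_stratum:
                break
            if person["family"] in used_families:
                continue  # skip relatives of someone already sampled
            used_families.add(person["family"])
            sample.append(person)
            picked += 1
    return sample
```

Sampling by village or clan, or on a geographic grid, would change only how the strata are defined; the random draw and the exclusion of relatives would stay the same.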

NUMBER OF POPULATIONS TO SAMPLE, AND NUMBER OF PEOPLE TO SAMPLE IN A POPULATION

The number of populations sampled is often controlled by matters of convenience and opportunity. However, the number of populations needed should be considered if current collections are to be improved. This section also considers the needs for the total number of people to be sampled in populations, even though it is recognized that samples in many populations might of necessity be very small and that technical, psychologic, or other difficulties might keep people from cooperating with a sampling project.

As mentioned before, the sampling is open-ended; the number of populations to be sampled will probably grow once a coordinated study of human genetic variation is initiated. Current resources are quite limited. For example, Cavalli-Sforza and others (1994) recently compiled the most-complete set of data on variation in the human genome. Because of noncomparable sampling, different loci being investigated, and so forth, they could assemble data on only 42 populations for 43 loci (24% of the cells in the data matrix are still missing) as the current sample to represent our best estimate of human nuclear genome variation.

For other kinds of genetic variation, the situation is worse. For example, although studies on human mitochondrial DNA have attracted much attention as a tool for exploring recent human evolution, testing with rigorous statistical criteria even fundamental hypotheses regarding whether mitochondrial variation spread around the globe through recurrent gene flow or population replacement is not possible, because few populations have been sampled in a geographically accurate manner (Templeton 1993). The only mitochondrial-DNA data set that even barely satisfied the sampling requirements of recent statistical tests designed to discriminate between gene flow and population replacement is that assembled by Excoffier (1990), which includes only 18 human populations. Consequently, even obtaining samples on 100 human populations would greatly augment our ability to test hypotheses about human evolution, population structure, and genome evolution, as well as disease associations at the population level. As the number of populations increases beyond 100, additional hypotheses could be tested (for example, discriminating between isolation by distance and population fragmentation due to gene-flow barriers).

Once the populations have been chosen, it is critical to have large-enough sample sizes within the populations. As previously stated, the size of the sample needed in any given instance is determined largely by the hypothesis. For example, if the investigator is seeking to test the historical genealogy of a particular population, a sample as small as 50 could be quite useful, but it would be inadequate for characterizing genetic variability at a particular locus. In the past, human samples have been used to look for linkage disequilibrium among markers, which often is valuable in biomedical studies. However, obtaining statistically accurate estimates of linkage disequilibria often requires large samples (at least several hundred). Another potential use of these DNA samples would be to examine different human populations for various genetic-disease alleles. Such alleles are generally rare (almost always appearing in less than 1 in 500 people, that is, an allele frequency of less than 1 in 1,000). The probability of having a rare allele in a sample is roughly proportional to both the frequency of the allele and the sample size, so the lowest allele frequency likely to be represented in a sample is roughly inversely proportional to sample size. For example, there is a 95% chance of an X-linked genetic-disease allele being in a sample of 250 males if its allele frequency is 0.012 in the population, but if the sample size is doubled to 500, there is a 95% chance of inclusion of an allele with a frequency as low as 0.006 in the population. Table 2 presents other sample sizes and their associated allele frequencies for a 95% chance of inclusion.
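The 95% inclusion figures quoted above follow from a short calculation. As a minimal sketch, assume that the gene copies in a sample are independent draws from the population (one copy per male for an X-linked locus, two copies per person for an autosomal locus). The smallest allele frequency p with a 95% chance of appearing at least once among m copies then satisfies 1 - (1 - p)^m = 0.95. The function name and parameters are illustrative and are not taken from the report.

```python
def min_allele_frequency(n_people, copies_per_person=2, inclusion_prob=0.95):
    """Smallest allele frequency with the given chance of appearing at least once.

    Assumes the sampled gene copies are independent draws from the population
    (an assumption of this sketch, not a statement of the report's method).
    """
    if n_people <= 0 or copies_per_person <= 0:
        raise ValueError("the sample must contain at least one gene copy")
    if not 0.0 < inclusion_prob < 1.0:
        raise ValueError("inclusion_prob must lie strictly between 0 and 1")
    n_copies = n_people * copies_per_person
    # Solve 1 - (1 - p) ** n_copies = inclusion_prob for p.
    return 1.0 - (1.0 - inclusion_prob) ** (1.0 / n_copies)


if __name__ == "__main__":
    # X-linked locus sampled in males: one copy per person.
    for n in (250, 500):
        print(n, round(min_allele_frequency(n, copies_per_person=1), 3))
    # Autosomal locus: two copies per person.
    for n in (50, 100, 200, 500, 1000):
        print(n, round(min_allele_frequency(n), 3))
```

Under these assumptions the sketch gives values close to the X-linked figures in the text and to the autosomal column of table 2.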

Another important potential biomedical application would be to examine heterogeneity in systemic diseases related to common alleles. The power for detecting statistically significant allele or haplotype frequency differences, which will often be small, also increases with sample size, as shown in table 2. Because human populations show so little overall genetic differentiation, large samples will be needed to perform such studies. For example, the e4 allele at the Apo-E locus on chromosome 19 has been shown to have a large and significant effect on the chance of death from coronary arterial disease, the largest cause of death in the developed countries (Stengard and others 1995). When published data on the frequency of the e4 allele in various countries were coupled with published incidences of death from coronary arterial disease in males by country, the regression

TABLE 2 Sampling Properties for Detecting Rare Alleles and Discriminating Between Allele-Frequency Differences in 2 Populations

Sample Size	Minimal Allele Frequency (p) of an Autosomal Locus with 95% Chance of Being in Sample	Minimal Allele Frequency (p > 0.1) in 1 Population Required for Discrimination from Second Population^a
50	0.030	0.274
100	0.015	0.216
200	0.007	0.179
500	0.003	0.148
1,000	0.001	0.133
a These minimal allele frequencies ensure a 90% chance of being significantly different at the 5% level.

of death rate on allele frequency was positive and highly significant, explaining some 57% of the variance (Stengard and others 1997). That suggests that this allele increases the rate of death from coronary arterial disease in all populations, a potentially important biomedical conclusion. However, the total range of e4 allele frequencies is narrow, 0.1-0.2 in most populations (Gerdes and others 1992). To detect significant population differentiation and clinical effects over such a small frequency range requires large samples in each population. Given the overall genetic similarity of most human populations, the situation at the Apo-E locus is not likely to be exceptional.
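To see how sample size limits the detection of such modest frequency differences, the following sketch applies the standard normal-approximation power formula for comparing two proportions. It is an illustration only, not necessarily the method used to construct table 2. It assumes that the second population has an allele frequency of exactly 0.1, that the same number of persons is sampled from each population, that each person contributes two independent chromosomes, and that the test is two-sided at the 5% level with 90% power. All function and parameter names are hypothetical.

```python
from statistics import NormalDist


def power_shortfall(p1, p2, n_chrom, alpha, power):
    """Positive when p1 and p2 can be told apart with the requested power."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2.0
    se_null = (2.0 * p_bar * (1.0 - p_bar) / n_chrom) ** 0.5
    se_alt = ((p1 * (1.0 - p1) + p2 * (1.0 - p2)) / n_chrom) ** 0.5
    return (p1 - p2) - (z_alpha * se_null + z_beta * se_alt)


def min_distinguishable_frequency(persons, p2=0.1, alpha=0.05, power=0.90):
    """Smallest p1 > p2 detectable with the given power (found by bisection)."""
    n_chrom = 2 * persons  # assumption: autosomal locus, two copies per person
    lo, hi = p2, 0.999
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if power_shortfall(mid, p2, n_chrom, alpha, power) < 0:
            lo = mid
        else:
            hi = mid
    return hi


if __name__ == "__main__":
    for persons in (50, 100, 200, 500, 1000):
        print(persons, round(min_distinguishable_frequency(persons), 3))
```

Under these assumptions the results fall close to the third column of table 2. With 200 persons per population, for example, a frequency of 0.1 can be reliably distinguished only from frequencies of about 0.18 or higher, which covers most of the 0.1-0.2 range reported for the e4 allele.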

Both rare and common alleles that predispose to systemic disease illustrate that the larger the within-population sample, the more useful the sample collection will be and the broader the variety of researchers to whom it will be useful. Much of the expense of obtaining samples is related to the logistics of going to where a population lives, so doubling sample size from 250 to 500, or even more, would often involve only a modest increase in sampling expense and effort.

Recommendation 3.2: For any given population, samples of a few hundred to several hundred persons, or even more, should be obtained whenever possible. In larger populations when the investigator deems stratified sampling to be necessary, larger overall samples would be desirable.

SUMMARY AND CONCLUSIONS

Of the various sampling strategies discussed and summarized in table 1, population-based sampling strategy III, in which only basic group-identification data are gathered, is preferred over the other strategies since neither the data nor the specimens can be linked to specific individuals. Strategy I does not provide a rationale for global sampling, and strategy II has many of the same ethical complications as strategy III but with a substantial restriction in breadth of testable hypotheses. Strategies IV and V could greatly increase the cost, complicate sampling logistics, raise serious ethical and security concerns, and benefit only a few investigators (although the investigations that would be so benefited have the most-direct biomedical relevance). Strategy III offers the best balance of breadth of testable hypotheses, expense, and ethical complications.
