

Suggested Citation:"11 Comments on the Utility of Social Science Surveys for the Discovery and Validation of Genes Influencing Complex Traits--Harald H.H. Göring." National Research Council. 2008. Biosocial Surveys. Washington, DC: The National Academies Press. doi: 10.17226/11939.



11 Comments on the Utility of Social Science Surveys for the Discovery and Validation of Genes Influencing Complex Traits

Harald H.H. Göring

The debate about the relative importance of “nature” and “nurture” in determining behavior and emotional and physical well-being has persisted to this day and is conducted among scientists, educators, parents, politicians, philosophers, and many other members of the society. In this phrasing, “nature” is used to represent those forces shaping our existence over which we have virtually no control (as of yet), namely our genetic constitution. “Nurture” is used as a short form to suggest those influences on our life that we (think we) can shape to at least some degree, namely our environment. We are born with our genes, but we have some control over our exposure to certain aspects of the environment.

It is clear that virtually all traits are influenced by both genetic and environmental factors. This applies to the rare and often serious diseases for which a genetic defect is necessary and sufficient to bring about disease, but whose manifestation is nonetheless influenced by various other factors (that we often do not know) (Wexler et al., 2004). This likewise holds for infectious diseases, which are often viewed from the perspective of environmental exposure to the pathogen alone. As we now know, the genetic constitution of the human host greatly influences the risk of exposure, infection and disease, as well as its course and severity (Allison, 1954, 1961; Kulkarni et al., 2003).

The pendulum of prevailing social views keeps swinging back and forth between the two extremes. At the present time, the emphasis is clearly on the importance of genetic factors. The popular press contains daily reports of the discovery of yet another gene influencing yet another

disorder, and predictions that scientists will unravel the genetic mysteries of most conditions in only a few more years abound, often coupled with enormous promises for the prevention and cure of disease in the near future. Under these circumstances, it is not surprising that many individuals have a very deterministic perception of the action of genes and think that there is a gene for every condition, with the condition being fully and accurately determined by this gene, independently of anything else. Geneticists are not blameless for this situation, as they often do not correct such views, unintentionally promote them by using sloppy terminology consistent with such opinions, or even intentionally further them by making exaggerated claims about the future impact of their area of research, perhaps in an effort to improve funding. It is in this environment that many researchers in other fields have begun thinking about whether they should and can incorporate gene discovery into their own studies.

In this chapter, I comment on the utility of large-scale social science surveys for the discovery and validation of genes influencing conditions of interest to social scientists. I start with a brief overview of the nature of so-called complex traits and highlight some of the concepts behind study designs that are being used for the identification of genes. I attempt to contrast the traits for which gene-mapping studies have succeeded and the designs of gene discovery experiments with social science surveys, with a focus on the suitability of the latter for gene identification. I close with a few remarks on how such surveys may be useful for gene discovery and validation from my perspective.

ETIOLOGICAL ARCHITECTURE OF COMPLEX TRAITS

There is no accepted definition of what constitutes a so-called complex or multifactorial trait.
The term is generally used to denote the opposite of a so-called Mendelian trait, in which a defect in a gene by itself can cause a specific phenotype (the focus is often on a disease). In contrast, the relationship between genotype and phenotype is not as deterministic in complex traits, for which individual genetic variants merely modulate the probability of presenting a particular phenotype.

For many traits, we have absolutely no idea about the identity of environmental factors and genes whose variants account for some of the variability in the phenotype in the population, and the designation of a trait as complex simply acknowledges the belief (based on common sense, failed gene mapping attempts, analogies with similar traits about which we have a better understanding, or evolutionary considerations) that a multitude of genetic and environmental factors must influence the phenotype. It may well turn out that a trait is not as complicated as first assumed, such as when gene mapping studies readily succeed in pinpointing the

location of genes of substantial effect. The distinction between Mendelian and complex traits is by no means black and white. The terms merely refer to the two extremes, with most traits falling somewhere in the middle.

Figure 11-1 provides a schematic view of the etiological architecture of a prototypical complex, multifactorial trait. The phenotype of an individual depends on a number of genetic and environmental factors that in concert determine the phenotypic outcome. The individual components of the etiological spectrum modulate the probability of manifesting a dichotomous phenotype, such as a disease, or they may alter the expected value of a quantitative trait. Not all components are equally important.

[Figure 11-1 diagram labels: genotype at major gene 1; penetrance; detectance; genotypes at other major genes; genotypes at polygenes; individual environment; family environment; cultural factors; phenotype.]

FIGURE 11-1 Schematic of the etiological architecture of a prototypical complex trait.
NOTE: The key goal when designing a gene-mapping study is to reduce the etiological complexity as much as possible. In the ideal case, all influences on the trait of interest are eliminated except for the gene that is to be localized and identified. In this situation, the phenotypes of study participants are good predictors of the genotypes at the underlying gene, facilitating its discovery.

Among the genetic factors, one generally distinguishes between major genes and polygenes. Major genes have a substantial influence on the phenotype and may be individually identifiable with gene-mapping approaches. Polygenes may have a sizable effect in the aggregate, but their individual influence is small, making them impossible to identify using statistical genetic methods. The various etiological components may act independently from one another, or there may be interaction, such as between different genetic factors (gene-gene interaction), between specific genes and aspects of the environment (gene-environment interaction), or among different environmental factors (environment-environment interaction). The term “interaction” is widely used in scientific discourse, but the word is often misused and misunderstood because it has different meanings in everyday language and in statistics (Cox, 1984). The fact that each of two factors influences a trait does not imply the existence of interaction between them. A well-known and often replicated, though not uncontroversial, example of gene-environment interaction in complex trait etiology is the moderating effect of a variant of the 5-HTT gene on the impact of stressful life events on depression (Caspi et al., 2003).

While genetic factors are shared among related individuals in predictable ways, as a consequence of the simple rules of inheritance (in which an individual receives one complete set of chromosomes from each parent, with random selection of which copy of a given chromosome in a parent is passed on), no such rules for sharing of environmental factors among family members exist. Some exposures are shared among members of a household (household effect), cultural factors characteristic of a society may be shared among the members of the entire population, while other environmental factors are not shared within families but are unique to an individual.
Another important distinction between genetic and environmental factors is that, while the rules of inheritance impose a correlation structure on the genetic factors, such that virtually all of them can be statistically assessed for having a potential influence on the trait of interest (see more on this point below), no such correlation structure exists for the environmental factors, such that it is not possible to investigate all components of the environment. The investigator thus has to decide beforehand which aspects of the environment are to be measured, and for many complex traits knowledge of which environmental factors may be relevant is inadequate. From a gene-mapping perspective, all genetic and environmental factors except for the one gene to be located in the genome (denoted as major gene 1 in Figure 11-1) represent noise. The greater this noise is, the more difficult it is to map and identify any particular gene.
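The statistical meaning of interaction noted earlier in this section can be made concrete with a difference-of-differences contrast on a 2x2 table of mean trait values. The sketch below is only an illustration on the additive scale: the function name and all trait means are hypothetical and are not taken from any study cited in this chapter.

```python
def interaction_contrast(mean_trait):
    """Difference of differences on the additive scale.

    mean_trait[(g, e)] is the mean trait value for genotype g (0 or 1)
    and environmental exposure e (0 or 1). All values are hypothetical.
    """
    genetic_effect_unexposed = mean_trait[(1, 0)] - mean_trait[(0, 0)]
    genetic_effect_exposed = mean_trait[(1, 1)] - mean_trait[(0, 1)]
    return genetic_effect_exposed - genetic_effect_unexposed


# Both factors influence the trait, but their effects simply add up:
# the genotype shifts the mean by 2 units whether or not exposure is present.
additive = {(0, 0): 10.0, (1, 0): 12.0, (0, 1): 13.0, (1, 1): 15.0}

# The genotype matters only in the presence of the exposure (a moderating effect).
interacting = {(0, 0): 10.0, (1, 0): 10.0, (0, 1): 13.0, (1, 1): 18.0}

print(interaction_contrast(additive))      # no interaction on this scale
print(interaction_contrast(interacting))   # nonzero contrast: interaction
```

A contrast of zero in the first table shows that two factors can each influence a trait without interacting. Whether a contrast is zero can depend on the measurement scale, which is one reason the term is so easily misunderstood.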

STUDY DESIGN CONSIDERATIONS FOR COMPLEX TRAIT GENE IDENTIFICATION

The description of the contrast between Mendelian and complex, multifactorial traits above highlighted the different degree of determinism in the relationship of genotype and phenotype. The probability of displaying a specific phenotype given an underlying genotype, or P(phenotype|genotype), is referred to as the penetrance. For Mendelian diseases, P(phenotype|genotype) ≈ 1, that is, a defect in one or several genes will cause disease, and the disease will be absent otherwise. For complex diseases, the relationship is not as deterministic. In the extreme case, P(phenotype|genotype) ≈ P(phenotype), that is, alleles of a gene have virtually no influence on the phenotype.

Penetrances describe the unidirectional biological relationship between genes and traits: Genotype determines phenotype, but the reverse generally does not hold. As biological quantities, however, penetrances are not under the control of the investigator, although they do depend on the genetic and environmental background. The reverse relationship, often referred to as detectance, measures how well the observed phenotype predicts the underlying trait locus genotype, that is, P(genotype|phenotype). This quantity varies greatly with the design of a study. The power of a gene-mapping study is a function of the detectance (Weiss and Terwilliger, 2000), and a key goal of designing a powerful study is to select a design in which the observed phenotype predicts the underlying trait locus genotype as accurately as possible, that is, P(genotype|phenotype) ≈ 1.

As a demonstration of the importance of study design on the power for gene localization, we previously computed the sample sizes to detect genes for retinitis pigmentosa (RP) under several different popular study designs based on different family sampling units (Terwilliger and Goring, 2000).
RP is a serious eye disease (Bhatti, 2006) that is often considered an example of a Mendelian disorder. However, the disease is by no means monogenic, and there is considerable variation in the mode of inheritance (Rivolta et al., 2002). In some families, the disease is found in multiple generations, consistent with dominant inheritance. In other families, RP is observed in only a single sibship or in offspring of consanguineous “marriages” (i.e., the parents are related to each other), which is characteristic of recessive inheritance. Besides autosomal segregation, both dominant and recessive forms of sex-linked segregation are also found.

What the different types of RP have in common is a high penetrance, that is, defects in various different genes cause the disease with high certainty. However, the detectance is fairly low because of substantial locus heterogeneity (i.e., defects in many different genes can cause the disease).
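The coexistence of high penetrance and low detectance follows from Bayes' theorem, and a small calculation may help make it tangible. The sketch below is a toy model, not the computation reported in Terwilliger and Goring (2000): it assumes that risk genotypes at different loci occur independently, that each carried risk genotype causes disease with the same penetrance, and it uses made-up carrier frequencies. The function names `p_affected` and `detectance` are likewise hypothetical.

```python
def p_affected(carrier_freqs, penetrance, phenocopy, known_carrier=None):
    """P(affected) in a toy model with independent disease loci.

    carrier_freqs[k] is the (hypothetical) population frequency of the risk
    genotype at locus k. If known_carrier is given, the individual is assumed
    to carry the risk genotype at that locus with certainty.
    """
    p_unaffected = 1.0 - phenocopy
    for k, q in enumerate(carrier_freqs):
        p_carrier = 1.0 if k == known_carrier else q
        p_unaffected *= 1.0 - penetrance * p_carrier
    return 1.0 - p_unaffected


def detectance(carrier_freqs, locus, penetrance=1.0, phenocopy=0.0):
    """P(risk genotype at `locus` | affected), obtained via Bayes' theorem."""
    p_aff_given_carrier = p_affected(carrier_freqs, penetrance, phenocopy,
                                     known_carrier=locus)
    p_aff = p_affected(carrier_freqs, penetrance, phenocopy)
    return p_aff_given_carrier * carrier_freqs[locus] / p_aff


# Fully penetrant disease caused by a single locus: the phenotype
# identifies the genotype almost perfectly.
one_locus = detectance([1e-5], locus=0)

# Same penetrance, but 50 equally rare loci can each cause the disease:
# an isolated case says little about any one locus.
fifty_loci = detectance([1e-5] * 50, locus=0)

print(one_locus, fifty_loci)
```

Under these assumptions the detectance is close to 1 with a single locus and close to 1/50 with fifty interchangeable loci, even though the penetrance is 1 in both cases.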

In a single affected individual, in particular if the case is viewed in isolation, it is not clear which gene is responsible, and thus P(genotype at gene X|phenotype) is small.

On the basis of the information provided in Heckenlively and Daiger (1996) and on several simplifying assumptions, we computed the sample size to detect significant evidence for an RP locus using linkage analysis on affected sib pairs (Penrose, 1935) (assuming that one does not attempt to stratify them by the mode of inheritance, which is generally not possible for complex traits). The required sample sizes run on the order of many hundreds to several thousands for recessive loci and tens of thousands for dominant loci. We also computed the sample sizes to detect the gene that is the most common cause of RP (rhodopsin) with trios consisting of parents and an affected offspring. This sampling scheme is popular, and such data are often analyzed using the transmission disequilibrium test (TDT) (Spielman et al., 1993). In this case, the sample sizes range in the tens to the hundreds of thousands.

How is it possible that the sample size requirements for a relatively simple trait such as RP are this high for these two commonly used sampling schemes? The reason is found in the enormous locus and allelic heterogeneity, such that the relationship of disease to causal genes and variants is one to many. As a result, the disease status is a poor predictor of the underlying genotype at any of the many possible RP loci. By collecting extended families segregating the disease, the difficulties posed by the substantial heterogeneity have easily been overcome. By subgrouping these families based on the observed mode of inheritance, or by focusing on a single family of sufficient size, the genetic heterogeneity is greatly reduced, which permitted many different loci to be mapped (Rivolta et al., 2002).
Within individual families, it can often be assumed (given the rarity of the disease) that all affected individuals harbor the same underlying genetic defect, that is, there is a one-to-one relationship of disease and causal gene. Hence, the detectance within a family is very high, and linkage analysis should succeed in gene localization.

Hereditary deafness is another good example of substantial locus heterogeneity (Goldfarb and Avraham, 2002). These examples are given here to highlight the substantial impact that study design choices can have. RP and nonsyndromic hearing loss may be extreme examples of locus heterogeneity, perhaps due to the complexity of these sensory organs, but the brain is certainly much more complicated, suggesting that substantial obstacles may lie ahead for gene discovery for predominantly brain-related phenotypes.

Many different strategies can be pursued to increase the power of a gene-mapping study. These strategies are generally not mutually exclusive but can be combined with one another, potentially to great advantage. The following briefly describes some of the available study design choices, yet the list is by no means exhaustive. There is considerable discussion about the relative merits of the various approaches in the scientific community. No approach is accepted as being the best universally. In fact, the choice of study design should be based on the nature of the trait being investigated. The debate about the pros and cons of various approaches is often phrased in terms of statistical methods and software implementations. However, in such a discussion it is important to recognize the critical assumptions that underlie these statistical approaches and computer programs. The chosen study design largely determines the analytical tools to be used in the analysis. And a study should certainly be designed based on fundamental considerations, rather than the ease of analysis with a particular software package.

Population Isolates

One commonly used approach is to focus on an isolated population (Wright et al., 1999). The idea is that, within such a population, the etiological complexity is likely to be substantially reduced, with regard to both genetic variants and environmental factors. A population that has received much attention for gene mapping is Finland (Peltonen et al., 1999), and many studies have also been conducted in Iceland, on French Canadians from the province of Quebec, Mormons from Utah, and the Amish and the Mennonite communities in the United States, among others. The genetic etiology may be simpler in such isolates because of a small founding population, such that the genetic material of the entire population goes back to a fairly small number of founder chromosomes, thereby limiting the amount of allelic variation, at least if admixture with other peoples has been absent or minimal since the population was initially established.
For the search for genes influencing rare diseases, these populations have proven to be extremely valuable (Peltonen et al., 1999). For the analysis of complex traits, however, it is not clear whether the founding populations were of sufficiently small size to simplify the allelic architecture to such a degree to make it tractable for genetic dissection (Hovatta et al., 1998, 1999). However, it is clear that these populations should be better suited for gene mapping than mixed populations with many different ethnicities and cultural practices (as are common in many U.S. cities). In addition, some population isolates have very large families, unusual family structures (such as consanguineous parental relationships leading to inbreeding), low rates of nonpaternity, excellent genealogical records, good health care infrastructure, willing study participation, and other features that can be a boon to genetic studies.

Extended Pedigrees

Another strategy is to collect extended pedigrees. This should reduce the genetic complexity because the number of independent founder chromosomes among related individuals is much smaller than the number of independent chromosomes in a set of randomly picked, unrelated individuals. Each founder individual contributes up to two different alleles at a given polymorphism, but, in the absence of mutation, descendants do not further increase the number of different alleles present in a sample, as they merely inherit copies of the founder alleles. Hence, in a case-control study of unrelated individuals, there would be up to twice as many independent allelic forms of each chromosome as there are study participants. This ratio is reduced to 4:3 for parent-offspring trios and 4:(2 + n) for nuclear families consisting of two parents and n siblings. In multigenerational families, the ratio depends on the width of the pedigree (i.e., the sibship sizes) and the depth of the pedigree (i.e., the number of generations). In addition, family members tend to be exposed to more similar environments than unrelated individuals living separately from one another, limiting the environmental complexity.

The reasoning is in direct analogy to the situation of isolated populations whose members tend to be more similar to each other genetically and in environmental exposure than the members of other societies. Furthermore, if genotype data are available across multiple generations, then the segregation of chromosomal segments can be inferred more accurately. For these reasons, extended families tend to have more power to detect linkage per individual than smaller sampling units (Williams and Blangero, 1999; Blangero et al., 2003). And, if families are collected because they segregate some specific trait of interest, then within a family all members showing the trait may have it for the same genetic reason, as described above for RP.
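The ratios of independent founder alleles to study participants quoted in this section can be reproduced with a short calculation. The sketch below assumes that every pedigree member (founders and descendants) is sampled and that no new mutations arise; the function name is hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def independent_alleles_per_person(n_founders, n_descendants):
    """Upper bound on independent alleles at one locus per sampled individual.

    Each founder contributes up to two distinct alleles; descendants only
    inherit copies of founder alleles (no mutation is assumed).
    """
    n_sampled = n_founders + n_descendants
    return Fraction(2 * n_founders, n_sampled)


# One unrelated individual is its own founder: ratio 2:1.
unrelated = independent_alleles_per_person(1, 0)

# Parent-offspring trio: two founders, one child, ratio 4:3.
trio = independent_alleles_per_person(2, 1)

# Nuclear family with two parents and n = 4 siblings: ratio 4:(2 + n).
nuclear_family = independent_alleles_per_person(2, 4)

print(unrelated, trio, nuclear_family)
```

Using `Fraction` keeps the ratios exact, so the output can be compared directly with the 4:3 and 4:(2 + n) figures in the text.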
Ascertainment on Phenotype

Especially in the analysis of a rare condition, it is necessary to enrich the sample for the presence of the condition of interest. A “randomly ascertained” sample (i.e., a sample that is collected independently of the trait of interest) would contain few, if any, individuals with the condition. If there is no phenotypic variation in a sample, then genetic factors that may cause variation certainly cannot be identified in that sample.

It may also be advantageous to collect families that are densely loaded with affected individuals. The presence of multiple cases in a family generally increases the probability that genetic factors are a major determinant when compared with singleton individuals with the condition

(Terwilliger et al., 2002). Furthermore, the affected individuals within a family may well share the same genetic risk factors, as was argued above in the example of RP.

Phenotypic Subtypes

Another consideration may be to focus on subtypes of the trait of interest. For example, there may be considerable variation in the symptoms of individuals afflicted with a particular disorder, including age of onset, severity, or combination of symptoms. Rather than lumping all cases together, it may be advantageous to focus on a particular subclass, even if this reduces the available sample size. For example, in the analysis of Alzheimer disease, studies focusing on early-onset forms of the disease have been much more successful in localization of susceptibility genes (Tilley et al., 1998). In the case of RP, grouping families by detailed symptoms and also the apparent mode of inheritance certainly makes a lot of sense.

Eliminating Effects of Known Risk Factors

Another consideration is to reduce the importance of known risk factors, which represent noise when attempting to localize new genes. In some cases, this can be done as part of the ascertainment scheme. For example, when attempting to identify genetic factors for lung cancer, one may focus on individuals or families with lung cancer despite the absence of smoking. Such individuals may be contrasted with lung cancer-free individuals despite heavy smoking. Alternatively, one may obtain information on exposure to known risk factors and account for them by statistical means, such as by adjusting a quantitative trait for the effects of, say, sex, age, and other variables of known importance. This approach can also be used to account for previously identified genetic risk factors (Blangero et al., 1999; Martin et al., 2001).

Biomarkers

One may focus gene discovery efforts on intermediate phenotypes rather than the trait of ultimate interest (such as clinical disease end points).
These intermediate phenotypes are often referred to as biomarkers or endophenotypes. As shown in Figure 11-2, the conceptual basis of this approach is the concern that the trait of interest may be too far removed from the action of individual genes to make it a useful phenotype for gene mapping. However, the trait may be influenced by intermediate phenotypes that have a tractable genetic basis (Blangero et al., 2003).

[Figure 11-2 diagram: three layers labeled Traits, Biomarkers, and Genes.]

FIGURE 11-2 Utility of biomarkers for gene mapping.
NOTE: The trait of interest may be too far removed from the action of individual genes to allow gene localization and identification. However, the influence of genes on the trait may be mediated through measurable biomarkers (also called endophenotypes) that are much simpler in genetic etiology and thus permit gene discovery. Note that one gene may influence different traits and that different genes may influence the same trait. In this figure, the average biomarker is influenced by 1.8 genes (with a range of 1 to 3), while the average trait is influenced by 3.2 genes (with a range of 2 to 6).

For such intermediate phenotypes to be useful biomarkers, they must be (genetically) correlated with the trait of interest. In addition, they must be “upstream” of the trait of interest rather than “downstream”, that is, the endophenotype must influence the trait rather than the other way around.

A problem for genetic research of behaviors, emotional and psychological phenotypes, and psychiatric diseases in particular is that current understanding of the brain is so rudimentary that there are few promising endophenotypes, although this situation is now beginning to change as a result of new techniques, such as improved imaging. Knowledge of other dimensions of human physiology is much more advanced, and many more biomarkers have been discovered and validated. For example, while cardiovascular disease is highly complex, such that the existence of a “stroke gene” or a “heart attack gene” that explains most of these events in the general population is unlikely, many risk factors for cardiovascular disease are known, such as blood pressure, various cholesterol subfractions, inflammatory markers, oxidative stress markers, blood clotting

factors, etc., and most genetic investigations that succeeded in localizing genes have focused on these biomarkers (Comuzzie et al., 1997) rather than the clinical disease end points.

Quantitative Traits

Many endophenotypes are quantitative in nature. Continuous traits are attractive targets for genetic investigations for several reasons (Blangero, 2004). They often provide greater power for gene localization (Blangero et al., 2000). The dichotomization of a naturally continuous phenotype (such as categorizing individuals as obese or nonobese based on body mass index or grouping individuals into hypertensive and nonhypertensive groups based on blood pressure) is generally a poor idea, as dichotomization discards information on the underlying trait locus genotypes (Blangero et al., 2001). For example, a severely hypertensive person with a systolic/diastolic blood pressure of, say, 200/120 mm Hg is more likely to harbor genetic variants of substantial elevating effect on blood pressure than a marginally hypertensive individual with a blood pressure of, say, 140/90 mm Hg. “Random ascertainment,” that is, selection of study participants independently of the trait of interest, is a common design of genetic studies of quantitative traits.

The reason this is a suitable strategy is that there is ample variation in the quantitative phenotype in such a sample. This stands in contrast to the investigation of rare diseases in which random samples would contain exclusively or mostly healthy individuals. While nonrandom ascertainment may clearly be used for continuous phenotypes, such as by enriching the sample for individuals with extreme phenotypes from one or both tails of the quantitative distribution, the advantage of such a protocol over random ascertainment is less clear, and attempting to collect larger families may yield greater benefit.
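The loss of information caused by dichotomizing a continuous phenotype, noted earlier in this section, can be illustrated with a small simulation. The sketch below is a toy model under stated assumptions, not an analysis from the cited studies: a single biallelic locus with an additive effect on a normally distributed trait, with a made-up allele frequency, effect size, and cutoff. The function names are hypothetical. It compares how strongly the genotype correlates with the quantitative trait and with its dichotomized version.

```python
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate_correlations(n=20000, allele_freq=0.3, effect=0.5,
                          top_fraction=0.25, seed=1):
    """Return (r with the continuous trait, r with the dichotomized trait).

    All parameter values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    # Genotype = number of trait-raising alleles (0, 1, or 2).
    genotypes = [(rng.random() < allele_freq) + (rng.random() < allele_freq)
                 for _ in range(n)]
    # Quantitative trait = additive genetic effect plus residual noise.
    trait = [effect * g + rng.gauss(0.0, 1.0) for g in genotypes]
    # Dichotomize: label the top fraction of the distribution as "affected".
    cutoff = sorted(trait)[int(n * (1.0 - top_fraction))]
    affected = [1.0 if t >= cutoff else 0.0 for t in trait]
    return pearson_r(genotypes, trait), pearson_r(genotypes, affected)


r_continuous, r_dichotomized = simulate_correlations()
print(r_continuous, r_dichotomized)
```

In this toy setting the genotype is more strongly correlated with the quantitative trait than with the affected/unaffected label derived from it, which is the sense in which the binary label predicts the trait locus genotype less well.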
Besides the concepts discussed above, many additional strategies may be pursued in collecting samples for complex trait gene mapping. The common logic behind these approaches is to make the genetic etiology tractable. The goal is to convert, through smart study design, a complicated trait with many etiological factors into a simple trait controlled by a few factors that have substantial importance and are individually identifiable. In a way, in studies of a nonexperimental organism such as humans, we attempt to take advantage of naturally occurring events (such as unusual phenotypes, families, or populations) to get as close as possible to the simplicity of experimental animal models, for which the experimenter has control over both genetic and environmental factors.

Many complex trait gene-mapping studies employ nonrandom ascertainment protocols, such as the preferential collection of families with

multiple affected individuals, as described above. As a result, such samples are not reflective of the population as a whole, and the effect size of any locus, gene, and polymorphism may differ between such nonrandom samples and the general population. While this would be a serious drawback for many types of investigations, researchers attempting to identify genetic factors tend to accept this as a condition of having any power for gene localization and identification. In addition, the fact that a sample ascertained on a specific phenotype is different from the wider population is not crucial, because, even in randomly ascertained samples, it is nearly impossible to simultaneously identify genetic factors and estimate their effect size, because of what is sometimes referred to as the winner's curse. Complex trait gene-mapping studies tend to have fairly low power. Many different pointwise tests are conducted when attempting to localize genetic factors in our genome. As a result, considerable luck is required for successful gene localization, and the effect sizes at peaks of the mapping statistic tend to be greatly inflated (Utz et al., 2000; Goring et al., 2001). While various investigators have attempted to correct these biases (Allison et al., 2002; Siegmund, 2002; Sun and Bull, 2005), additional samples will often be required anyway to obtain more accurate estimates of the frequency and impact of genetic variants. Furthermore, given the genetic differences between different populations, multiple samples must be analyzed to establish the consistency of findings across populations and ethnicities.

Gene Expression Levels as an Example of Extreme Endophenotypes

Due to recent technological advances, the concentrations of proteins, RNA molecules, and metabolites are increasingly being used as targets in gene-mapping experiments.
Transcript abundance and protein levels may be viewed as extreme endophenotypes that are presumably much closer to the action of individual genes than complex human characteristics. The analysis of gene expression levels is now quickly becoming routine, because of the recent commercial availability of microarrays containing probes for vast numbers of transcripts, making it possible to investigate nearly the entire known transcriptome in a single experiment. For recent reviews of genetic analysis of expression profiling, see de Koning and Haley (2005); Gibson and Weir (2005); Pastinen et al. (2006). Expression profiles have recently been generated for lymphocyte samples from 1,240 Mexican American participants in the San Antonio Family Heart Study (Mitchell et al., 1996), the goal of which is to analyze the genetic underpinnings of atherosclerosis in Mexican Americans. We have recently published an initial paper on the genetic regulation of gene expression

(Göring et al., 2007). This example is mentioned here to caution against overly optimistic views on the prospects of complex trait gene-mapping experiments.

A total of 19,648 unique probes detected substantial abundance of the autosomal transcript being targeted and were subjected to statistical genetic analyses. Using a variance components-based approach and the software package SOLAR (Almasy and Blangero, 1998), we observed that, at a 5 percent false discovery rate (Benjamini and Hochberg, 1995), 85 percent of the expression phenotypes are significantly (additively) heritable (i.e., genetic factors in aggregate account for some proportion of the phenotypic variance). This is perhaps not unexpected, given that gene expression phenotypes are about as close to gene action as possible. However, the heritability estimates of many transcripts were modest (46, 68, and 87 percent of transcripts have heritability estimates less than 20, 30, and 40 percent, respectively), hinting at a substantial influence of the environment or physiological state of the individual at the time of blood draw.

In an effort to localize major genetic factors influencing the expression levels of individual transcripts, we performed genome-wide variance components-based linkage analysis (Amos, 1994; Almasy and Blangero, 1998). Figure 11-3 contains a scatterplot of the transcript-specific maximum linkage statistic by the estimated heritability. The figure is included here to make two points: First, no genes were reliably localized in linkage analysis of many transcripts.
In fact, using the customary threshold of a lod score of 3 as the criterion, only ~10 percent of transcripts had significant linkages. (The lod score, short for logarithm of odds score, is the logarithm to base 10 of the likelihood ratio statistic that compares the likelihood of the alternative hypothesis of linkage with that of the null hypothesis of no linkage; a lod score of 3 thus corresponds to a likelihood ratio of 1,000:1. Asymptotically, the genome-wide significance corresponding to this criterion is ~0.05 in humans.) This points out that the etiology of even such seemingly simple traits can be quite complicated, highlighting the challenges faced by investigators attempting to identify the genes underlying complex human conditions and behaviors. Although measurement error in transcript abundance undoubtedly reduces power, the vast majority of expression phenotypes are nonetheless substantially heritable, suggesting that measurement error alone is not the main reason why no locus was identified for many transcripts.

[Figure 11-3: scatterplot of maximum autosomal lod score (vertical axis, 0 to 10) against heritability estimate (horizontal axis, 0 to 1).]

FIGURE 11-3 Relationship of heritability estimate and maximum lod score.
NOTE: The figure shows the maximum lod score (obtained in variance components-based linkage analysis across the autosomal chromosomes) as a function of the estimated heritability for the expression levels of 19,648 autosomal transcripts. The expression phenotypes were generated on lymphocyte samples from 1,240 Mexican American family members, participants in the San Antonio Family Heart Study (Mitchell et al., 1996). The expression phenotypes were normalized by an inverse Gaussian transformation and were adjusted for the overall RNA levels in a sample and for the effects of sex and age. The vertical axis is truncated at 10 (the maximum obtained lod score was > 50).

Second, the heritability estimate is a poor predictor of "mappability." This is illustrated in Figure 11-4. Heritability assesses the influence of genetic variation in the aggregate on phenotypic variation. However, a substantial heritability does not indicate the existence of major genes. While there is a positive correlation between the heritability estimate and the probability that a major locus exists, this relationship is very imprecise.

[Figure 11-4: schematic with the labels "Polygenic" and "Oligogenic" and a heritability axis running from 0% to 100%.]

FIGURE 11-4 Heritability and genetic architecture.
NOTE: The heritability estimate of a trait is a poor predictor of whether the trait has a "polygenic" or "oligogenic" genetic architecture. In polygenic inheritance, the genetic factors individually have a small influence on a trait of interest, making their identification impossible by statistical genetic approaches. In oligogenic inheritance, one or several genes have substantial phenotypic effect, potentially allowing them to be localized and identified.

Another example that dramatically illustrates this point is human height. We have investigated normal variation in adult stature in nine extended pedigree samples from several different ethnic groups and countries (Göring et al., 2004a, 2004b). The samples comprised nearly 7,500 phenotyped and genotyped individuals in total. After accounting for the sexual dimorphism in height and for the effects of age, height was found to be highly heritable (with heritability estimates ranging from 63 to 92 percent in these samples). Nonetheless, most samples failed to yield a significant lod score in genome-wide linkage analysis, and the few lod score peaks that exceeded the threshold of 3 were inconsistent between the different samples. Other groups have conducted similar studies, but on a smaller scale (Hirschhorn et al., 2001; Perola et al., 2001; Sammalisto et al., 2005). While some interesting linkage results were obtained, in my estimation the findings are consistent with the view that height in the general population, while highly heritable, is a good example of a largely polygenic trait that is quite intractable using standard gene-mapping approaches.
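The distinction between polygenic and oligogenic architecture can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that a trait's heritability is divided equally among a number of loci with independent additive effects. The function name, the heritability value of 0.8, and the locus counts are hypothetical and are not taken from the studies discussed in this chapter.

```python
def per_locus_variance_share(heritability, n_loci):
    """Share of phenotypic variance attributable to each locus.

    Assumes (for illustration only) that the heritability is divided
    equally among n_loci loci with independent, additive effects.
    """
    if not 0.0 <= heritability <= 1.0:
        raise ValueError("heritability must lie between 0 and 1")
    if n_loci < 1:
        raise ValueError("n_loci must be at least 1")
    return heritability / n_loci


# Two hypothetical traits with the same heritability but different architectures.
h2 = 0.8
oligogenic = per_locus_variance_share(h2, n_loci=2)     # two major loci
polygenic = per_locus_variance_share(h2, n_loci=1000)   # many minor loci

print(f"oligogenic: each locus explains {oligogenic:.1%} of the variance")
print(f"polygenic:  each locus explains {polygenic:.2%} of the variance")
```

Under these assumptions, each locus accounts for 40 percent of the phenotypic variance in the first case and 0.08 percent in the second, even though the two traits are equally heritable. Real traits will not split their heritability evenly, so the numbers only indicate the order of magnitude involved.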

SOCIAL SCIENCE SURVEYS: A TOOL FOR GENE DISCOVERY OR VALIDATION?

The etiological complexity of multifactorial traits poses an enormous challenge for efforts to localize and identify the underlying genes. The chance of success will be greatly improved if studies focus on biological characteristics that are amenable to genetic investigations and if studies are designed in order to reduce the complexity as much as possible. In the preceding sections, some strategies for simplifying the etiological architecture have been presented.

Social science surveys come in many shapes and sizes, and it is thus difficult to provide general comments on their utility for gene localization, identification, and validation. Whether or not a particular cohort may be useful for these purposes depends on the many specific features of a particular survey. The remarks here are very general. They do not do justice to the uniqueness of individual surveys and do not highlight the many exceptions and caveats. For a more nuanced perspective, the reader is referred to the detailed descriptions of specific surveys that are provided in various chapters of this book.

Social, political, economic, and demographic surveys have very different goals from gene-mapping studies, and thus they are typically designed in a very different and even diametrically opposed manner. Often surveys are conducted in a way that ensures that they are as representative of the surveyed population as possible. This design goal tends to enrich the diversity of environmental factors as well as genetic factors that influence a characteristic of interest. In many cases, surveys conducted in the United States include individuals from many different ethnic groups, due to the ethnic diversity of this country and the desire to understand the importance of racial identity on social conditions.
This additional phenotypic variance due to both genetic and environmental factors is generally best avoided in gene-mapping studies, although "admixture mapping" is a gene-mapping approach that seeks to take advantage of any existing allele frequency differences between the founder populations comprising an admixed society (Smith and O'Brien, 2005).

The large size of most surveys necessitates the use of questionnaires or measurements that can be done quickly and cheaply. There often is no time or money for in-depth assessment of characteristics. Ethical concerns may also limit what types of measurements can be conducted on a population-wide scale.

Surveys in the social sciences are often primarily interested in behaviors, characteristics, and phenomena that are likely far removed from the influence of individual genes. There is no reason to believe, based on evolutionary considerations, practical gene-mapping experience, and common sense, that a characteristic such as, say, socioeconomic status is amenable to dissection by statistical genetic approaches. The multitude of factors (such as intelligence, talent, likes, and ambition, as well as family environment, parental socioeconomic status, quality of schooling, type of local economy, availability of jobs, and many others) appears too large for gene-mapping strategies to be likely to succeed. The brain is so complex that it is extremely difficult to identify the genetic factors underlying even gross disturbances, such as in severe psychiatric disease.

The large size of many surveys is a wonderful attribute. However, it is unlikely that an increase in sample size alone can readily overcome the challenges of truly complex phenotypes and the drawbacks of study designs that are suboptimal for gene mapping. It appears likely that genotyping technologies will continue to improve so quickly that it will soon be feasible to obtain the entire DNA sequence of all study participants. However, even if complete sequence data were available for all 300 million U.S. residents or even all 6 billion humans on earth, the identification of causal genetic factors influencing human behaviors and similarly complex traits would still be very challenging. In all likelihood, subsetting strategies would be employed that mimic what is now being done through the selection of suitable populations and specific ascertainment criteria. For these reasons, social science surveys are probably a poor tool for gene localization and identification in general.

However, this conclusion does not address the question of whether such surveys are suitable for "replication" of candidate genes previously identified elsewhere. One reason for the lack of power in gene-mapping experiments is the extremely large hypothesis space.
The human genome spans about 3 billion base pairs of DNA and harbors about 30,000 different genes (although this latter number is quite uncertain and depends on the definition of a gene). The rules of genetic inheritance impose a high degree of correlation among these many factors, making it possible to exhaustively evaluate all hypotheses. No specific prior hypotheses are thus required, and gene-mapping experiments may be better viewed in the framework of estimation (i.e., where the genes influencing a trait are located) than in the framework of testing (i.e., whether a gene influencing the trait of interest is located at a specific position). However, the number of independent tests used to span the genome is still substantial: approximately 500 independent tests are required to span the genome with linkage analysis, and many more are required for genome-wide association analysis. The large number of "tests" substantially reduces the power of genome-wide mapping studies. In contrast, once a specific factor has been identified as being a likely contributor to some trait of interest, a specific hypothesis is at hand that can be tested.
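The arithmetic behind this loss of power can be sketched as follows. The example converts a lod score into a pointwise p-value, scales it by the roughly 500 independent linkage tests mentioned above, and compares the power of a genome scan with that of a single prespecified test. It rests on assumptions that are not stated in the chapter: the lod score is treated as asymptotically following a 50:50 mixture of a point mass at zero and a chi-square distribution with one degree of freedom, a simple Bonferroni-style correction is applied, and the test statistic is treated as a one-sided normal deviate with a hypothetical expected value of 3. All function names are illustrative.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def lod_to_pointwise_p(lod):
    """Asymptotic pointwise p-value for a lod score (assumed one-sided test)."""
    chi_square = 2.0 * math.log(10.0) * lod  # likelihood-ratio statistic
    # P(chi2 with 1 df > x) = erfc(sqrt(x / 2)); halve it for the one-sided mixture.
    return 0.5 * math.erfc(math.sqrt(chi_square / 2.0))


def genome_wide_p(pointwise_p, n_independent_tests=500):
    """Bonferroni-style upper bound on the genome-wide false-positive rate."""
    return min(1.0, pointwise_p * n_independent_tests)


def power_one_sided(expected_z, alpha):
    """Power of a one-sided z-test whose statistic has mean expected_z."""
    critical_value = STD_NORMAL.inv_cdf(1.0 - alpha)
    return 1.0 - STD_NORMAL.cdf(critical_value - expected_z)


p_lod3 = lod_to_pointwise_p(3.0)
print(f"pointwise p for lod 3: {p_lod3:.1e}")
print(f"genome-wide bound over 500 tests: {genome_wide_p(p_lod3):.3f}")

expected_z = 3.0  # hypothetical signal strength
print(f"power in a genome scan: {power_one_sided(expected_z, p_lod3):.2f}")
print(f"power of a single test at alpha = 0.05: {power_one_sided(expected_z, 0.05):.2f}")
```

With these assumptions, a lod score of 3 corresponds to a pointwise p-value of about 0.0001, which over 500 independent tests gives a genome-wide level near 0.05. A signal that a single test at the 0.05 level would detect roughly nine times out of ten is detected only about a quarter of the time at the genome-wide threshold.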

No multiple testing applies in this situation, greatly improving power. Such validation is important for several reasons:

• Given the lack of power of many complex trait gene-mapping studies, many of the published results may represent false positive findings. Replication by others is thus necessary in order for such findings to be accepted.
• As mentioned above, it is next to impossible to map genes and at the same time estimate their effect sizes reliably. Naïve estimators tend to be greatly inflated because of selection bias (Göring et al., 2001). While procedures for bias reduction are actively being developed, the estimation of effect size in independent samples remains crucial.
• Samples for gene-mapping studies are often not representative of the population as a whole. They are generally restricted to one ethnic group or are greatly enriched for the biological condition of interest. Thus, it is of great interest to assess the generality of findings in the wider population and in different population strata and ethnic groups.

Some characteristics that are typical of many social science surveys suggest that they may be highly suitable for examining individual genetic variants with substantial prior support. The large size of many such surveys should in principle allow for very precise estimation of allelic effect size. And the fact that the sample is typically collected to be representative of the wider population should in principle improve effect size estimation when compared with deriving such estimates from samples that are ascertained based on the trait of interest, as no correction for ascertainment (which is notoriously difficult) is necessary.
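The inflation of effect size estimates by selection bias, and the value of re-estimating effects in independent samples, can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the procedure of Göring et al. (2001): every simulated study estimates the same true effect with normally distributed error, only estimates that pass a stringent significance threshold are reported, and each reported finding is then re-estimated once in an independent sample. The true effect, standard error, threshold, and number of studies are all hypothetical values chosen for illustration.

```python
import random
from statistics import mean


def simulate_selection_bias(true_effect=0.1, standard_error=0.1,
                            z_threshold=3.72, n_studies=20000, seed=1):
    """Simulate discovery and replication estimates of one true effect.

    Each simulated discovery study yields a noisy estimate of the same true
    effect. Only estimates exceeding the significance threshold are kept;
    each kept finding is then re-estimated once in an independent sample of
    the same precision, without any selection.
    """
    rng = random.Random(seed)
    cutoff = z_threshold * standard_error
    discovery, replication = [], []
    for _ in range(n_studies):
        estimate = rng.gauss(true_effect, standard_error)
        if estimate > cutoff:
            discovery.append(estimate)
            replication.append(rng.gauss(true_effect, standard_error))
    return discovery, replication


TRUE_EFFECT = 0.1
discovery, replication = simulate_selection_bias(true_effect=TRUE_EFFECT)
print(f"true effect:               {TRUE_EFFECT:.3f}")
print(f"significant discoveries:   {len(discovery)}")
print(f"mean discovery estimate:   {mean(discovery):.3f}")
print(f"mean replication estimate: {mean(replication):.3f}")
```

Because a discovery is reported only when its estimate exceeds the cutoff, and the cutoff here lies well above the true effect, the reported estimates are inflated by construction. The replication estimates are not subject to that selection and therefore scatter around the true value.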
If surveys cover individuals from many different ethnic groups, then the replication of previously identified genetic variants can establish the generality of the findings in many different ethnicities within one survey cohort, assuming that reliable information on ethnic origin is available.

While social science surveys may in principle be useful for validating previously identified genetic variants, survey cohorts are by no means universally useful for that purpose, nor are they the only or necessarily the best type of sample available for replication. In many instances, investigations of the genetic etiology of human traits are conducted by many research groups simultaneously, either in collaboration or in competition with one another. Hence, once one of these groups identifies a particular gene and/or variant and announces the results, replication studies are often done very rapidly, because many other cohorts in which the trait of interest has been measured are at hand. Given the international nature of human gene identification efforts, these cohorts are likely to come from many countries and ethnic groups, such that the importance of the identified variants can be established for many human populations. In this situation, the need for and the utility of social science survey cohorts as further replication samples may then well be limited or nonexistent. And while large surveys can in principle be used for validation of specific genetic variants by association analysis, these cohorts will often not be useful for replicating findings of heritability or linkage, which require information on familial relationships that is missing from many social science surveys.

If social science surveys are to be used for gene validation, two conditions must be met: The phenotype of interest must have been assessed in the survey, and DNA samples must have been collected. To increase the range of phenotypes that can be used for gene validation, it may potentially be useful to collect, as part of such surveys, biological specimens (such as saliva, blood, and urine) that can be obtained easily, cheaply, and at minimal risk to survey participants. A vast number of biomarkers underlying complex diseases have been detected and characterized in these readily available tissues (for details, see the chapters devoted to this topic).

If DNA and/or biological specimens are to be collected in social science surveys, the surveys should be designed for this purpose from the outset rather than retrofitted later. Thus, the pros and cons should be carefully weighed beforehand. These considerations should take into account both the many practical issues (such as cost, speed, effect on study participation, transport, storage, etc.) and also the expected value of a given survey for later genetic investigations. For example, if a survey is conducted in a population that is rarely investigated, then the collection of DNA may be more useful than in a well-studied population.
And the more expensive or difficult a trait or its associated environmental risk factors are to measure, the greater the potential value of DNA samples collected concurrently. Finally, it is paramount that ethical concerns are given careful consideration.

CONCLUSION

The etiological architecture of complex traits presents a formidable challenge to attempts to identify the underlying genetic components. To overcome this difficulty, gene-mapping studies should be designed in a manner that reduces the etiological complexity as much as possible. Enormous sample sizes, large-scale genotyping and sequencing efforts, sophisticated statistical approaches, and fast computers cannot compensate for a trait that is poorly chosen or for a study that is poorly designed. In general, social science surveys tend to focus on characteristics that are likely to be far removed from the action of individual genes. In addition, these surveys are designed with different goals in mind. For these reasons, it appears that social science surveys will often be poorly suited for complex trait gene discovery.

At the same time, the large size of many social science surveys, and the fact that by design the survey sample is often representative of the target population, make such surveys potentially useful for gene validation. This usage would permit "replication," that is, confirming the role of specific genetic variants in specific traits, allow for a more accurate estimation of the effect size of specific genes and functional variants therein, and allow gene effect size comparisons between different populations and ethnicities. In order to be used in this manner, surveys would need to collect tissue samples that allow DNA extraction for genotyping and potentially also biological specimens permitting phenotyping. Careful consideration must be given to ethical concerns that may arise if genetic data or other information are generated or analyzed by individuals who are not involved in the survey. These issues should be addressed before a survey is conducted, and appropriate informed consent must be obtained.

This chapter has focused mostly on the potential utility of social science surveys for complex trait gene discovery and validation. However, social science and biological or genetic research obviously influence and aid each other in a myriad of ways. For example, if a genetic factor is known to influence some human condition that is of interest in a particular social science survey, then genotyping this genetic factor may be of great help when attempting to identify other explanatory variables underlying the condition.
Conversely, known environmental risk factors, including those detected and characterized in social science studies, should certainly be measured and accounted for in genetic studies attempting to identify novel genetic risk factors. After all, the vast majority of human traits are influenced by both genetic and environmental factors, and we should not let ignorance or dogma (including whether a study should still be called a social science survey if it also collects biological specimens) get in the way of using any relevant information to advance the research.

From a more distant perspective, the overarching question regarding the use of genetic indicators in social science surveys is whether it is more useful to have a larger number of smaller and more specialized studies or fewer studies that are larger and more general. As is so often the case, the correct answer probably is "it depends."

REFERENCES

Allison, A.C. (1954). Protection afforded by sickle cell trait against subtertian malarian infection. British Medical Journal, 1, 290-294.
Allison, A.C. (1961). Genetic factors in resistance to malaria. Annals of the New York Academy of Sciences, 91, 710-729.
Allison, D.B., Fernandez, J.R., et al. (2002). Bias in estimates of quantitative-trait-locus effect in genome scans: Demonstration of the phenomenon and a method-of-moments procedure for reducing bias. American Journal of Human Genetics, 70(3), 575-585.
Almasy, L., and Blangero, J. (1998). Multipoint quantitative-trait linkage analysis in general pedigrees. American Journal of Human Genetics, 62(5), 1198-1211.
Amos, C.I. (1994). Robust variance-components approach for assessing genetic linkage in pedigrees. American Journal of Human Genetics, 54(3), 535-543.
Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, 57(1), 289-300.
Bhatti, M.T. (2006). Retinitis pigmentosa, pigmentary retinopathies, and neurologic diseases. Current Neurology and Neuroscience Reports, 6(5), 403-413.
Blangero, J. (2004). Localization and identification of human quantitative trait loci: King harvest has surely come. Current Opinion in Genetics & Development, 14(3), 233-240.
Blangero, J., Williams, J.T., et al. (1999). Oligogenic model selection using the Bayesian Information Criterion: Linkage analysis of the P300 Cz event-related brain potential. Genetic Epidemiology, 17(Suppl 1), S67-S72.
Blangero, J., Williams, J.T., et al. (2000). Quantitative trait locus mapping using human pedigrees. Human Biology, 72(1), 35-62.
Blangero, J., Williams, J.T., et al. (2001). Variance component methods for detecting complex trait loci. Advances in Genetics, 42, 151-181.
Blangero, J., Williams, J.T., et al. (2003). Novel family-based approaches to genetic risk in thrombosis. Journal of Thrombosis and Haemostasis, 1(7), 1391-1397.
Caspi, A., Sugden, K., et al. (2003). Influence of life stress on depression: Moderation by a polymorphism in the 5-HTT gene. Science, 301(5631), 386-389.
Comuzzie, A.G., Hixson, J.E., et al. (1997). A major quantitative trait locus determining serum leptin levels and fat mass is located on human chromosome 2. Nature Genetics, 15(3), 273-276.
Cox, D.R. (1984). Interaction. International Statistical Review, 51(1), 1-31.
de Koning, D.J., and Haley, C.S. (2005). Genetical genomics in humans and model organisms. Trends in Genetics, 21(7), 377-381.
Gibson, G., and Weir, B. (2005). The quantitative genetics of transcription. Trends in Genetics, 21(11), 616-623.
Goldfarb, A., and Avraham, K.B. (2002). Genetics of deafness: Recent advances and clinical implications. Journal of Basic and Clinical Physiology and Pharmacology, 13(2), 75-88.
Göring, H.H.H., Curran, J.E., Johnson, M.P., Dyer, T.D., Charlesworth, J., Cole, S.A., Jowett, J.B.M., Abraham, L.J., Rainwater, D.L., Comuzzie, A.G., Mahaney, M.C., Almasy, K.L., MacCluer, J.W., Kissebah, A.H., Collier, G.R., Moses, E.K., and Blangero, J. (2007). Discovery of expression QTLs using large-scale transcriptional profiling in human lymphocytes. Nature Genetics, 39(10), 1208-1216.
Göring, H.H.H., Duggirala, R., et al. (2004a). Genome-wide linkage analyses of human stature in pedigree samples from different ethnicities. Paper presented at the 73rd Annual Meeting of the American Association of Physical Anthropologists.

Göring, H.H.H., Duggirala, R., et al. (2004b). Localization of genetic factors influencing height by genome-wide linkage analysis in large pedigree samples. Paper presented at the Xth International Congress of Auxology, Firenze, Italy.
Göring, H.H.H., Terwilliger, J.D., et al. (2001). Large upward bias in estimation of locus-specific effects from genomewide scans. American Journal of Human Genetics, 69(6), 1357-1369.
Heckenlively, J.R., and Daiger, S.P. (1996). Hereditary retinal and choroidal degenerations. In D.L. Rimoin, J.M. Connor, and R.E. Pyeritz (Eds.), Emory and Rimoin's principles and practice of medical genetics (pp. 2555-2576). Edinburgh, Scotland: Churchill-Livingstone.
Hirschhorn, J.N., Lindgren, C.M., et al. (2001). Genomewide linkage analysis of stature in multiple populations reveals several regions with evidence of linkage to adult height. American Journal of Human Genetics, 69(1), 106-116.
Hovatta, I., Lichtermann, D., et al. (1998). Linkage analysis of putative schizophrenia gene candidate regions on chromosomes 3p, 5q, 6p, 8p, 20p and 22q in a population-based sampled Finnish family set. Molecular Psychiatry, 3(5), 452-457.
Hovatta, I., Varilo, T., et al. (1999). A genomewide screen for schizophrenia genes in an isolated Finnish subpopulation, suggesting multiple susceptibility loci. American Journal of Human Genetics, 65(4), 1114-1124.
Kulkarni, P.S., Butera, S.T., et al. (2003). Resistance to HIV-1 infection: Lessons learned from studies of highly exposed persistently seronegative (HEPS) individuals. AIDS Revealed, 5(2), 87-103.
Martin, L.J., Comuzzie, A.G., et al. (2001). The utility of Bayesian model averaging for detecting known oligogenic effects. Genetic Epidemiology, 21(Suppl 1), S789-S793.
Mitchell, B.D., Kammerer, C.M., et al. (1996). Genetic and environmental contributions to cardiovascular risk factors in Mexican Americans. The San Antonio Family Heart Study. Circulation, 94(9), 2159-2170.
Pastinen, T., Ge, B., et al. (2006). Influence of human genome polymorphism on gene expression. Human Molecular Genetics, 15(Spec. No. 1), R9-R16.
Peltonen, L., Jalanko, A., et al. (1999). Molecular genetics of the Finnish disease heritage. Human Molecular Genetics, 8(10), 1913-1923.
Penrose, L.S. (1935). The detection of autosomal linkage in data which consist of pairs of brothers and sisters of unspecified parentage. Annals of Eugenics, 6, 133-138.
Perola, M., Ohman, M., et al. (2001). Quantitative-trait-locus analysis of body-mass index and of stature, by combined analysis of genome scans of five Finnish study groups. American Journal of Human Genetics, 69(1), 117-123.
Rivolta, C., Sharon, D., et al. (2002). Retinitis pigmentosa and allied diseases: Numerous diseases, genes, and inheritance patterns. Human Molecular Genetics, 11(10), 1219-1227.
Sammalisto, S., Hiekkalinna, T., et al. (2005). A male-specific quantitative trait locus on 1p21 controlling human stature. Journal of Medical Genetics, 42(12), 932-939.
Siegmund, D. (2002). Upward bias in estimation of genetic effects. American Journal of Human Genetics, 71(5), 1183-1188.
Smith, M.W., and O'Brien, S.J. (2005). Mapping by admixture linkage disequilibrium: Advances, limitations, and guidelines. Nature Reviews Genetics, 6(8), 623-632.
Spielman, R.S., McGinnis, R.E., et al. (1993). Transmission test for linkage disequilibrium: The insulin gene region and insulin-dependent diabetes mellitus (IDDM). American Journal of Human Genetics, 52(3), 506-516.
Sun, L., and Bull, S.B. (2005). Reduction of selection bias in genomewide studies by resampling. Genetic Epidemiology, 28(4), 352-367.
Terwilliger, J.D., and Göring, H.H. (2000). Gene mapping in the 20th and 21st centuries: Statistical methods, data analysis, and experimental design. Human Biology, 72(1), 63-132.

Terwilliger, J.D., Göring, H.H.H., et al. (2002). Study design for genetic epidemiology and gene mapping: The Korean Diaspora Project. Shengming Kexue Yanjiu (Life Science Research), 6, 95-115.
Tilley, L., Morgan, K., et al. (1998). Genetic risk factors in Alzheimer's disease. Molecular Pathology, 51(6), 293-304.
Utz, H.F., Melchinger, A.E., et al. (2000). Bias and sampling error of the estimated proportion of genotypic variance explained by quantitative trait loci determined from experimental data in maize using cross validation and validation with independent samples. Genetics, 154(3), 1839-1849.
Weiss, K.M., and Terwilliger, J.D. (2000). How many diseases does it take to map a gene with SNPs? Nature Genetics, 26(2), 151-157.
Wexler, N.S., Lorimer, J., et al. (2004). Venezuelan kindreds reveal that genetic and environmental factors modulate Huntington's disease age of onset. Proceedings of the National Academy of Sciences, USA, 101(10), 3498-3503.
Williams, J.T., and Blangero, J. (1999). Power of variance component linkage analysis to detect quantitative trait loci. Annals of Human Genetics, 63(Pt 6), 545-563.
Wright, A.F., Carothers, A.D., et al. (1999). Population choice in mapping genes for complex diseases. Nature Genetics, 23(4), 397-404.

