MAPPING HEREDITY: USING PROBABILISTIC MODELS AND ALGORITHMS TO MAP GENES AND GENOMES

Statistical Significance

In many genetic situations, one may search for a disease gene by evaluating a test statistic at many locations along the genome. When multiple comparisons are done, the threshold for statistical significance must be higher than the threshold for a single comparison. But how high should the threshold be? In principle, looking for the presence of a gene at every position along a continuous line involves infinitely many tests, although nearby points are clearly correlated. Surprisingly, the answer to this threshold question turns out to depend on relatively recent results from the theory of large deviations of diffusion processes. This idea is elaborated on in the next section, based on an example from recent work in our laboratory on susceptibility to colon cancer.

Excursion: Susceptibility to Colon Cancer in Mice and the Large Deviation Theory of Diffusion Processes

Colon cancer is one of the most prevalent malignancies in Western societies, with an estimated 145,000 new cases and 60,000 deaths per year in the United States alone. Although environmental factors such as diet can markedly influence the incidence of the disease, genetic factors are known to play a key role. Some families show striking clusters of colon cancer, with aggregations far beyond what could be explained by chance alone. Among such colon cancer families, there is a distinctive subtype called familial adenomatous polyposis (FAP), in which affected individuals develop a large number of intestinal growths, called polyps, that can become tumors. Genetic mapping studies (Bodmer et al., 1987; Leppert et al., 1987) showed that FAP was genetically linked to a region on the long arm of human chromosome 5; subsequently, physical mapping studies led to the isolation of the responsible gene, named APC (Groden et al., 1991; Kinzler et al., 1991; Nishisho et al., 1991).

One way to study the role of APC in tumorigenesis is to turn to biochemistry, in an effort to understand the cellular components with which the protein product interacts. Another way is to turn back to genetics for further insight. One observation about FAP families is that individuals inheriting precisely the same APC mutation may be affected to
different degrees. What is the reason for the variability in the manifestation of the disease? Is it due to environment or to the effects of other genes? If the latter, then finding such modifying genes could shed light on the process by which colon cancer develops. By the usual scientific serendipity, animal studies turned out to hold an important clue. In 1990, William Dove's laboratory at the University of Wisconsin was performing mutagenesis experiments and identified a mouse that spontaneously developed colon tumors (Moser et al., 1990). The dominantly acting mutation responsible for the trait was named Min (for multiple intestinal neoplasia). After considerable genetic mapping and cloning, Dove and his colleagues showed that Min was in fact a mutation in the mouse version of the APC gene (Su et al., 1992). The Min mouse thus provided a model of human colon cancer and, in particular, a way to look for other genes that might suppress the development of colon tumors. The Min mutation is usually maintained in a heterozygous state on a mouse strain called B6, and such B6 Min/+ mice typically develop about 30 intestinal tumors and die by 3 to 4 months of age. When Dove and his colleagues crossed this mouse to another mouse strain called AKR, they got a surprising result: the F1 Min/+ progeny developed many fewer colon tumors. On average, the F1 mice developed about six tumors and most did not die from them. Somehow, the AKR strain must have contributed alleles at one or more genes that substantially modified the effects of Min. Dove's laboratory and our laboratory decided to collaborate to try to map the modifying genes (Dietrich et al., 1993). A backcross was arranged in which the F1 progeny were mated back to the more susceptible B6 strain (Figure 2.4).
For any modifier locus, 50 percent of the progeny should inherit one copy of the suppressing allele from the AKR strain (that is, have genotype AB) and 50 percent should be homozygous for the nonsuppressing allele from B6 (that is, have genotype BB). Each animal inheriting the Min mutation was scored for its phenotype, by dissecting the intestine and counting the number of tumors, and for its genotype, by typing the mice for a dense map of DNA polymorphisms that had been constructed in our laboratory (Dietrich et al., 1992). The complete data for animal $i$ consist of two parts: a phenotype $\phi_i$ and a function $g_i(x)$ giving the genotype (either AB or BB) at each position $x$ along the chromosome (Figure 2.5). Actually,
Figure 2.4 Distribution of colon tumors caused by the Min mutation. Mice from the B6 strain carrying the genotype Min/+ develop about 30 tumors on average. When these mice are crossed to the AKR strain, the resulting F1 progeny develop only about 6 tumors. When the F1 progeny are crossed back to the B6 strain, the resulting backcross progeny show a wide distribution in tumor number. (A) Design of cross. (B) Scatterplot of tumor numbers from different generations in the cross.
the problem is slightly more complicated because one can only observe the genotype at the locations of the DNA polymorphisms studied. However, for this discussion, the map can be assumed to be so dense that the data are essentially continuous. It can also be assumed that the number $n$ of progeny is very large.

Figure 2.5 Schematic representation of data for genetic analysis of quantitative traits in a backcross. Every offspring ($i = 1, 2, \ldots, n$) has a phenotype that is a continuous variable $\phi$ and a genotype at every position in the genome. The genotype $g_i(x)$ at position $x$ has two possible states in a backcross (homozygous or heterozygous, encoded as 0 or 1 and represented by black or white in the figure). The figure illustrates the case where the phenotype might depend on two quantitative trait loci (QTL1 and QTL2), according to a linear model $\phi = a_1 g_1 + a_2 g_2 + \varepsilon$, where $g_j$ is the genotype at QTL$j$, the $a_j$ are constants, and $\varepsilon$ is a normal random variable.

At every position $x$ along the chromosome, the animals can be divided into two sets according to their genotype: $AB(x) = \{\text{animal } i \mid g_i(x) = AB\}$ and $BB(x) = \{\text{animal } i \mid g_i(x) = BB\}$. If a major modifier gene occurs at location $x^*$, then the animals in $AB(x^*)$ should have many fewer tumors than the animals in $BB(x^*)$. One could thus perform a t-test (the usual two-sample t-statistic based on the number of tumors per animal in the two groups) at every position along the
chromosome to find a region where the t-statistic $Z(x)$ exceeds some critical threshold $T$. How high a threshold is needed to ensure statistical significance if one scans the entire genome? Consider first a single chromosome. If there is no modifying gene anywhere along it, the t-statistic $Z(x)$ at any given point $x$ should be (for large $n$) normally distributed with mean 0 and variance 1. It is thus easy to determine the appropriate significance level for the single test at $x$. But we need to know the distribution of $\max_x Z(x)$, where the maximum is taken over the entire chromosome.

This question belongs to the field of Gaussian processes. A family of variables $\{Z(x),\ a \le x \le b\}$ is called a Gaussian process if for each $n = 1, 2, \ldots$ and each $x_1 < x_2 < \cdots < x_n$, the random variables $Z(x_1), Z(x_2), \ldots, Z(x_n)$ are jointly normally distributed. A Gaussian process is specified by its mean $\mu(t) = E(Z(t))$ and its covariance $C(s,t) = \operatorname{cov}(Z(s), Z(t))$. An important example is the "Ornstein-Uhlenbeck process," in which $\mu(t) = 0$ and $C(s,t) = e^{-\beta |s-t|}$. The Ornstein-Uhlenbeck process arises naturally in physics, because it describes the behavior of a particle undergoing Brownian motion trapped in a potential well. In recent years, Gaussian processes have been a subject of considerable mathematical interest, and the large deviation theory has been worked out for many cases, including the Ornstein-Uhlenbeck process. Interestingly, it is not hard to show that the statistic $Z(x)$ in our genetic example also follows an Ornstein-Uhlenbeck process, with $\beta = 2$ when distance is measured in morgans. (The mean is 0, and the covariance follows essentially from the Haldane mapping function mentioned above.)
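The covariance claim can be checked numerically with a minimal sketch. It assumes Haldane's no-interference model, under which two loci $d$ morgans apart recombine with probability $\theta = (1 - e^{-2d})/2$, so that 0/1 backcross genotypes at the two loci have correlation $1 - 2\theta = e^{-2d}$. The function names, sample size, and distances are illustrative assumptions, not part of the original analysis.

```python
import math
import random


def haldane_recombination_fraction(d_morgans):
    """Haldane map function: chance of an odd number of crossovers in d morgans."""
    return 0.5 * (1.0 - math.exp(-2.0 * abs(d_morgans)))


def genotype_correlation(d_morgans):
    """Theoretical correlation of 0/1 backcross genotypes d morgans apart."""
    # Genotypes agree with probability 1 - theta, so the correlation is 1 - 2*theta.
    return 1.0 - 2.0 * haldane_recombination_fraction(d_morgans)


def simulate_genotype_pairs(d_morgans, n_animals, rng):
    """Draw (g(x), g(x + d)) for n_animals independent backcross progeny."""
    theta = haldane_recombination_fraction(d_morgans)
    pairs = []
    for _ in range(n_animals):
        first = rng.randint(0, 1)           # BB = 0 or AB = 1, each with prob. 1/2
        recombined = rng.random() < theta   # odd number of crossovers in between
        second = 1 - first if recombined else first
        pairs.append((first, second))
    return pairs


def sample_correlation(pairs):
    """Pearson correlation between the two coordinates of the pairs."""
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs) / n
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs) / n
    var_b = sum((b - mean_b) ** 2 for _, b in pairs) / n
    return cov / math.sqrt(var_a * var_b)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the illustration is repeatable
    for d in (0.05, 0.2, 0.5):
        pairs = simulate_genotype_pairs(d, 20000, rng)
        print(d, sample_correlation(pairs), math.exp(-2.0 * d))
```

The simulated correlation should track $e^{-2d}$ closely. Because, under the null hypothesis and for large $n$, the correlation between the t-statistics at two positions matches the correlation between the genotypes there, this is the Ornstein-Uhlenbeck covariance with $\beta = 2$.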
Using recent mathematical results (Feingold et al., 1993; Lander and Botstein, 1989), one can thus show that, for large $t$,

$$P\Bigl\{\max_{0 \le x \le G} Z(x) \ge t\Bigr\} \sim (C + 2Gt^2)\,(1 - \Phi(t)),$$

where $\Phi(t)$ is the standard normal cumulative distribution function, $C$ is the number of chromosomes, and $G$ is the length of the genome in morgans. In short, the probability of exceeding threshold $t$ somewhere along a genome of length $G$ is larger by a factor of about $2Gt^2$ than the probability of exceeding it at a single point.
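To see what this approximation implies in practice, the following minimal sketch solves $(C + 2Gt^2)(1 - \Phi(t)) = \alpha$ for the threshold $t$ by bisection. The function names are hypothetical, and the values $C = 20$ and $G = 16$ morgans in the usage example are rough, illustrative assumptions for a mouse-sized genome rather than figures from the study. The formula is a large-$t$ approximation, so the result is only meaningful for small $\alpha$.

```python
import math


def normal_upper_tail(t):
    """1 - Phi(t) for the standard normal distribution."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def genomewide_exceedance(t, n_chromosomes, genome_morgans):
    """Large-t approximation to P{max Z(x) >= t}: (C + 2 G t^2)(1 - Phi(t))."""
    return (n_chromosomes + 2.0 * genome_morgans * t * t) * normal_upper_tail(t)


def genomewide_threshold(alpha, n_chromosomes, genome_morgans, lo=1.5, hi=10.0):
    """Solve genomewide_exceedance(t) = alpha for t by bisection on [lo, hi]."""
    f_lo = genomewide_exceedance(lo, n_chromosomes, genome_morgans)
    f_hi = genomewide_exceedance(hi, n_chromosomes, genome_morgans)
    if not (f_hi < alpha < f_lo):
        raise ValueError("alpha is not bracketed by the search interval")
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if genomewide_exceedance(mid, n_chromosomes, genome_morgans) > alpha:
            lo = mid   # still too likely to be exceeded: raise the threshold
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Illustrative assumptions: C = 20 chromosomes, G = 16 morgans.
    t_genome = genomewide_threshold(0.05, 20, 16.0)
    print("genome-wide threshold:", t_genome)
    print("single-point tail at that threshold:", normal_upper_tail(t_genome))
```

Under these assumed values the genome-wide threshold is far above the single-test cutoff of about 1.645 for a one-sided 5 percent test, which is the practical content of the factor $C + 2Gt^2$.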