MAPPING HEREDITY: USING PROBABILISTIC MODELS AND ALGORITHMS TO MAP GENES AND GENOMES

genotype for every family member, including the precise parental chromosome from which each allele was inherited. Given complete information, it is usually easy to estimate the required parameters; for example, the recombination frequency can be estimated by counting recombinant chromosomes, and the penetrance can be estimated as the proportion of individuals with a disease-predisposing genotype who manifest the disease. Unfortunately, one typically has only incomplete data Y, from which it is difficult to estimate θ directly.

The maximum likelihood estimate θ̂ is the value that makes the observed data Y most likely to have occurred, that is, the value that maximizes the likelihood function L(θ) := P(Y | θ). Using Bayes's theorem, one can calculate L(θ) by the equation

    L(θ) = P(Y | θ) = Σ_X P(Y | X) P(X | θ),

where the summation is taken over all possible values of the complete data X. The sum is easy to calculate in theory: it decomposes into terms (corresponding to each individual and each genetic interval) that are conditionally independent given the complete data X. One can then calculate θ̂ by numerical maximization methods. To determine whether θ̂ is significantly different from a null value θ0 (for example, to see whether an estimated recombination frequency is significantly less than 50 percent), one examines the likelihood ratio Z = L(θ̂) / L(θ0). If Z exceeds an appropriate threshold T, a statistically significant effect has been found.

In principle, virtually any genetic problem can be treated by this approach. In practice, however, two important issues arise: (1) efficient algorithms and (2) statistical significance.

Efficient Algorithms

The number of terms in the sum L(θ) scales roughly as O(e^(cmn)), where m is the number of people in the family, n is the number of genetic markers studied, and c is a constant. Except in the case of the smallest
problems, it is infeasible to enumerate all the terms in the sum. Thus, it is a challenge even to calculate the likelihood L(θ) at a single point, let alone to find the value θ̂ that maximizes the function.

Considerable mathematical attention has been devoted to finding efficient ways to calculate L(θ). The first breakthrough was due to Elston and Stewart (1971), who devised an algorithm for computing the sum in a nested fashion from the youngest generations of a family to the oldest. The algorithm uses the fact that the youngest generation depends only on its parents. The running time scales as O(m e^(c'n)), exponential only in the number of genetic markers. This algorithm has been a mainstay of genetic calculations but becomes cumbersome when geneticists wish to employ ever-denser maps involving more genetic markers.

Subsequently, Lander and Green (1987) developed an alternative algorithm, based on hidden Markov models, for nesting the sum according to the genetic markers used, starting with markers at one end of the map and working across to the other end. These authors also exploited the expectation-maximization (EM) algorithm from statistics (a well-known maximum likelihood approach) to aid in the multidimensional search for θ̂. The algorithm scales as O(n e^(c''m)), linear in the number of markers although exponential in the number of family members. It has proven very useful for constructing dense genetic linkage maps, which involves studying many markers in families of limited size, but it is not practical for studying large disease pedigrees.

Recently, mathematical geneticists have explored ways to approximate L(θ) by sampling from the sum. Modern techniques such as Gibbs sampling and importance sampling have been introduced in the past few years (Kong, 1991; Kong et al., 1992a,b; Kong et al., 1993; Thompson and Wijsman, 1990).
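As a concrete, deliberately idealized illustration of the complete-data case described above, the sketch below estimates a recombination frequency by counting recombinant chromosomes and forms the likelihood ratio Z against the null value θ0 = 0.5. The binomial model and the function names are illustrative assumptions, not taken from the chapter.

```python
import math

def log_likelihood(theta, recombinants, total):
    # Binomial log-likelihood of observing `recombinants` recombinant
    # chromosomes among `total` informative meioses, given a
    # recombination frequency theta (0 < theta < 1).
    return (recombinants * math.log(theta)
            + (total - recombinants) * math.log(1.0 - theta))

def mle_and_lod(recombinants, total, theta0=0.5):
    # With complete data, the maximum likelihood estimate is simply
    # the observed proportion of recombinant chromosomes.
    theta_hat = recombinants / total
    # Likelihood ratio Z = L(theta_hat) / L(theta0); linkage studies
    # conventionally report log10(Z), the LOD score.
    log10_z = (log_likelihood(theta_hat, recombinants, total)
               - log_likelihood(theta0, recombinants, total)) / math.log(10)
    return theta_hat, log10_z
```

For instance, 12 recombinants among 100 meioses give θ̂ = 0.12, with a LOD score far above the conventional significance threshold of 3, whereas 50 recombinants among 100 give θ̂ = 0.5 and a LOD score of exactly 0.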
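Computationally, nesting the sum marker by marker amounts to a forward pass over a hidden Markov chain whose hidden state is the inheritance pattern at each marker. The sketch below is a generic scaled forward algorithm under that framing, not the authors' implementation; the state space and the transition and emission matrices are left abstract and would, in the genetic setting, be built from θ and the observed genotypes.

```python
import numpy as np

def forward_log_likelihood(init, trans_list, emit_list):
    # Scaled forward pass along the markers, left to right across the map.
    #   init[s]         : prior probability of hidden inheritance state s
    #   trans_list[t]   : transition matrix between markers t and t+1
    #                     (a function of the recombination frequency)
    #   emit_list[t][s] : probability of the observed genotypes at marker t
    #                     given hidden state s
    alpha = init * emit_list[0]
    scale = alpha.sum()
    alpha = alpha / scale
    log_l = np.log(scale)
    for trans, emit in zip(trans_list, emit_list[1:]):
        alpha = (alpha @ trans) * emit
        scale = alpha.sum()      # rescale to prevent numerical underflow
        alpha = alpha / scale
        log_l += np.log(scale)
    return log_l                 # log P(observed data | parameters)
```

Each step costs O(S^2) in the number of hidden states S, so the whole pass is linear in the number of markers n; since S grows exponentially with family size m, this matches the O(n e^(c''m)) behavior noted above.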
These methods exploit the fact that each piece of missing data depends only on local information in the pedigree: the probability distribution of the genotype at genetic marker i in individual j depends only on the genotypes of nearby markers in nearby individuals. Finding better ways to compute the likelihood function remains an important problem from the standpoint of genetics and an excellent test bed for new statistical estimation techniques.
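A toy version of this local-dependence idea can be sketched as follows: each unknown is resampled from its full conditional distribution, which involves only its immediate neighbors along a chain. Everything here (binary states, a single pairwise compatibility function) is a drastic simplification of real pedigree samplers, intended only to show the mechanics of a Gibbs sweep.

```python
import random

def gibbs_sweeps(n_sites, pair_weight, n_sweeps=100, seed=0):
    # Gibbs sampler over a chain of binary unknowns, a stand-in for the
    # missing genotype data along one chromosome. The full conditional
    # of site i involves only its neighbors i-1 and i+1, mirroring the
    # local structure exploited by pedigree samplers.
    rng = random.Random(seed)
    state = [rng.randrange(2) for _ in range(n_sites)]
    for _ in range(n_sweeps):
        for i in range(n_sites):
            weight = [1.0, 1.0]
            for v in (0, 1):
                if i > 0:
                    weight[v] *= pair_weight(state[i - 1], v)
                if i < n_sites - 1:
                    weight[v] *= pair_weight(v, state[i + 1])
            p0 = weight[0] / (weight[0] + weight[1])
            state[i] = 0 if rng.random() < p0 else 1
    return state
```

With a compatibility function that favors agreement between neighbors, for example `lambda a, b: 0.95 if a == b else 0.05`, repeated sweeps tend to drive the chain toward locally consistent configurations; averaging over many sampled configurations approximates the intractable sum over the complete data.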