Chapter 4

Invited Session on Record Linkage Methodology

Chair: Nancy Kirkendall, Office of Management and Budget

Authors:

Thomas R.Belin, University of California—Los Angeles, and Donald B.Rubin, Harvard University

Michael D.Larsen, Harvard University

Fritz Scheuren, Ernst and Young, LLP and William E.Winkler, Bureau of the Census



Record Linkage Techniques—1997: Proceedings of an International Workshop and Exposition A Method for Calibrating False-Match Rates in Record Linkage* Thomas R.Belin, UCLA and Donald B.Rubin, Harvard University Specifying a record-linkage procedure requires both (1) a method for measuring closeness of agreement between records, typically a scalar weight, and (2) a rule for deciding when to classify records as matches or non matches based on the weights. Here we outline a general strategy for the second problem, that is, for accurately estimating false-match rates for each possible cutoff weight. The strategy uses a model where the distribution of observed weights are viewed as a mixture of weights for true matches and weights for false matches. An EM algorithm for fitting mixtures of transformed-normal distributions is used to find posterior modes; associated posterior variability is due to uncertainty about specific normalizing transformations as well as uncertainty in the parameters of the mixture model, the latter being calculated using the SEM algorithm. This mixture-model calibration method is shown to perform well in an applied setting with census data. Further, a simulation experiment reveals that, across a wide variety of settings not satisfying the model's assumptions, the procedure is slightly conservative on average in the sense of overstating false-match rates, and the one-sided confidence coverage (i.e., the proportion of times that these interval estimates cover or overstate the actual false-match rate) is very close to the nominal rate. KEY WORDS: Box-Cox transformation; Candidate matched pairs; EM algorithm; Mixture model; SEM algorithm; Weights. 1. AN OVERVIEW OF RECORD LINKAGE AND THE PROBLEM OF CALIBRATING FALSE-MATCH RATES 1.1 General Description of Record Linkage Record linkage (or computer matching, or exact matching) refers to the use of an algorithmic technique to identify records from different data bases that correspond to the same individual. Record-linkage techniques are used in a variety of settings; the current work was formulated and first applied in the context of record linkage between the census and a large-scale postenumeration survey (the PES), which comprises the first step of an extensive matching operation conducted to evaluate census coverage for subgroups of the population (Hogan 1992). The goal of this first step is to declare as many records as possible “matched” without an excessive rate of error, thereby avoiding the cost of the resulting manual processing for all records not declared “matched.” Specifying a record-linkage procedure requires both a method for measuring closeness of agreement between records and a rule using this measure for deciding when to classify records as matches. Much attention has been paid in the record-linkage literature to the problem of assigning “weights” to individual fields of information in a multivariate record and obtaining a “composite weight” that summarizes the closeness of agreement between two records (see, for example, Copas and Hilton 1990; Fellegi and Sunter 1969; Newcombe 1988; and Newcombe, Kennedy, Axford, and James 1959). Somewhat less attention has been paid to the problem of deciding when to classify records as matches, although various approaches have been offered by Tepping (1968), Fellegi and Sunter (1969), Rogot, Sorlie, and Johnson (1986), and Newcombe (1988). 
Our work focuses on the second problem by providing a predicted probability of match for two records, with associated standard error, as a function of the composite weight. The context of our problem, computer matching of census records, is typical of record linkage. After data collection, preprocessing of data, and determination of weights, the next step is the assignment of candidate matched pairs, where each pair of records consists of the best potential match for each other from the respective data bases (cf. “hits” in Rogot, Sorlie, and Johnson 1986; “pairs” in Winkler 1989; “assigned pairs” in Jaro 1989). According to specified rules, a scalar weight is assigned to each candidate pair, thereby ordering the pairs. The final step of the record-linkage procedure is viewed as a decision problem where three actions are possible for each candidate matched pair: declare the two records matched, declare the records not matched, or send both records to be reviewed more closely (see, for example, Fellegi and Sunter 1969). In the motivating problem at the U.S. Census Bureau, a binary choice is made between the alternatives “declare matched” versus “send to followup,” although the matching procedure attempts to draw distinctions within the latter group to make manual matching easier for follow-up clerks. In such a setting, a cutoff weight is needed above which records are declared matched; the false-match rate is then defined as the number of falsely matched pairs divided by the number of declared matched pairs. Particularly relevant for any such decision problem is an accurate method for assessing the probability that a candidate matched pair is a correct match as a function of its weight.

1.2 The Need for Better Methods of Classifying Records as Matches or Nonmatches

Belin (1989a, 1989b, 1990) studied various weighting procedures (including some suggested by theory, some used in practice, and some new simple ad hoc weighting schemes) in the census matching problem and reached three primary conclusions.

* Thomas R. Belin is Assistant Professor, Department of Biostatistics, UCLA School of Public Health, Los Angeles, CA 90024. Donald B. Rubin is Professor, Department of Statistics, Harvard University, Cambridge, MA 02138. The authors would like to thank William Winkler of the U.S. Census Bureau for a variety of helpful discussions. The authors also gratefully acknowledge the support of Joint Statistical Agreements 89–07, 90–23, and 91–08 between the Census Bureau and Harvard University, which helped make this research possible. Much of this work was done while the first author was working for the Record Linkage Staff of the Census Bureau; the views expressed are those of the authors and do not necessarily reflect those of the Census Bureau.

First, different weighting procedures lead to comparable accuracy in computer matching. Second, as expected logically and from previous work (e.g., Newcombe 1988, p. 144), the false-match rate is very sensitive to the setting of a cutoff weight above which records will be declared matched. Third, and more surprising, current methods for estimating the false-match rate are extremely inaccurate, typically grossly optimistic. To illustrate this third conclusion, Table 1 displays empirical findings from Belin (1990) with test-census data on the performance of the procedure of Fellegi and Sunter (1969), which relies on an assumption of independence of agreement across fields of information.

Table 1. Performance of Fellegi-Sunter Cutoff Procedure on 1986 Los Angeles Test-Census Data
(acceptable false-match rate specified by the user of the matching program, followed by the observed false-match rate among declared matched pairs)

.05      .0627
.04      .0620
.03      .0620
.02      .0619
.01      .0497
10^−3    .0365
10^−4    .0224
10^−5    .0067
10^−6    .0067
10^−7    .0067

That the Fellegi-Sunter procedure for estimating false-match rates does not work well (i.e., is poorly calibrated) may not be so surprising in this setting, because the census data being matched do not conform well to the model of mutual independence of agreement across the fields of information (see, for example, Kelley 1986 and Thibaudeau 1989). Other approaches to estimating false-match rates that rely on strong independence assumptions (e.g., Newcombe 1988) can be criticized on similar grounds. Although the Fellegi-Sunter approach to setting cutoff weights was originally included in census/PES matching operations (Jaro 1989), in the recent past (including in the 1990 Census) the operational procedure for classifying record pairs as matches has been to have a human observer establish cutoff weights manually by “eyeballing” lists of pairs of records brought together as candidate matches. This manual approach is easily criticized, both because the error properties of the procedure are unknown and variable and because, when linkage is done in batches at different times or by different persons, inconsistent standards are apt to be applied across batches. Another idea is to use external data to help solve this classification problem. For example, Rogot, Sorlie, and Johnson (1986) relied on extreme order statistics from pilot data to determine cutoffs between matches and nonmatches; but this technique can be criticized, because extreme order statistics may vary considerably from sample to sample, especially when sample sizes are not large. One other possibility, discussed by Tepping (1968), requires clerical review of samples from the output of a record-linkage procedure to provide feedback on error rates to refine the calibration procedure. Such feedback is obviously desirable, but in many applications, including the census/PES setting, it is impossible to provide it promptly enough. A more generally feasible strategy is to use the results of earlier record-linkage studies in which all candidate matched pairs have been carefully reviewed by clerks. This type of review is common practice in operations conducted by the Census Bureau. Each such training study provides a data set in which each candidate pair has its weight and an outcome, defined as true match or false match, and thus provides information for building a model to give probability of match as a function of weight.
1.3 A Proposed Solution to the Problem of Calibrating Error Rates There are two distinct approaches to estimating the relationship between a dichotomous outcome, Zi = 1 if match and Zi = 0 if nonmatch, from a continuous predictor, the weight, Wi: the direct approach, typified by logistic regression, and the indirect approach, typified by discriminant analysis. In the direct approach, an iid model is of the form f(Zi|Wi, ν) × g(Wi|ζ), where g(Wi|ζ), the marginal distribution of Wi, is left unspecified with ζ a priori independent of ν. In the indirect approach, the iid model is of the form h(Wi|Zi,φ)[λZi(1 − λ)(1−Zi)], where the first factor specifies, for example, a normal conditional distribution of Wi for Zi = 0 and for Zi = 1 with common variance but different means, and the second factor specifies the marginal probability of Zi = 1, λ, which is a priori independent of φ. Under this approach, P(Zi|Wi) is found using Bayes's theorem from the other model specifications as a function of φ and λ. Many authors have discussed distinctions between the two approaches, including Halperin, Blackwelder, and Verter (1971), Mantel and Brown (1974), Efron (1975), and Dawid (1976). In our setting, application of the direct approach would involve estimating f(Zi|Wi, ν) in observed sites where determinations of clerks had established Zi, and then applying the estimated value of ν to the current site with only Wi observed to estimate the probability of match for each candidate pair. If the previous sites differed only randomly from the current sites, or if the previous sites were a subsample of the current data selected on Wi, then this approach would be ideal Also, if there were many previous sites and each could be described by relevant covariates, such as urban/ rural and region of the country, then the direct approach could estimate the distribution of Z as a function of W and covariates and could use this for the current site. Limited experience of ours and of our colleagues at the Census Bureau, who investigated this possibility using 1990 Census data, has resulted in logistic regression being rejected as a method for estimating false-match rates in the census setting (W.E.Winkler 1993, personal communication). But the indirect approach has distinct advantages when, as in our setting, there can be substantial differences among sites that are not easily modeled as a function of covariates and we have substantial information on the distribution of weights given true and false matches, h(• | •). In particular,

Record Linkage Techniques—1997: Proceedings of an International Workshop and Exposition Figure 1. Histograms of Weights for True and False Matches by Site: (a) St. Louis; (b) Columbia; (c) Washington. with the indirect approach, the observed marginal distribution of Wi in the current site is used to help estimate P(Zi|Wi) in this site, thereby allowing systematic site to site differences in P(Zi|Wi). In addition, there can be substantial gains in efficiency using the indirect approach when normality holds (Efron 1975), especially when h(Wi|Zi = 1, φ) and h(Wi|Zi = 0, φ) are well separated; (that is, when the number of standard deviations difference between their means is large). Taking this idea one step further, suppose that previous validated data sets had shown that after a known transformation, the true-match weights were normally distributed, and that after a different known transformation, the false-match weights were normally distributed. Then, after inverting the transformations, P(Zi|Wi) could be estimated in the current site by fitting a normal mixture model, which would estimate the means and variances of the two normal components (i.e., φ) and the relative frequency of the two components (i.e., λ), and then applying Bayes's theorem. In this example, instead of assuming a common P(Zi|Wi) across all sites, only the normality after the fixed transformations would be assumed common across sites. If there were many sites with covariate descriptors, then (λ, φ) could be modeled as a function of these, for example, a linear model structure on the normal means. To illustrate the application of our work, we use available test-census data consisting of records from three separate sites of the 1988 dress rehearsal Census and PES: St. Louis, Missouri, with 12,993 PES records; a region in East Central Missouri including Columbia, Missouri, with 7,855 PES records; and a rural area in eastern Washington state, with only 2,318 records. In each site records were reviewed by clerks, who made a final determination as to the actual match status of each record; for the purpose of our discussion, the clerks ' determinations about the match status of record pairs are regarded as correct The matching procedures used in the 1988 test Census have been documented by Brown et al. (1988), Jaro (1989), and Winkler (1989). Beyond differences in the sizes of the PES files, the types of street addresses in the areas offer considerably different amounts of information for matching purposes; for instance, rural route addresses, which were common in the Washington site but almost nonexistent in the St. Louis site, offer less information for matching than do most addresses commonly found in urban areas. Figure 1 shows histograms of both true-match weights and false-match weights from each of the three sites. The bimodality in the true-match distribution for the Washington site appears to be due to some record pairs agreeing on address information and some not agreeing. This might generate concern, not so much for lack of fit in the center of the distribution as for lack of fit in the tails, which are essential to false-match rate estimation. Of course, it is not surprising that validated data dispel the assumption of normality for true-match weights and false-match weights. 
They do, however—at least at a coarse level in their apparent skewness—tend to support the idea of a similar nonnormal distributional shape for true-match weights across sites as well as a similar nonnormal distributional shape for false-match weights across sites. Moreover, although the locations of these distributions change from site to site, as do the relative frequencies of the true-match to the false-match components, the relative spread of the true to false components is similar across sites.
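The essence of the indirect approach can be illustrated with a few lines of code. The sketch below is only illustrative: the component means, standard deviations, and mixing proportion are invented numbers standing in for the quantities that the mixture-model fit of Section 2 would estimate, and for simplicity the weight is treated as already being on a transformed (normal) scale.

from scipy.stats import norm

# Illustrative parameters (not estimates from the paper): true-match and
# false-match weight distributions, plus the proportion lambda of false
# matches among candidate pairs.
mu_true, sigma_true = 12.0, 3.0
mu_false, sigma_false = -2.0, 4.0
lam = 0.15

def prob_false_match(w):
    """P(false match | weight = w) by Bayes's theorem under a two-component
    normal mixture; in the paper's method, w would first be power-transformed."""
    f_false = lam * norm.pdf(w, mu_false, sigma_false)
    f_true = (1 - lam) * norm.pdf(w, mu_true, sigma_true)
    return f_false / (f_false + f_true)

for w in (-4, 0, 4, 8, 12):
    print(w, round(prob_false_match(w), 3))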

Record Linkage Techniques—1997: Proceedings of an International Workshop and Exposition These observations lead us to formulate a transformed-normal mixture model for calibrating false-match rates in record-linkage settings. In this model, two power (or Box-Cox) transformations are used to normalize the false-match weights and the true-match weights, so that the observed raw weights in a current setting are viewed as a mixture of two transformed normal observations. Mixture models have been used in a wide variety of statistical applications (see Titterington, Smith, and Makov 1985, pp. 16–21, for an extensive bibliography). Power transformations are also used widely in statistics, prominently in an effort to satisfy normal theory assumptions in regression settings (see, for example, Weisberg 1980, pp. 147–151). To our knowledge, neither of these techniques has been utilized in record linkage operations, nor have mixtures of transformed-normal distributions, with different transformations in the groups, appeared previously in the statistical literature, even though this extension is relatively straightforward. The most closely related effort to our own of which we are aware is that of Maclean, Morton, Elston, and Yee (1976), who used a common power transformation for different components of a mixture model, although their work focused on testing for the number of mixture components. Section 2 describes the technology for fitting mixture models with components that are normally distributed after application of a power transformation, which provides the statistical basis for the proposed calibration method. This section also outlines the calibration procedure itself, including the calculation of standard errors for the predicted false-match rate. Section 3 demonstrates the performance of the method in the applied setting of matching the Census and PES, revealing it to be quite accurate. Section 4 summarizes a simulation experiment to gauge the performance of the calibration procedure in a range of hypothetical settings and this too supports the practical utility of the proposed calibration approach. Section 5 concludes the article with a brief discussion. 2. CALIBRATING FALSE-MATCH RATES IN RECORD LINKAGE USING TRANSFORMED-NORMAL MIXTURE MODELS 2.1 Strategy Based on Viewing Distribution of Weights as Mixture We assume that a univariate composite weight has been calculated for each candidate pair in the record-linkage problem at hand, so that the distribution of observed weights is a mixture of the distribution of weights for true matches and the distribution of weights for false matches. We also assume the availability of at least one training sample in which match status (i.e., whether a pair of records is a true match or a false match) is known for all record pairs. In our applications, training samples come from other geographical locations previously studied. We implement and study the following strategy for calibrating the false-match rate in a current computer-matching problem: Use the training sample to estimate “global” parameters, that is, the parameters of the transformations that normalize the true- and false-match weight distributions and the parameter that gives the ratio of variances between the two components on the transformed scale. 
The term “global” is used to indicate that these parameters are estimated by data from other sites and are assumed to be relatively constant from site to site, as opposed to “site-specific ” parameters, which are assumed to vary from site to site and are estimated only by data from the current site. Fix the values of the global parameters at the values estimated from the training sample and fit a mixture of transformed-normal distributions to the current site's weight data to obtain maximum likelihood estimates (MLE's) and associated standard errors of the component means, component variances, and mixing proportion. We use the EM algorithm (Dempster, Laird, and Rubin 1977) to obtain MLE 's and the SEM algorithm (Meng and Rubin 1991) to obtain asymptotic standard errors. For each possible cutoff level for weights, obtain a point estimate for the false-match rate based on the parameter estimates from the model and obtain an estimate of the standard error of the false-match rate. In calculating standard errors, we rely on a large-sample approximation that makes use of the estimated covariance matrix obtained from the SEM algorithm. An appealing modification of this approach, which we later refer to as our “full strategy,” reflects uncertainty in global parameters through giving them prior distributions. Then, rather than fixing the global parameters at their estimates from the training sample, we can effectively integrate over the uncertainty in the global parameters by modifying Step 2 to be: 2'. Draw values of the global parameters from their posterior distribution given training data, fix global parameters at their drawn values, and fit a mixture of transformed-normal distributions to the current weight data to obtain MLE's (and standard errors) of site-specific parameters; and adding: 4. Repeat Steps 2' and 3 a few or several times, obtaining false-match rate estimates and standard errors from each repetition, and combine the separate estimates and standard errors into a single point estimate and standard error that reflect uncertainty in the global parameters using the multiple imputation framework of Rubin (1987). We now describe how to implement each of these steps. 2.2 Using a Training Sample to Estimate Global Parameters Box and Cox (1964) offered two different parameterizations for the power family of transformations: one that ignores the scale of the observed data, and the other—which we will use—that scales the transformations by a function of the observed data so that the Jacobian is unity. We denote the family of transformations by (1)

ψ(w; γ, ω) = (w^γ − 1)/(γ ω^(γ−1)) for γ ≠ 0, and ψ(w; γ, ω) = ω log(w) for γ = 0,
Record Linkage Techniques—1997: Proceedings of an International Workshop and Exposition where ω is the geometric mean, of the observations w1, . . . . wn. By “transformed-normal distribution,” we mean that for some unknown values of γ and ω, the transformed observations ψ(wi; γ, ω) (i = 1, . . ., n) are normally distributed. Although the sample geometric mean is determined by the data, we will soon turn to a setting involving a mixture of two components with different transformations to normality, in which even the sample geometric means of the two components are unknown; consequently, we treat a as an unknown parameter, the population geometric mean. When the transformations are not scaled by the geometric-mean factor, as Box and Cox (1964, p. 217) noted, “the general size and range of the transformed observations may depend strongly on [γ].” Of considerable interest in our setting is that when transformations are scaled, not only are the likelihoods for different values of γ directly comparable, at least asymptotically, but also, by implication, so are the residual sums of squares on the transformed scales for different values of γ. In other words, scaling the transformations by ωγ−1 has the effect asymptotically of unconfounding the estimated variance on the transformed scale from the estimated power parameter. This result is important in the context of fitting mixtures of transformed-normal distributions when putting constraints on component variances in the fitting of the mixture model; by using scaled transformations, we can constrain the variance ratio without reference to the specific power transformation that has been applied to the data Box and Cox (1964) also considered an unknown location parameter in the transformation, which may be needed because power transformations are defined only for positive random variables. Because the weights that arise from recordlinkage procedures are often allowed to be negative, this issue is relevant in our application. Nevertheless, Belin (1991) reported acceptable results using an ad hoc linear transformation of record-linkage weights to a range from 1 to 1,000. Although this ad hoc shift and rescaling is assumed to be present, we suppress the parameters of this transformation in the notation. In the next section we outline in detail a transformed-normal mixture model for record-linkage weights. Fitting this model requires separate estimates of γ and ω for the true-match and false-match distributions observed in the training data, as well as an estimate of the ratio of variances on the transformed scale. The γ's can, as usual, be estimated by performing a grid search of the likelihoods or of the respective posterior densities. A modal estimate of the variance ratio can be obtained as a by-product of the estimation of the γ's. We also obtain approximate large-sample variances by calculating for each parameter a second difference as numerical approximation to the second derivative of the loglikelihood in the neighborhood of the maximum (Belin 1991). In our work we have simply fixed the ω's at their sample values, which appeared to be adequate based on the overall success of the methodology on both real and simulated data; were it necessary to obtain a better fit to the data, this approach could be modified. 2.3 Fitting Transformed Normal Mixtures with Fixed Global Parameters 2.3.1 Background on Fitting Normal Mixtures Without Transformations. 
Suppose that f1 and f2 are densities that depend on an unknown parameter φ, and that the density f is a mixture of f1 and f2, i.e., f(X|φ, λ), = λf1(X|φ) + (1 − λ) f2(X|φ) for some λ between 0 and 1. Given an iid sample (X1, X2. . . , Xn) from f(X|φ, λ), the likelihood of θ = (φ, λ) can then be written as Following the work of many authors (e.g.. Aitkin and Rubin 1985; Dempster et al. 1977; Little and Rubin 1987; Orchard and Woodbury 1972; Titterington et al. 1985), we formulate the mixture model in terms of unobserved indicators of component membership Zi, i = 1, ..., n, where Zi = 1 if Xi comes from component 1 and Zi = 0 if Xi comes from component 2. The mixture model can then be expressed as a hierarchical model, The “complete-data” likelihood, which assumes that the “missing data” Z1, . . . , Zn are observed, can be written as L(φ, λ|X1, . . . , Xn; Z1, . . . , Zn) Viewing the indicators for component membership as missing data motivates the use of the EM algorithm to obtain MLE's of (φ, λ). The E step involves finding the expected value of the Zi's given the data and current parameter estimates φ(t) and λ(t), where t indexes the current iteration. This is computationally straightforward both because the iid structure of the model implies that Zi is conditionally independent of the rest of the data given Xi and because the Zi's are indicator variables, so the expectation of Zi is simply the posterior probability that Zi equals 1. Using Bayes's theorem, the E step at the (t + 1)st iteration thus involves calculating (2) for i = 1, . . . , n. The M step involves solving for MLE's of θ in the “complete-data” problem. In the case where f1 corresponds to the distribution and f2 corresponds to the distribution, so that the M step at

Record Linkage Techniques—1997: Proceedings of an International Workshop and Exposition iteration (t + 1) involves calculating (3) and (4) The updated value of λ at the (t + 1)st iteration is given by (5) which holds no matter what the form of the component densities may be. Instabilities can arise in maximum likelihood estimation for normally distributed components with distinct variances, because the likelihood is unbounded at the boundary of the parameter space where either Unless the starting values for EM are near a local maximum of the likelihood, EM can drift toward the boundary where the resulting fitted model suggests that one component consists of any single observation (with zero variance) and that the other component consists of the remaining observations (Aitkin and Rubin 1985). When a constraint is placed on the variances of the two components, EM will typically converge to an MLE in the interior of the parameter space. Accordingly, a common approach in this setting is to find a sensible constraint on the variance ratio between the two components or to develop an informative prior distribution for the variance ratio. When the variance ratio is assumed fixed, the E step proceeds exactly as in (2) and the M step for and is given by (3); the M step for the scale parameters with fixed V is (6) 2.3.2 Modifications to Normal Mixtures for Distinct Transformations of the Two Components. We now describe EM algorithms for obtaining MLE's of parameters in mixtures of transformed-normal distributions, where there are distinct transformations of each component. Throughout the discussion, we will assume that there are exactly two components; fitting mixtures of more than two components involves straightforward extensions of the arguments that follow (Aitkin and Rubin 1985). We will also assume that the transformations are fixed; that is, we assume that the power parameters (the two γi's) and the “geometric-mean” parameters (the two ωi's) are known in advance and are not to be estimated from the data. We can write the model for a mixture of transformed-normal components as follows: where and the expression “Transformed-N” with four arguments refers to the transformed-normal distribution with the four arguments being the location, scale, power parameter, and “geometric-mean” parameter of the transformed-normal distribution. The complete-data likelihood can be expressed as L(θ|X1, . . . , Xn; Z1, . . . , Zn) where J1 and J2 are the Jacobians of the scaled transformations X → ψ. If ω1 and ω2 were not fixed a priori but instead were the geometric means of the Xi for the respective components, then J1 = J2 = 1. In our situation, however, because the Zi's are missing, J1 and J2 are functions of {Xi}, {Zi}, and θ, and are not generally equal to 1. Still, J1 and J2 are close to 1 when the estimated geometric mean of the sample Xi in component k is close to ωk. We choose to ignore this minor issue; that is, although we model ω1 and ω2 as known from prior considerations, we still assume J1 = J2 = 1. To do otherwise would greatly complicate our estimation procedure with, we expect, no real benefit; we do not blindly believe such fine details of our model in any case, and we would not expect our procedures to be improved by the extra analytic work and computational complexity. 
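As a concrete illustration of the fitting procedure, the following sketch (in Python, not the authors' code) implements the EM iteration of this section for two transformed-normal components with the global parameters held fixed. It assumes the weights have already been shifted to a positive range, as in the ad hoc rescaling mentioned in Section 2.2, treats the Jacobian terms as unity as in the text, and uses arbitrary starting values and iteration count.

import numpy as np
from scipy.stats import norm

def boxcox(w, gamma, omega):
    """Scaled Box-Cox transformation of equation (1)."""
    if gamma == 0:
        return omega * np.log(w)
    return (w ** gamma - 1.0) / (gamma * omega ** (gamma - 1.0))

def em_mixture(w, gamma_f, omega_f, gamma_t, omega_t, V, n_iter=200):
    """EM for a two-component transformed-normal mixture with the global
    parameters (gamma_f, omega_f, gamma_t, omega_t, V) held fixed.
    Component F = false match, T = true match; V = sigma_f**2 / sigma_t**2
    is the fixed variance ratio.  Returns (lambda, mu_f, mu_t, sigma_f, sigma_t)."""
    yf = boxcox(w, gamma_f, omega_f)   # data on the false-match transformed scale
    yt = boxcox(w, gamma_t, omega_t)   # data on the true-match transformed scale
    lam = 0.5
    mu_f = np.percentile(yf, 25)       # false matches tend to have lower weights
    mu_t = np.percentile(yt, 75)
    sigma_t = np.std(yt)
    sigma_f = np.sqrt(V) * sigma_t
    for _ in range(n_iter):
        # E step: posterior probability that each pair is a false match
        f_f = lam * norm.pdf(yf, mu_f, sigma_f)
        f_t = (1 - lam) * norm.pdf(yt, mu_t, sigma_t)
        z = f_f / (f_f + f_t)
        # M step: component means, mixing proportion, and common scale
        # under the fixed variance-ratio constraint
        mu_f = np.sum(z * yf) / np.sum(z)
        mu_t = np.sum((1 - z) * yt) / np.sum(1 - z)
        lam = np.mean(z)
        s_f = np.sum(z * (yf - mu_f) ** 2)
        s_t = np.sum((1 - z) * (yt - mu_t) ** 2)
        sigma_t = np.sqrt((s_f / V + s_t) / len(w))
        sigma_f = np.sqrt(V) * sigma_t
    return lam, mu_f, mu_t, sigma_f, sigma_t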
To keep the distinction clear between the parameters assumed fixed in EM and the parameters being estimated in EM, we partition the parameter into where and and where the variance ratio Based on this formulation, MLE's of can be obtained from the following EM algorithm: E step. For i = 1, . . ., n, calculate as in (2), where (7) M step. Calculate and as in (3), λ(t+1) as in (5), and and as in (6), with Xi replaced by ψ(Xi; γg, ωg) for g = 1, 2; if the variance ratio V were not fixed but were to be estimated, then (4) would be used in place of (6). 2.3.3 Transformed-Normal Mixture Model for Record-Linkage Weights. Let the weights associated with record pairs in a current data set be denoted by Wi, i = 1, . . . , n,

Record Linkage Techniques—1997: Proceedings of an International Workshop and Exposition where as before Zi = 1 implies membership in the false-match component and Zi = 0 implies membership in the true-match component. We assume that we have already obtained, from a training sample, (a) values of the power transformation parameters, denoted by γF for the false-match component and by γT for the true-match component, (b) values of the “geometric mean” parameters in the transformations, denoted by ωF for the false-match component and by ωT for the true-match component, and (c) a value for the ratio of the variances between the false-match and true-match components, denoted by V. Our model then becomes where . We work with and The algorithm of Section 2.3.2, with “F” and “T” substituted for “1” and “2,” describes the EM algorithm for obtaining MLE's of from {Wi; i = 1, . . . , n} with {Zi; i = 1, . . . , n} missing and global parameters γF, γT, ωF, ωT, and V fixed at specified values. 2.4 False-Match Rate Estimates and Standard Errors with Fixed Global Parameters 2.4.1 Estimates of the False-Match Rate. Under the transformed-normal mixture model formulation, the false-match rate associated with a cutoff C can be expressed as a function of the parameters θ as (8) Substitution of MLE's for the parameters in this expression provides a predicted false-match rate associated with cutoff C. Because there is a maximum possible weight associated with perfect agreement in most record-linkage procedures, one could view the weight distribution as truncated above. According to this view, the contribution of the tail above the upper truncation point (say, B), should be discarded by substituting Φ([ψg(B; γg, ωg,) − μg]/σg) for the 1s inside the bracketed expressions (g = F, T as appropriate). Empirical investigation suggests that truncation of the extreme upper tail makes very little difference in predictions. The results in Sections 3 and 4 reflect false-match rate predictions without truncation of the extreme upper tail. 2.4.2 Obtaining an Asymptotic Covariance Matrix for Mixture-Model Parameters From SEM Algorithm. The SEM algorithm (Meng and Rubin 1991) provides a method for obtaining standard errors of parameters in models that are fit using the EM algorithm. The technique uses estimates of the fraction of missing information derived from successive EM iterates to inflate the complete-data variance-covariance matrix to provide an appropriate observed-data variance-covariance matrix. Details on the implementation of the SEM algorithm in our mixture-model setting are deferred to the Appendix. Standard arguments lead to large-sample standard errors for functions of parameters. For example, the false-match rate e(C|θ) can be expressed as a function of the four components of by substituting for σT in (8). Then the squared standard error of the estimated false-match rate is given by SE2 (e) ≈ dT Ad, where A is the covariance matrix for obtained by SEM and the vth component of d is 2.4.3 Estimates of the Probability of False Match for a Record Pair With a Given Weight. The transformed-normal mixture model also provide a framework far estimating the probability of false match associated with various cutoff weights. 
To be clear, we draw a distinction between the “probability of false match” and what we refer to as the “neighborhood false-match rate” to avoid any confusion caused by (1) our using a continuous mixture distribution to approximate the discrete distribution of weights associated with a finite number of record pairs, and (2) the fact that there are only finitely many possible weights associated with many record-linkage weighting schemes. The “neighborhood false-match rate around W” is the number of false matches divided by the number of declared matches among pairs of records with composite weights in a small neighborhood of W; with a specific model, the neighborhood false-match rate is the “probability of false match” implied by the relative density of the true-match and false-match components at W. In terms of the mixture-model parameters, the false-match rate among record pairs with weights between W and W + h is given by where g=F, T, and . Although the number of false matches is not a smooth function of the number of declared matches, ξ(W, h|θ) is a smooth function

Record Linkage Techniques—1997: Proceedings of an International Workshop and Exposition of h. The probability of false match under the transformed-normal mixture model is the limit as h→ 0 of ξ(W, h|θ), which we denote as η(W|θ); we obtain (9) where g = F, T. Estimates of neighborhood false-match rates are thus routinely obtained by substituting the fixed global parameter values and MLE's of μF, μT, σF, σT, and λ into (9). Because the neighborhood false-match rate captures the trade-off between the number of false matches and the number of declared matches, the problem of setting cutoffs can be cast in terms of the question “Approximately how many declared matches are needed to make up for the cost of a false match?” If subject matter experts who are using a record-linkage procedure can arrive at an answer to this question, then a procedure for setting cutoffs could be determined by selecting a cutoff weight where the estimated neighborhood false-match rate equals the appropriate ratio. Alternatively, one could monitor changes in the neighborhood false-match rate (instead of specifying a “tolerable ” neighborhood false-match rate in advance) and could set a cutoff weight at a point just before the neighborhood false-match rate accelerates. 2.5 Reflecting Uncertainty in Global Parameters When there is more than one source of training data, the information available about both within-site and between-site variability in global parameters can be incorporated into the prior specification. For example, with two training sites, we could combine the average within-site variability in a global parameter with a 1 df estimate of between-site variability to represent prior uncertainty in the parameter. With many sites with covariate descriptors, we could model the multivariate regression of global parameters on covariates. The procedure we used in the application to census/PES data offers an illustration in the simple case with two training sites available to calibrate a third site. For each of the training-data sites and each of the components (true-match and false-match), joint MLE's were found for g = F, T, using a simple grid-search over the power parameters. This yielded two estimates of the power parameters, γF and γT, and two estimates of the variance ratio V between the false-match and true-match components. Additionally, an estimated variance-co variance matrix for these three parameters was obtained by calculating second differences of the loglikelihood at grid points near the maximum. Values of each parameter for the mixture-model fitting were drawn from separate truncated-normal distributions with mean equal to the average of the estimates from the two training sites and variance equal to the sum of the squared differences between the individual site parameter values and their mean (i.e., the estimated “between ” variance), plus the average squared standard error from the two prior fittings (i.e., the average “within” variance). The truncation ensured that the power parameter for the false-match component was less than i, that the power parameter for the true-match component was greater than 1, and that the variance ratio was also greater than 1. 
These constraints on the power parameters were based on the view that because there is a maximum possible weight corresponding to complete agreement and a minimum possible weight corresponding to complete disagreement, the true-match component will have a longer left tail than right tail and the false-match component will have a longer right tail than left tail. The truncation for the variance ratio was based on an assumption that false-match weights will exhibit more variability than true-match weights for these data on the transformed scale as well as on the original scale. For simplicity, the geometric-mean terms in the transformations (ωF and ωT) were simply fixed at the geometric mean of the component geometric means from the two previous sites. If the methods had not worked as well as they did with test and simulated data, then we would have also reflected uncertainty in these parameters. Due to the structure of our problem, in which the role of the prior distribution is to represent observable variability in global parameters from training data, we presume that the functional form of the prior is not too important as long as variability in global parameters is represented accurately. That is, we anticipate that any one of a number of methods that reflect uncertainty in the parameters estimated from training data will yield interval estimates with approximately the correct coverage properties; i.e., nominal (1 − α) × 100% interval estimates will cover the true value of the estimand approximately (1 − α) × 100% or more of the time. Alternative specifications for the prior distribution were described by Belin (1991). When we fit multiple mixture models to average over uncertainty in the parameters estimated by prior data (i.e., when we use the “full strategy” of Section 2.1), the multiple-imputation framework of Rubin (1987) can be invoked to combine estimates and standard errors from the separate models to provide one inference. Suppose that we fit m mixture models corresponding to m separate draws of the global parameters from their priors and thereby obtain false-match rate estimates e1, e2, . . . , em and variance estimates u1, u2, . . . , um, where uj = SE2(ej) is obtained by the method of Section 2.4.2. Following Rubin (1987, p. 76), we can estimate the false-match rate by the average ē = (1/m) Σ ej, and its squared standard error by T = ū + (1 + 1/m)B, where ū = (1/m) Σ uj is the average within-model variance and B = (1/(m − 1)) Σ (ej − ē)² is the between-model variance.
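The combining step itself is simple to implement; the following sketch applies the rules just stated to the m estimates and variances (the numbers in the example are invented).

import numpy as np

def combine_mi(estimates, variances):
    """Rubin (1987) combining rules for m repeated fits: point estimate is the
    average; squared standard error = within-variance + (1 + 1/m) * between-variance."""
    e = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(e)
    e_bar = e.mean()
    b = e.var(ddof=1)                      # between-model variance
    t = u.mean() + (1.0 + 1.0 / m) * b     # total variance
    return e_bar, np.sqrt(t)

# Example with m = 3 hypothetical false-match rate fits
print(combine_mi([0.012, 0.015, 0.011], [1.0e-5, 1.2e-5, 0.9e-5]))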

Record Linkage Techniques—1997: Proceedings of an International Workshop and Exposition Monte Carlo evaluations documented by Rubin (1987, secs. 4.6–4.8) illustrate that even a few imputations (m = 2, 3, or 5) are enough to produce very reasonable coverage properties for interval estimates in many cases. The combination of estimation procedures that condition on global parameters with a multiple-imputation procedure to obtain inferences that average over those global parameters is a powerful technique that can be applied quite generally. 3. PERFORMANCE OF CALIBRATION PROCEDURE ON CENSUS COMPUTER-MATCHING DATA 3.1 Results From Test-Census Data We use the test-census data described in Section 1.3 to illustrate the performance of the proposed calibration procedure, where determinations by clerks are the best measures available for judging true-match and false-match status. With three separate sites available, we were able to apply our strategy three times, with two sites serving as training data and the mixture-model procedure applied to the data from the third site. We display results from each of the three tests in Figure 2. The dotted curve represents predicted false-match rates obtained from the mixture-model procedure, with accompanying 95% intervals represented by the dashed curves. Also plotted are the observed false-match rates, denoted by the “O” symbol, associated with each of several possible choices of cutoff values between matches and nonmatches. We call attention to several features of these plots. First, it is clearly possible to match large proportions of the files with little or no error. Second, the quality of candidate matches becomes dramatically worse at some point where the false-match rate accelerates. Finally, the calibration procedure performs very well in all three tests from the standpoint of providing predictions that are close to the true values and interval estimates that include the true values. In Figure 3 we take a magnifying glass to the previous displays to highlight the behavior of the calibration procedure in the region of interest where the false-match rate accelerates. That the predicted false-match rate curves bend upward close to the points where the observed false-match rate curves rise steeply is a particularly encouraging feature of the calibration method. For comparison with the logistic-regression approach, we report in Table 2 (p. 704) the estimated false-match rates across the various sites for records with weights in the interval [−5, 0], which in practice contains both true matches and false matches. Two alternative logistic regression models— one in which logit(η) is modeled as a linear function of matching weight and the other in which logit(η) is modeled as a quadratic function of matching weight, where η is the probability of false match—were fitted to data from two sites to predict false-match rates in the third site. A predictive standard error to reflect binomial sampling variability, as well as uncertainty in parameter estimation, was calculated using Figure 2. Performance of Calibration Procedure on Test-Census Data: (a) St. Louis, Using Columbia and Washington as Training Sample; (b) Columbia, Using St. Louis and Washington as Training Sample; (c) Washington. Using St. Louis and Columbia as Training Sample. O = observed false-match rate; . . . = predicted false-match rate; . . . = upper and lower 95% bounds.
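For reference, the observed false-match rates plotted as “O” in Figure 2 are straightforward to compute from clerk-validated pairs; a small sketch (with invented data) is given below.

import numpy as np

def observed_false_match_rate(weights, is_false_match, cutoff):
    """Observed false-match rate at a cutoff: (# falsely matched pairs with
    weight >= cutoff) / (# pairs declared matched at that cutoff)."""
    weights = np.asarray(weights)
    declared = weights >= cutoff
    if declared.sum() == 0:
        return float("nan")
    return np.asarray(is_false_match)[declared].mean()

# Hypothetical clerk-validated pairs: composite weights and match status (1 = false match)
w = np.array([14.2, 11.7, 9.3, 6.1, 2.0, -1.5, -3.8])
false_flag = np.array([0, 0, 0, 0, 1, 1, 1])
for c in (10, 5, 0, -5):
    print(c, observed_false_match_rate(w, false_flag, c))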

introduced operations-research-based methods that both provided a means of checking the logical consistency of an edit system and assured that an edit-failing record could always be updated with imputed values, so that the revised record satisfies all edits. An additional advantage of Fellegi-Holt systems is that their edit methods tie directly with current methods of imputing microdata (e.g., Little and Rubin 1987). Although we will only consider continuous data in this paper, EI techniques also hold for discrete data and combinations of discrete and continuous data. In any event, suppose we have continuous data. In this case a collection of edits might consist of rules for each record of the form c1X < Y < c2X. In words, Y can be expected to be greater than c1X and less than c2X; hence, if Y is less than c1X or greater than c2X, then the data record should be reviewed (with resource and other practical considerations determining the actual bounds used). Here Y may be total wages, X the number of employees, and c1 and c2 constants such that c1 < c2. When an (X, Y) pair associated with a record fails an edit, we may replace, say, Y with an estimate (or prediction).

Record Linkage

A record linkage process attempts to classify pairs in a product space A × B from two files A and B into M, the set of true links, and U, the set of true nonlinks. Making rigorous the concepts introduced by Newcombe (e.g., Newcombe et al., 1959; Newcombe et al., 1992), Fellegi and Sunter (1969) considered ratios R of probabilities of the form

R = Pr(γ ∈ Γ | M) / Pr(γ ∈ Γ | U),

where γ is an arbitrary agreement pattern in a comparison space Γ. For instance, Γ might consist of eight patterns representing simple agreement or not on surname, first name, and age. Alternatively, each γ ∈ Γ might additionally account for the relative frequency with which specific surnames, such as Scheuren or Winkler, occur. The fields compared (surname, first name, age) are called matching variables. The decision rule is given by:

If R > Upper, then designate the pair as a link.
If Lower ≤ R ≤ Upper, then designate the pair as a possible link and hold it for clerical review.
If R < Lower, then designate the pair as a nonlink.

Fellegi and Sunter (1969) showed that this decision rule is optimal in the sense that, for any pair of fixed bounds on R, the middle region is minimized over all decision rules on the same comparison space Γ. The cutoff thresholds, Upper and Lower, are determined by the error bounds. We call the ratio R, or any monotonely increasing transformation of it (typically a logarithm), a matching weight or total agreement weight.
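The decision rule is easy to state in code. The sketch below assumes conditional independence of agreement across the three matching variables and uses invented m- and u-probabilities and thresholds; it is meant only to make the structure of R and of the link/possible-link/nonlink classification concrete.

import math

# Illustrative m- and u-probabilities for three matching variables
# (surname, first name, age); these values are assumptions, not from the paper.
M_PROB = {"surname": 0.95, "first_name": 0.90, "age": 0.85}
U_PROB = {"surname": 0.01, "first_name": 0.02, "age": 0.10}

def matching_weight(agreement):
    """Log of R = Pr(pattern | M) / Pr(pattern | U) under conditional independence."""
    log_r = 0.0
    for field, agrees in agreement.items():
        m, u = M_PROB[field], U_PROB[field]
        log_r += math.log(m / u) if agrees else math.log((1 - m) / (1 - u))
    return log_r

def fellegi_sunter_decision(agreement, lower, upper):
    """Classify a candidate pair as link, possible link (clerical review), or nonlink."""
    w = matching_weight(agreement)
    if w > upper:
        return "link"
    if w >= lower:
        return "possible link (clerical review)"
    return "nonlink"

# Example: agreement on surname and first name, disagreement on age.
print(fellegi_sunter_decision(
    {"surname": True, "first_name": True, "age": False}, lower=0.0, upper=6.0))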

Record Linkage Techniques—1997: Proceedings of an International Workshop and Exposition With the availability of inexpensive computing power, there has been an outpouring of new work on record linkage techniques (e.g., Jaro, 1989, Newcombe, Fair, Lalonde, 1992, Winkler, 1994, 1995). The new computer-intensive methods reduce, or even sometimes eliminate, the need for clerical review when name, address, and other information used in matching is of reasonable quality. The proceedings from a recently concluded international conference on record linkage showcases these ideas and might be the best single reference (Alvey and Jamerson, 1997). Simulation Setting Matching Scenarios For our simulations, we considered a scenario in which matches are virtually indistinguishable from nonmatches. In our earlier work (Scheuren and Winkler, 1993), we considered three matching scenarios in which matches are more easily distinguished from nonmatches than in the scenario of the present paper. In both papers, the basic idea is to generate data having known distributional properties, adjoin the data to two files that would be matched, and then to evaluate the effect of increasing amounts of matching error on analyses. Because the methods of this paper work better than what we did earlier, we only consider a matching scenario that we label “Second Poor,” because it is more difficult than the poor (most difficult) scenario we considered previously. We started here with two population files (sizes 12,000 and 15,000), each having good matching information and for which true match status was known. The settings were examined: high, medium and low—depending on the extent to which the smaller file had cases also included in the larger file. In the high file inclusion situation, about 10,000 cases are on both files for an file inclusion or intersection rate on the smaller or base file of about 83%. In the medium file intersection situation, we took a sample of one file so that the intersection of the two files being matched was approximately 25%. In the low file intersection situation, we took samples of both files so that the intersection of the files being matched was approximately 5%. The number of intersecting cases, obviously, bounds the number of true matches that can be found. We then generated quantitative data with known distributional properties and adjoined the data to the files. These variations are described below and displayed in Figure 1 where we show the poor scenario (labeled “first poor”) of our previous 1993 paper and the “second poor” scenario used in this paper. In the figure, the match weight, the logarithm of R, is plotted on the horizontal axis with the frequency, also expressed in logs, plotted on the vertical axis. Matches (or true links) appear as asterisks (*), while nonmatches (or true nonlinks) appear as small circles (o). “First Poor” Scenario (Figure 1a) The first poor matching scenario consisted of using last name, first name, one address variation, and age. Minor typographical errors were introduced independently into one fifth of the last names and one third of the first names in one of the files. Moderately severe typographical errors were made independently in one fourth of the addresses of the same file. Matching probabilities were chosen that deviated substantially from optimal. The intent was for the links to be made in a manner that a practitioner might choose after gaining only a little experience. 
The situation is analogous to that of using administrative lists of individuals where information used in matching is of poor quality. The true mismatch rate here was 10.1%.

Record Linkage Techniques—1997: Proceedings of an International Workshop and Exposition Figure 1a. 1st Poor Matching Scenario “Second Poor” Scenario (Figure 1b) The second poor matching scenario consisted of using last name, first name, and one address variation. Minor typographical errors were introduced independently into one third of the last names and one third of the first names in one of the files. Severe typographical errors were made in one fourth of the addresses in the same file. Matching probabilities were chosen that deviated substantially from optimal. The intent was to represent situations that often occur with lists of businesses in which the linker has little control over the quality of the lists. Name information—a key identifying characteristic —is often very difficult to compare effectively with business lists. The true mismatch rate was 14.6%. Figure 1b. 2nd Poor Matching Scenario

Record Linkage Techniques—1997: Proceedings of an International Workshop and Exposition Summary of Matching Scenarios Clearly, depending on the scenario, our ability to distinguish between true links and true nonlinks differs significantly. With the first poor scenario, the overlap, shown visually between the log-frequency-versus-weight curves is substantial (Figure 1a); and, with the second poor scheme, the overlap of the log-frequency-versus-weight curves is almost total (Figure 1b). In the earlier work, we showed that our theoretical adjustment procedure worked well using the known true match rates in our data sets. For situations where the curves of true links and true nonlinks were reasonably well separated, we accurately estimated error rates via a procedure of Belin and Rubin (1995) and our procedure could be used in practice. In the poor matching scenario of that paper (first poor scenario of this paper), the Belin-Rubin procedure was unable to provide accurate estimates of error rates but our theoretical adjustment procedure still worked well. This indicated that we either had to find an enhancement to the Belin-Rubin procedures or to develop methods that used more of the available data. (That conclusion, incidentally, from our earlier work led, after some false starts, to the present approach.) Quantitative Scenarios Having specified the above linkage situations, we used SAS to generate ordinary least squares data under the model Y = 6 X + e,. The X values were chosen to be uniformly distributed between 1 and 101. The error terms, are normal and homoscedastic with variances 13000, 36000, and 125000, respectively. The resulting regressions of Y on X have R2 values in the true matched population of 70%, 47%, and 20%, respectively. Matching with quantitative data is difficult because, for each record in one file, there are hundreds of records having quantitative values that are close to the record that is a true match. To make modeling and analysis even more difficult in the high file overlap scenario, we used all false matches and only 5% of the true matches; in the medium file overlap scenario, we used all false matches and only 25% of true matches. (Note: Here to heighten the visual effect, we have introduced another random sampling step, so the reader can “see” better in the figures the effect of bad matching. This sample depends on the match status of the case and is confined only to those cases that were matched, whether correctly or falsely.) A crucial practical assumption for the work of this paper is that analysts are able to produce a reasonable model (guesstimate) for the relationships between the noncommon quantitative items. For the initial modeling in the empirical example of this paper, we use the subset of pairs for which matching weight is high and the error-rate is low. Thus, the number of false matches in the subset is kept to a minimum. Although neither the procedure of Belin and Rubin (1995) nor an alternative procedure of Winkler (1994), that requires an ad hoc intervention, could be used to estimate error rates, we believe it is possible for an experienced matcher to pick out a low-error-rate set of pairs even in the second poor scenario. Simulation Results Most of this Section is devoted to presenting graphs and results of the overall process for the second poor scenario, where the R2 value is moderate, and the intersection between the two files is high. These results best illustrate the procedures of this paper. 
At the end of the section (in subsection 4.8), we summarize results over all R2 situations and all overlaps. To make the modeling more difficult and to show the power of the analytic linking methods, we use all false matches and a random sample of only 5% of the true matches. We consider only pairs having a matching weight above a lower bound that we determine from analytic considerations and experience. For the pairs in our analysis, this restriction causes the number of false matches to exceed the number of true matches significantly. (Again, this is done to heighten the visual effect of matching failures and to make the problem even more difficult.)
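Before turning to the graphs, it may help to see the data-generation step of the Quantitative Scenarios subsection in code. The paper used SAS; the sketch below is an illustrative Python translation under the same specification (Y = 6X + error, X uniform on (1, 101), error variances 13000, 36000, and 125000), not the authors' program. It also lets one check the implied population R2 values against those quoted in the text.

    import numpy as np

    def generate_scenario(n, error_variance, seed=0):
        # Y = 6X + e, with X uniform on (1, 101) and homoscedastic normal errors.
        rng = np.random.default_rng(seed)
        x = rng.uniform(1, 101, size=n)
        y = 6.0 * x + rng.normal(0.0, np.sqrt(error_variance), size=n)
        return x, y

    # Var(6X) = 36 * (100**2 / 12) = 30000, so the population R2 is
    # 30000 / (30000 + error_variance) for each scenario.
    for sigma2 in (13000, 36000, 125000):
        x, y = generate_scenario(5000, sigma2)
        print(f"error variance {sigma2:>6}: sample R2 = {np.corrcoef(x, y)[0, 1] ** 2:.2f}")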

To illustrate the data situation and the modeling approach, we provide triples of plots. The first plot in each triple shows the true data situation, as if each record in one file were linked to its true corresponding record in the other file; the quantitative data pairs correspond to the truth. The second plot shows the observed data, in which a high proportion of the pairs is in error because they correspond to false matches. To get to the third plot in the triple, we fit a model using a small number of pairs (approximately 100) and then edit the outliers, replacing the observed Y-value with a predicted Y-value.

Initial True Regression Relationship

In Figure 2a, the true regression relationship and the related scatterplot are shown, for one of our simulations, as they would appear if there were no matching errors. In this figure and the remaining ones, the true regression line is always given for reference. Finally, the true population slope or beta coefficient (5.85) and the R2 value (43%) are provided for the data (the sample of pairs) being displayed.

Figure 2a. 2nd Poor Scenario, 1st Pass, All False & 5% True Matches, True Data, High Overlap, 1104 Points, beta=5.85, R-square=0.43

Regression after Initial RL ⇒ RA Step

In Figure 2b, we are looking at the regression on the actual observed links: not what should have happened in a perfect world, but what did happen in a very imperfect one. Unsurprisingly, we see only a weak regression relationship between Y and X. The observed slope or beta coefficient differs greatly from its true value (2.47 v. 5.85). The fit measure is similarly affected, falling to 7% from 43%.

Figure 2b. 2nd Poor Scenario, 1st Pass, All False & 5% True Matches, Observed Data, High Overlap, 1104 Points, beta=2.47, R-square=0.07
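The attenuation shown in Figure 2b is a generic consequence of regressing on falsely matched pairs. The short sketch below, our own illustration rather than part of the original study, pairs a fraction of the X values with the wrong Y values and refits the line; as the false-match fraction grows, the fitted slope and R2 collapse in the same qualitative way the text reports.

    import numpy as np

    rng = np.random.default_rng(7)
    n = 1104
    x = rng.uniform(1, 101, size=n)
    y_true = 6.0 * x + rng.normal(0.0, np.sqrt(36000), size=n)   # medium-R2 scenario

    for false_rate in (0.0, 0.5, 0.9):
        y = y_true.copy()
        k = int(false_rate * n)
        idx = rng.choice(n, size=k, replace=False)
        y[idx] = y_true[rng.permutation(idx)]     # attach these X's to the wrong Y's
        slope, intercept = np.polyfit(x, y, 1)
        r2 = np.corrcoef(x, y)[0, 1] ** 2
        print(f"false-match fraction {false_rate:.0%}: slope {slope:.2f}, R2 {r2:.2f}")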

Regression After First Combined RL ⇒ RA ⇒ EI ⇒ RA Step

Figure 2c completes our display of the first cycle of the iterative process we are employing. Here we have edited the data in the displayed plot as follows. First, using just the 99 cases with a match weight of 3.00 or larger, an attempt was made to improve the poor results shown in Figure 2b. Using this provisional fit, predicted values were obtained for all the matched cases; then outliers with residuals of 460 or more were removed and the regression was refit on the remaining pairs. This new equation, used in Figure 2c, was essentially Y = 4.78X + ε, with an error variance of 40000. Using our earlier approach (Scheuren and Winkler, 1993), a further adjustment was made to the estimated beta coefficient, from 4.78 to 5.4. If a pair of matched records yielded an outlier, a predicted value (not shown) from the equation Y = 5.4X was imputed; if a pair did not yield an outlier, the observed value was used as the predicted value.

Figure 2c. 2nd Poor Scenario, 1st Pass, All False & 5% True Matches, Outlier-Adjusted Data, 1104 Points, beta=4.78, R-square=0.40

Second True Reference Regression

Figure 3a displays a scatterplot of X and Y as they would appear for the true matches obtained from a second RL step. Note that we have a somewhat different set of linked pairs this time, because we used the regression results to help in the linkage. In particular, the second RL step employed the predicted Y values determined above and hence had more information on which to base a linkage, so a different group of linked records was available after the second RL step. Since a considerably better link was obtained, there were fewer false matches.

Figure 3a. 2nd Poor Scenario, 2nd Pass, All False & 5% True Matches, True Data, High Overlap, 650 Points, beta=5.91, R-square=0.48

Hence our sample of all false matches and 5% of the true matches dropped from 1104 points in Figures 2a through 2c to 650 points in Figures 3a through 3c. In this second iteration, the true slope or beta coefficient and the true R2 value remained virtually identical to those of the first pass (5.85 v. 5.91 for the slope and 43% v. 48% for the fit).

Regression After Second RL ⇒ RA Step

In Figure 3b, we see a considerable improvement in the relationship between Y and X using the actual observed links after the second RL step. The estimated slope has risen from 2.47 initially to 4.75 here: still too small, but much improved. The fit is similarly affected, rising from 7% to 33%.

Figure 3b. 2nd Poor Scenario, 2nd Pass, All False & 5% True Matches, Observed Data, High Overlap, 650 Points, beta=4.75, R-square=0.33

Regression After Second Combined RL ⇒ RA ⇒ EI ⇒ RA Step

Figure 3c completes the display of the second cycle of our iterative process. Here we have edited the data as follows. Using the fit from subsection 4.5, another set of predicted values was obtained for all the matched cases (as in subsection 4.3). This new equation was essentially Y = 5.26X + ε, with an error variance of about 35000. If a pair of matched records yielded an outlier, a predicted value from the equation Y = 5.3X was imputed; if a pair did not yield an outlier, the observed value was used as the predicted value.

Figure 3c. 2nd Poor Scenario, 2nd Pass, All False & 5% True Matches, Outlier-Adjusted Data, 650 Points, beta=5.26, R-square=0.47
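For concreteness, here is a minimal sketch of one RL ⇒ RA ⇒ EI ⇒ RA cycle as we have described it above: fit a provisional regression on the high-weight pairs only, drop pairs with large residuals as likely false matches, refit, and impute the refitted predictions for the flagged pairs. The 3.00 weight cutoff and the 460 residual cutoff come from the first pass described above; the function name and data layout are illustrative, and the further slope adjustment of Scheuren and Winkler (1993) applied in the paper is omitted.

    import numpy as np

    def edit_imputation_cycle(x, y, weight, weight_cutoff=3.00, resid_cutoff=460.0):
        # (1) provisional fit on the high-weight pairs only
        high = weight >= weight_cutoff
        b1, b0 = np.polyfit(x[high], y[high], 1)
        # (2) flag pairs whose residual from the provisional fit is large;
        #     these are treated as likely false matches
        resid = np.abs(y - (b0 + b1 * x))
        keep = resid < resid_cutoff
        # (3) refit with the flagged pairs removed
        slope, intercept = np.polyfit(x[keep], y[keep], 1)
        # (4) impute the refitted prediction for the flagged pairs (the EI step)
        y_edited = np.where(keep, y, intercept + slope * x)
        return slope, intercept, y_edited

    # Hypothetical usage, where x and y are the linked values after an RL pass
    # and weight holds each pair's composite matching weight:
    #   slope, intercept, y_edited = edit_imputation_cycle(x, y, weight)

The edited values, or the predictions themselves, can then feed the next RL pass, as in the second-pass figures above.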

Additional Iterations

While we did not show it in this paper, we did iterate through a third matching pass. The beta coefficient, after adjustment, did not change much. We do not conclude from this that asymptotic unbiasedness exists; rather, that the method, as it has evolved so far, has a positive benefit and that this benefit may be reached quickly.

Further Results

Our further results are of two kinds. We looked first at what happened in the medium R2 scenario (i.e., R2 equal to 47%) for the medium- and low-file-intersection situations. We then looked at the cases in which R2 was higher (at 70%) or lower (at 20%).

For the medium R2 scenario and the low intersection case, the matching was somewhat easier. This occurs because there were significantly fewer false-match candidates, so we could more easily separate true matches from false matches. For the high R2 scenario, the modeling and matching were also more straightforward than they were for the medium R2 scenario; hence, there were no new issues there either.

On the other hand, for the low R2 scenario, no matter what degree of file intersection existed, we were unable to distinguish true matches from false matches, even with the improved methods we are using. The reason, we believe, is that many outliers are associated with the true matches. We can no longer assume, therefore, that a moderately high percentage of the outliers in the regression model are due to false matches. In fact, for each true match that is associated with an outlier Y-value, there may be many false matches with Y-values that are closer to the predicted Y-value than the true match is.

Comments and Future Study

Overall Summary

In this paper, we have looked at a very restricted analysis setting: a simple regression of one quantitative dependent variable from one file on a single quantitative independent variable from a second, matched file. This standard analysis was, however, approached in a very nonstandard setting. The matching scenarios, in fact, were quite challenging. Indeed, just a few years ago, we might have said that the "second poor" matching scenario appeared hopeless. On the other hand, as discussed below, there are many loose ends. Hence, the demonstration given here can be considered, quite rightly in our view, a limited accomplishment.

But make no mistake about it: we are doing something entirely new. In past record linkage applications, there was a clear separation between the identifying data and the analysis data. Here, we have used a regression analysis to improve the linkage, the improved linkage to improve the analysis, and so on. Earlier, in our 1993 paper, we advocated a unified approach between the linkage and the analysis. At that point, though, we were only ready to propose that the linkage probabilities be used in the analysis to correct for failures to complete the matching step satisfactorily. This paper is the first to propose a completely unified methodology and to demonstrate how it might be carried out.

Planned Application

We expect that the first applications of our new methods will be with large business data bases. In such situations, noncommon quantitative data are often moderately or highly correlated, and the quantitative variables (both predicted and observed) can have great distinguishing power for linkage, especially when combined with name information and geographic information, such as a postal (e.g., ZIP) code.
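One concrete way to let predicted quantitative values contribute to the linkage, which is how we read the second RL step above and the planned business-list application, is to add a quantitative-agreement term to the usual composite matching weight. The sketch below is only a schematic of that idea, not the matcher actually used in this work: the function names, the standardized-residual form of the agreement term, and the scale parameter are all our own assumptions.

    def quantitative_agreement(y_observed, y_predicted, resid_sd):
        # Score how close a candidate pair's observed Y is to the value the
        # regression model predicts for it: 0 for perfect agreement, increasingly
        # negative as the standardized residual grows.
        z = (y_observed - y_predicted) / resid_sd
        return -0.5 * z * z

    def composite_weight(name_weight, y_observed, y_predicted, resid_sd, scale=1.0):
        # Add the quantitative term to the conventional name/address weight;
        # scale controls how heavily the quantitative information is trusted.
        return name_weight + scale * quantitative_agreement(
            y_observed, y_predicted, resid_sd)

Candidate pairs scored this way can then be re-ranked before the next reweighting and analysis step.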

A second observation is also worth making about our results. The work done here points strongly to the need to improve some of the now routine practices for protecting public use files from reidentification. In fact, it turns out that in some settings, even after quantitative data have been protected for confidentiality by conventional methods and without any directly identifying variables present, the methods in this paper can succeed in reidentifying a substantial fraction of records thought to be reasonably secure from this risk (as predicted in Scheuren, 1995). For examples, see Winkler (1997).

Expected Extensions

What happens when our results are generalized to the multiple regression case? We are working on this now, and results are starting to emerge that have given us insight into where further research is required. We speculate that the degree of underlying association, R2, will continue to be the dominant element in whether a usable analysis is possible.

There is also the case of multivariate regression. This problem is harder and will be more of a challenge. Simple multivariate extensions of the univariate comparison of Y values in this paper have not worked as well as we would like. For this setting, perhaps, variants and extensions of Little and Rubin (1987, Chapters 6 and 8) will prove to be a good starting point.

"Limited Accomplishment"

Until now, an analysis based on the second poor scenario would not have been even remotely sensible. For this reason alone we should be happy with our results. A closer examination, though, shows a number of places where the approach demonstrated is weaker than it needs to be or simply unfinished. For those who want theorems proven, this may be a particularly strong sentiment. For example, a convergence proof is among the important loose ends to be dealt with, even in the simple regression setting. A practical demonstration of our approach with more than two matched files is also necessary, although this appears to be more straightforward.

Guiding Practice

We have no ready advice for those who may attempt what we have done. Our own experience, at this point, is insufficient for us to offer ideas on how to guide practice, except the usual extra caution that goes with any new application. Maybe, after our own efforts and those of others have matured, we can offer more.

References

Alvey, W. and Jamerson, B. (eds.) (1997). Record Linkage Techniques—1997 (Proceedings of an International Record Linkage Workshop and Exposition, March 20–21, 1997, Arlington, VA).

Belin, T.R. and Rubin, D.B. (1995). A Method for Calibrating False-Match Rates in Record Linkage, Journal of the American Statistical Association, 90, 694–707.

Fellegi, I. and Holt, T. (1976). A Systematic Approach to Automatic Edit and Imputation, Journal of the American Statistical Association, 71, 17–35.

Fellegi, I. and Sunter, A. (1969). A Theory of Record Linkage, Journal of the American Statistical Association, 64, 1183–1210.

Jabine, T.B. and Scheuren, F. (1986). Record Linkages for Statistical Purposes: Methodological Issues, Journal of Official Statistics, 2, 255–277.

Jaro, M.A. (1989). Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida, Journal of the American Statistical Association, 84, 414–420.

Little, R.J.A. and Rubin, D.B. (1987). Statistical Analysis with Missing Data, New York: J. Wiley.

Newcombe, H.B.; Kennedy, J.M.; Axford, S.J.; and James, A.P. (1959). Automatic Linkage of Vital Records, Science, 130, 954–959.

Newcombe, H.; Fair, M.; and Lalonde, P. (1992). The Use of Names for Linking Personal Records, Journal of the American Statistical Association, 87, 1193–1208.

Oh, H.L. and Scheuren, F. (1975). Fiddling Around with Mismatches and Nonmatches, Proceedings of the Section on Social Statistics, American Statistical Association.

Scheuren, F. and Winkler, W.E. (1993). Regression Analysis of Data Files that are Computer Matched, Survey Methodology, 19, 39–58.

Scheuren, F. (1995). Review of Private Lives and Public Policies, Journal of the American Statistical Association, 90.

Scheuren, F. and Winkler, W.E. (1996). Recursive Merging and Analysis of Administrative Lists and Data, Proceedings of the Section on Government Statistics, American Statistical Association, 123–128.

Winkler, W.E. (1994). Advanced Methods of Record Linkage, Proceedings of the Section on Survey Research Methods, American Statistical Association, 467–472.

Winkler, W.E. (1995). Matching and Record Linkage, in B.G. Cox et al. (eds.), Business Survey Methods, New York: J. Wiley, 355–384.

Winkler, W.E. and Scheuren, F. (1995). Linking Data to Create Information, Proceedings of Statistics Canada Symposium 95.

Winkler, W.E. and Scheuren, F. (1996). Recursive Analysis of Linked Data Files, Proceedings of the 1996 Census Bureau.

Winkler, W.E. (1997). Producing Public-Use Microdata That Are Analytically Valid and Confidential, paper presented at the 1997 Joint Statistical Meetings, Anaheim, CA.
