Chapter 11

Selected Related Papers, 1986–1997

Authors:

Howard B. Newcombe, Consultant, and Martha E. Fair and Pierre Lalonde, Statistics Canada

Matthew A. Jaro, System Automation Corporation

Nancy P. NeSmith, The Church of Jesus Christ of Latter-Day Saints

David White, Utah State University and Church of Jesus Christ of Latter-Day Saints

William E. Winkler, Bureau of the Census

Fritz Scheuren, Ernst and Young, LLP

Martha E. Fair, Statistics Canada

Latanya Sweeney, Massachusetts Institute of Technology



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.




Record Linkage Techniques—1997: Proceedings of an International Workshop and Exposition


The Use of Names for Linking Personal Records

Howard B. Newcombe, Consultant
Martha E. Fair and Pierre Lalonde, Statistics Canada

The skill of a human who searches large files of personal records depends much on prior knowledge of how the names vary in successive documents pertaining to the same individuals (e.g., as with ANTHONY-TONY, JOSEPH-JOE, WILLIAM-BILL). Now, an essentially exact procedure enables computers to make similar use of an accumulated memory of their own past experiences when searching for, and linking, records that relate to particular persons. This knowledge is further applied to quantify the benefits from various refinements of the rules by which the discriminating powers of names are calculated when they do not precisely agree or are substantially dissimilar. Of the six refinements tested, by far the most important is the recently developed exact approach for calculating the ODDS associated with comparisons of names that are possible synonyms.

KEY WORDS: Data base maintenance; File searching; Probabilistic linkage; Quantitative judgment; Record linkage.

Personal documentation in machine-readable form has become so extensive in any advanced society as to constitute, collectively, a detailed but highly fragmented life history for virtually all its members. The files exist to serve the needs of people and of society as a whole, and frequent access is involved. Much of the searching is necessarily based on names and personal particulars that are apt to be reported differently on successive documents for the same individuals. The problems are familiar to clerks, but now access by computer is becoming the norm. With automated searching, many choices are possible between refinements and simplifications in the way that names get compared.
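The accumulated knowledge of name variants that the paper describes can be pictured as a lookup over name pairs previously observed to refer to the same person. The sketch below is a minimal, invented illustration (the three pairs are taken from the examples above); it is not the authors' actual synonym file.

```python
# Sketch: classifying a pair of given names with an accumulated table
# of synonym pairs seen in previously linked records.  The table
# contents here are illustrative, not the authors' actual data.

SYNONYM_PAIRS = {
    frozenset({"ANTHONY", "TONY"}),
    frozenset({"JOSEPH", "JOE"}),
    frozenset({"WILLIAM", "BILL"}),
}

def compare_names(a, b):
    """Classify a comparison pair of given names."""
    a, b = a.upper(), b.upper()
    if a == b:
        return "exact agreement"
    if frozenset({a, b}) in SYNONYM_PAIRS:
        return "possible synonyms"
    return "disagreement"
```

In practice, as the paper goes on to show, each such pair would carry an ODDS value estimated from its frequency in linked versus random pairs, rather than a bare category.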
Rarely, however, have the merits of alternative approaches been quantified in terms of gains or losses of discriminating power, so as to reduce the guesswork when designing a system.

The potential for sophistication in automated comparisons of names is substantial. Humans develop special skills in recognizing nicknames, ethnic variants, diminutives, and corrupted forms due to truncations, misspellings, and typographical errors. This is known to be based on a relatively simple rationale, supported by remembered data. If a machine is to acquire similar ability, it too must rely on past experience (Newcombe, Fair, and Lalonde 1989; Newcombe, Kennedy, Axford, and James 1959). Although there is now an essentially exact way of measuring the discriminating powers of comparison pairs like CARL-KARL, GEORGE-GYORGY, JACOB-JAKE, JOHN-JACK, and WILLIAM-BILL, much clerical labor and large amounts of data are needed to set it up (Fair, Lalonde, and Newcombe 1990, 1991; Newcombe et al. 1989). Simpler comparisons are, therefore, likely to remain popular in many procedures that use names to access files.

Whether or not this exact approach becomes widely applied, its existence now provides a convenient standard against which to judge the performance of other treatments of names. So we have used the approach in this article to quantify the gains and losses of discriminating power due to various refinements and shortcuts commonly used in automated searching and linkage. The test is special to names as identifiers; is suitable for fine-tuning this component of a record linkage system; and is uninfluenced by the adequacy of the rest of the identifiers. It differs from, but is complementary to, more direct tests of overall performance.

1. COMPUTER LINKAGE

Where a computer is used to search large files of personal records and bring together the records for particular individuals, it may emulate with varying degrees of success the strategies of a human clerk who does the same job.
To determine whether a pair of records is correctly matched, the names are compared along with other identifiers (e.g., year, month, and day of birth; sex and marital status; and various geographic particulars such as place of birth, residence, work, or death). Sometimes, however, these comparisons point in different directions. The problem then is to determine, as in a court of law, where the preponderance of the evidence lies. The comparisons must be considered not only separately but also in combination.

A particular comparison outcome (e.g., JOHN-JOHN or JOHN-JACK) will argue for linkage when it is more common among correctly matched pairs than among random false matches. Conversely (as with JOHN-JOE), an outcome will argue against linkage when the opposite is the case. These likelihood ratios (or individual ODDS in favor of linkage) may be combined to assess the collective evidence from the full set.

But this is not the whole of the relevant information. In addition, a human clerk may recognize two further factors: the size of the file being searched and the likelihood that the individual is represented in it. Thus, when looking for a particular JOHN BROWN in the telephone directory for a small town where he is thought to reside, finding the name suggests that it may well belong to the right person. This would definitely not be so when searching a large national death register, especially if this JOHN BROWN were unlikely to have died.

Automated searches have from the outset used much the same reasoning as does a human clerk; this provides numerous options when calculating the ODDS for particular

* Howard B. Newcombe is a consultant, P.O. Box 135, Deep River, Ontario K0J 1P0, Canada. Martha E. Fair is Chief and Pierre Lalonde is Project Manager, Occupational and Environmental Health Research Section, Canadian Centre for Health Information, Statistics Canada, Ottawa, Ontario K1A 0T6, Canada.
The authors thank John Armstrong, Michael Eagen, and William E. Winkler for helpful critical comments on an early version of this article, and also the associate editors and referees who substantially influenced its final form.

identifiers. However, where the two factors described in the previous paragraph are wrongly overlooked, confusion arises and persists concerning the distinction between overall relative ODDS as opposed to absolute ODDS (in the sense of true “betting odds” that the record pair is correctly matched). Formal theory, which was introduced later, has as yet not dealt explicitly with the implications of this distinction. These matters are considered further in this article. Clerical searchers are not technical people. So, when describing the insights on which their success depends, it is best that plain language be used such as they would understand.

1.1 Brief History

The earliest probabilistic linkages were carried out more than three decades ago as a hands-on kind of experiment at the Chalk River Laboratories of Atomic Energy of Canada Limited in Ontario (Newcombe et al. 1959). This was a conscious attempt to observe and understand the stratagems of a perceptive human when confronted with pairs of records that might, or might not, relate to the same individual or family. This study found that people often compare identifiers in unexpected ways whenever there is a need for added discriminating power.

The circumstances surrounding this laboratory study were particularly favorable in that the records contained an abundance of personal identifiers. Birth registrations for the province of British Columbia were to be linked into sibship groups along with parental marriage records, under conditions of strict confidentiality. Identifiers for husbands and wives were present and included maiden surnames. S. J. Axford of Statistics Canada suggested that both parental surnames be phonetically (Soundex) coded and that the files be sorted by the male code followed by the female code. This created pockets (or “blocks”) within which other identifiers could be compared.
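The double-surname blocking scheme can be sketched as follows. This uses a simplified Soundex coder (the classic algorithm has further rules, e.g., for H and W, that this short version omits) and is not the coding actually used at Chalk River; the record fields are illustrative.

```python
# Simplified Soundex coding and double-surname blocking, sketching the
# pocket/"block" scheme described above.  The classic Soundex algorithm
# has extra rules (e.g., for H and W) omitted here; record fields are
# illustrative placeholders.
from collections import defaultdict

CODES = {}
for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")]:
    for ch in letters:
        CODES[ch] = digit

def soundex(name):
    """Keep the first letter, then append non-repeating consonant codes."""
    name = name.upper()
    out, prev = name[0], CODES.get(name[0], "")
    for ch in name[1:]:
        digit = CODES.get(ch, "")
        if digit and digit != prev:
            out += digit
        prev = digit
    return (out + "000")[:4]

def make_blocks(records):
    """Pockets keyed by the male surname code followed by the female code."""
    pockets = defaultdict(list)
    for rec in records:
        pockets[(soundex(rec["husband"]), soundex(rec["wife"]))].append(rec)
    return pockets
```

Spelling variants such as SMITH and SMYTHE receive the same code (S530), so records for the same family land in the same pocket despite reporting differences.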
(For details of the coding, see Newcombe et al. 1959 and Newcombe 1988.) For a preliminary test, Axford produced a modest listing of stillbirths spanning a number of years, arrayed in the double surname sequence. The visual impact was substantial and influenced much of the thinking that followed. Stillbirths repeat in families, and the discriminating power of the double surname codes ensured that most of the records in a single pocket would already be correctly grouped. Where more than one sibship was represented, appropriate separation was indicated by the parental initials, provinces or foreign countries of birth, and their ages when adjusted for the intervals between events. A visual scan revealed how often various identifiers agreed or disagreed in correctly matched pairs (LINKS), and it was not difficult to determine the corresponding likelihoods for a control group of falsely matched random pairs (NONLINKS). The linkage rationale first emerged from this small manual test with the stillbirth records. As a broad generalization, any outcome from the comparison of any identifier will argue for linkage if it is more typical of the LINKS and against linkage if it is more typical of the random NONLINKS. The identifier might be a surname, given name, initial, some part of the date of birth, or perhaps the place of birth; the comparison outcome might be an agreement, disagreement, some specified level or kind of similarity or dissimilarity, or any other comparison outcome no matter how defined. The reasoning holds even where agreements argue against linkage (as with two stillbirth records that both have birth order = 1), and even where two different identifiers are compared (as when the birth of a fifth child seems unlikely out of the first year of a marriage). Without exception, both the direction and the strength of the evidence are indicated by the likelihood ratio. 
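The rationale above reduces to a single ratio per comparison outcome. As a toy illustration with invented counts (not data from the stillbirth study):

```python
# Toy likelihood-ratio calculation: an outcome argues for linkage when
# it is more common among LINKS than among random NONLINKS.  All counts
# are invented for illustration.

def likelihood_ratio(n_outcome_links, n_links, n_outcome_nonlinks, n_nonlinks):
    """P(outcome | LINK) / P(outcome | NONLINK)."""
    return (n_outcome_links / n_links) / (n_outcome_nonlinks / n_nonlinks)

# An agreement can still argue against linkage, as with two stillbirth
# records that both show birth order = 1:
r = likelihood_ratio(25, 100, 500, 1000)  # 0.25 vs 0.50 -> ratio 0.5
# r < 1, so this agreement argues (weakly) against linkage.
```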
There appear to be no constraints limiting how a comparison outcome may be defined.

This linkage rationale next was applied clerically on a larger scale to searches for parental marriage records, initiated by British Columbia birth records. When the rationale had been shown to work in a manual simulation of an automated procedure, J. M. Kennedy was asked if he could program the same steps for Chalk River's first-generation (Datatron) computer. The automated separation of LINKS from NONLINKS likewise left only a small proportion of doubtful pairings (Newcombe et al. 1959).

A decade later, the linkage rationale was restated and expanded into a formal mathematical theory by Fellegi and Sunter at Statistics Canada (Fellegi 1985; Fellegi and Sunter 1969; Sunter 1968). These authors confirmed what had not been rigorously proved earlier and chose for purposes of illustration such simple outcome definitions as disagreement (nonspecific for value) and agreement (with or without recognition of the value of the identifier; for example, “name agrees and has any value” or “name agrees and the value is JOHN”). Thus the theory is best viewed not so much as a blueprint for linkage, but more as a framework within which many options are possible. Its existence does not make the human strategies less relevant or the intimate contact with the files less important.

Meanwhile, independent of the formal theory and prior to it, linkage practice and manual testing resulted in refinements of a different sort. The aim was to make maximum use of any conceivable source of discriminating power in the available identifiers. Because simple agreement and disagreement outcomes are grossly wasteful in many situations, more sophisticated comparison procedures were developed. Pairs of initials that disagreed on straight comparison were now routinely cross-compared to pick up instances of inversion.
Near agreements in the birthdate components (e.g., discrepancies of 1, 2, 3, and so forth days, months, or years) were now grouped into multiple levels. Also, unusual comparisons were being made (as between the birth order of a child and the duration of the marriage). All of these refinements served to exploit hidden discriminating power (Newcombe and Kennedy 1962). Other refinements were tested by measuring the benefits from using additional identifiers (e.g., parental ages), multiple alternative file sequences for blocking, and a coefficient of specificity to identify the “best” sequence when relying on just one (Newcombe 1967; Smith and Newcombe 1975, 1979). From the beginning, frequent close scrutiny of difficult matches provided insights that would have been missed had refinement been sought through

theory alone (Newcombe and Kennedy 1962; Newcombe et al. 1987). (The need for practice and theory to complement each other is discussed elsewhere; see Scheuren, Alvey, and Kilss 1986; Winkler 1989b.)

When Statistics Canada first actually used probabilistic linkage in the early 1980s, based on the Fellegi-Sunter theory, it was to search the newly established Canadian mortality data base, which extended back to 1950. Their linkage system, known as CANLINK or GIRLS (for Generalized Iterative Record Linkage System), included innovations described by Howe and Lindsay (1981), Hill (1981), and Hill and Pring-Mill (1985). In particular, a preliminary linkage step was introduced that temporarily ignored specific values of names, thereby eliminating in a simple fashion many unpromising record pairs, and an iterative update of the outcome frequencies from LINKED pairs of records was used. The preliminary step was needed because the death files were now blocked by just a single surname (as a NYSIIS phonetic code; see Appendix H of Newcombe 1988), and the blocks were larger than those based on pairs of surnames for family linkage. The iterative updates were required because to get new linkage jobs started, outcome frequencies from earlier linkages were often used initially and replaced later with increasingly appropriate data as the new files of LINKS were progressively improved. (The effect of omitting this update is considered in Sec. 3.4.) A further intended refinement, recognition of partial agreements of names (like THOMAS-TOM), was less successful; as a result, modified procedures had to be devised (Eagen and Hill 1987; Fair et al. 1990, 1991; Newcombe 1988; Newcombe et al. 1987, 1989; Winkler 1985, 1989a). The matter is referred to again in Section 3.5.
Howe and Lindsay (1981) also recognized explicitly, for the first time, the concept of the prior odds or prior likelihood but failed to apply it to create a scale of absolute ODDS that might be used for setting thresholds. Earlier, two thresholds had been proposed as part of the Fellegi-Sunter theory to distinguish positive links and positive nonlinks, plus an intermediate category of ambiguous matches called possible links. The thresholds were to be calculated in advance as “error bounds” that would limit the numbers of false-positive and false-negative links and would identify pairs in need of special assessment. But when the ODDS from the full sets of identifiers were combined, it was found that the resulting overall ODDS served only to array the record pairs, relative to one another, in descending order of the likelihood of a correct match. Thus, in practice, the two thresholds got assigned subjectively. On the scale of relative ODDS available at the time, they fell high above the crossover or 50/50 odds point (e.g., in the case of the death searches by a factor of well over 1 million, and greater than the size of the file being searched).

An empirical conversion to a scale of presumed absolute ODDS indicated why. When allowance was made for the size of the death file, 1/N(File B), and for the proportion of search records that find a matching death record in it, N(A | LINK)/N(File A), the new scale brought the subjective thresholds close to the crossover or 50/50 odds point. Together, these two factors were taken to represent the prior likelihood of a correct match on a single random pairing (i.e., before examining any identifier or blocking information). The new scale of absolute ODDS was controversial at first, although the results were consistently believable over many empirical tests, whereas those from the alternative were not. Later, it was shown to use just a variant of the prior odds, P(LINK)/P(NONLINK), already recognized by Howe and Lindsay (1981).
The implications are substantial but were not explored by those authors (see Secs. 2.3 and 3.1 and Fig. 1). In practice, however, it was soon found that the concept of the prior likelihood could be applied with great flexibility in many ways. For example, as a refinement it was calculated separately for subsets with differing prior likelihoods (see Newcombe 1988, chap. 28 and apps. B and D.3). What refining the practice achieved, as distinct from formal theory, was enhanced flexibility in the access to discriminating power. Individual identifiers were compared freely, just as a human might do when seeking clues to the true linkage status of a record pair; and the prior likelihood of a correct match, in the case of a death search, was exploited to take into account the age of the individual in a given year, and the actuarial likelihood that he or she might have died in that year. For linkages of cancer records with death files, the approach even used survival curves appropriate to particular diagnoses. The practices are fully described, but in nontechnical language for those working close to the files, who design, implement, and test the detailed procedures (see, for example, Newcombe 1988, sec. 28.2 and apps. D.2 and D.3). This is the technological setting within which the current study has been carried out.

1.2 General Method

Any formal statement of the comparison procedure for individual identifiers should allow for the flexibility that exists in practice. This is especially true of names when they do not precisely agree (e.g., as allowing recognition of the comparison DANIEL-DANNY). Moreover, because some kind of grouping of possible synonyms is inevitable, this too must be exceedingly flexible if discriminating power is not to be wasted (Scheuren 1985).
We will deal first with formal expressions that permit flexibility when estimating likelihood ratios (or ODDS in favor of linkage as indicated by particular comparisons), and second with grouping under conditions of minimum constraints. (Other accounts use logarithms of the likelihood ratios and refer to them as “weights.” The ratios may also be viewed as factors by which comparisons of particular identifiers raise or lower the overall “betting odds” in favor of linkage.)

Conceptually, each first given name on one file is compared with every first given name on the other file, and second given names are likewise compared. Generally, LINKED pairs (of names or records) are vastly outnumbered by possible NONLINKED pairs, i.e., actual plus potential. (This concept is fundamental and is not altered by “blocking” that reduces the actual numbers of comparison pairs; see Fellegi 1985.) Although LINKS and NONLINKS are thought of as uncontaminated with pairs of the opposite kind, modest admixtures have only slight effects on the ODDS.

When comparing value Ax from a Record A (which is used to initiate a search) with value By from a Record B (which is in the file being searched), the ODDS in favor of a correct LINK associated with outcome Ax·By (i.e., the comparison pair of values) may be written in terms of the relative probability of occurrence of the particular outcome in LINKS as compared with NONLINKS; that is,

ODDS = P(Ax·By | LINK)/P(Ax·By | NONLINK). (1.1)

But except where files A and B are both very small, the denominator in this expression will be closely approximated by P(Ax) · P(By), because any fortuitous LINKS in the random pairs will be vastly outnumbered by the NONLINKS. Thus the expression may be converted to

ODDS = P(Ax·By | LINK)/[P(Ax) · P(By)]. (1.2)

This implies that we need to know in advance the number of LINKS with values Ax and By. In practice crude approximations are estimated initially from sample linkages carried out manually or from previous linkage studies and are revised iteratively as the current LINKS are progressively refined.

An expanded form of this procedure is sometimes used to support an existing practice in the case of death searches. This involves ignoring the frequency of value Ax, both in File A and in the LINKS, on the grounds that names are unlikely to be strongly correlated with the probability of death and with whether a Record A is LINKED to a Record B. Justification depends on the magnitude of the error introduced by the assumption. The expanded version has two parts:

ODDS = [P(By | LINK)/P(By)] · [P(Ax·By | LINK)/(P(Ax) · P(By | LINK))]. (1.3)

Current practice views the second part (the “correction factor”) as approximating unity, so it can be ignored, except where the assumption is thought to be seriously misleading (as it might be if ethnicity and ethnic names were correlated with mortality).
What the relative probabilities fail to do is indicate explicitly how the ODDS should be calculated using data that are in short supply. Examples include outcome values Ax·By that are represented only once or twice in an available real file of LINKS and, especially, numerous other outcome values representing pairs of possible synonyms that have not actually occurred in the available LINKS but probably would occur if that file were larger. Because crucial steps in the reasoning have to do with numbers of outcome values, as distinct from their likelihoods, it is helpful to convert the last two expressions to a form actually used to obtain estimated relative probabilities, as

ODDS = [N(Ax·By | LINK)/N(LINKS)] / {[N(Ax)/N(A)] · [N(By)/N(B)]} (1.4)

and

ODDS = {[N(By | LINK)/N(LINKS)] / [N(By)/N(B)]} · {N(Ax·By | LINK) · N(A) / [N(Ax) · N(By | LINK)]}, (1.5)

where the general term N(* | LINK) represents the number of records among LINKED pairs that have attribute (*), N(LINKS) = number of linked pairs, N(A) = number of records in File A, N(B) = number of records in File B, N(Ax) = number of records in File A with value x, and N(By) = number of records in File B with value y. (For the origins of this version, see Newcombe et al. 1989.)

It is convenient to retain the distinction between a search file (File A) and a file being searched (File B), even though conceptually the roles could be reversed. For one thing, the search file usually is smaller than the file being searched. Also, the distinction has special significance for the death searches, because informal versions of a given name (e.g., nicknames) are more commonly used by employers and others while one is alive rather than by undertakers after one has died.
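The count-based estimate can be written directly in code. The function follows the frequency form described in the text; all counts in the example are invented for illustration.

```python
# Count-based estimate of the ODDS for a specific comparison pair of
# values Ax and By, following the frequency form described in the
# text.  All counts are invented for illustration.

def odds_from_counts(n_axby_link, n_links, n_ax, n_a, n_by, n_b):
    """[N(Ax·By | LINK)/N(LINKS)] / ([N(Ax)/N(A)] · [N(By)/N(B)])."""
    p_outcome_in_links = n_axby_link / n_links
    p_ax = n_ax / n_a  # frequency of value x in File A
    p_by = n_by / n_b  # frequency of value y in File B
    return p_outcome_in_links / (p_ax * p_by)

# e.g., WILLIAM on the search record against BILL on the file searched:
odds = odds_from_counts(n_axby_link=40, n_links=10_000,
                        n_ax=200, n_a=10_000, n_by=150, n_b=100_000)
# The pair occurs far more often among LINKS than chance predicts,
# so the outcome argues strongly for linkage (odds well above 1).
```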
Here we need to introduce two concepts related to the ways in which the range of possible outcomes may be handled:

- Grouping or “pooling” of similar values of Ax·By, which individually are represented poorly or not at all in the available LINKS (the “quantity” problem)

- Increasing sacrifice of discrimination as the within-group heterogeneity grows when its definition is broadened to ensure representation in the LINKS (the “quality” problem).

A tradeoff between “quantity” and “quality” is unavoidable. The definition of an outcome group needs to be broad enough so that N(Ax·By | LINK) is represented by at least one comparison pair. Otherwise, no ODDS can be calculated. But as the definition is widened to increase the representation, it will also let more heterogeneity into the group. (Thus as the error due to statistical fluctuation diminishes, so the error due to lessened specificity increases.)

The earliest linkage operations simplified matters by recognizing just two categories of outcome—agreements and disagreements—and by attributing specificity for value only to the former category. But major errors arose from an unsuccessful attempt to adapt the earlier procedures to recognize “partial agreements” such as JOSEPH-JOE (Newcombe et al. 1987). (The term “partial agreement” is commonly applied, for reasons of convenience, to any possible synonyms regardless of similarity, as with ELIZABETH-BETTY.) The problem posed by the value-specific partial agreements of names may be handled in various ways, but only one of

these appears to be precise. A compromise solution, now in routine use, is based on the numbers of early characters that agree. ODDS are first calculated for different levels of agreement (i.e., one, two, three, four or more agree); actual values are ignored at this stage. Such “global ODDS” are later adjusted upward or downward, depending on whether the particular values of the agreement portions are rare or common (Eagen and Hill 1987; Newcombe 1988; Newcombe et al. 1987), but this neglects the values of the disagreement portions (e.g., it wrongly treats diverse name pairs like JOHN-JONATHAN and JOHN-JOSEPH as equally likely to be synonyms). An alternative approach that recognizes phonetic components common to the two names has also been developed (Winkler 1985, 1989a).

A precise treatment of partial agreements of names recognizes both values in a comparison pair and avoids resorting to globally defined (i.e., value-nonspecific) levels of agreement. This permits it to deal with outwardly dissimilar comparison pairs (e.g., EDWARD-TED, MARGARET-PEGGY). Any necessary groupings must be defined in value-specific ways. The frequency with which the two values are related by actual usage then determines the magnitude of the precise ODDS. A modest manual test showed that the approach worked where sufficient data from LINKED pairs of records could be made available (Newcombe et al. 1989). That was followed by an expanded application based on an accumulated composite file of LINKS from many past searches of the Canadian mortality data base (Fair et al. 1990, 1991). This refinement will be considered further in Section 3.5.

(The current emphasis on flexibility also extends to other identifiers that are apt to be reported differently on separate occasions or that may change over time, as with MARITAL STATUS, OCCUPATION, INDUSTRY, and PLACES OF RESIDENCE, WORK, and DEATH.
For these, there likewise is no need to prejudge in which direction the comparisons will argue. “Agreement” and “disagreement” are often poor indicators, but the ODDS—when they have been calculated—will decide.)

1.3 Combining the ODDS

When the likelihood ratios or ODDS for particular identifiers are combined over the full set in a record pair, it is usual to assume as a tolerable approximation that the identifiers are independent of one another. The overall absolute ODDS (in the sense of “betting odds” in favor of linkage) may then be represented by

Absolute ODDS = R1 · R2 · … · Rn · P(LINK), (1.6)

where R1 to Rn are the likelihood ratios (ODDS) for identifiers 1 to n (including any used for blocking) and are independent of each other, and P(LINK) is the prior likelihood of a correct match on a single random pairing. The latter term is similar to the prior odds, P(LINK)/P(NONLINK), recognized but not used by Howe and Lindsay (1981). Confusion remains concerning the implications, which are not explicitly addressed by existing formal theory (see Sec. 2.1).

The version of this expression used to calculate estimated absolute ODDS from actual counts is unfamiliar to many, so it is necessary to be explicit: R1 to Rn become frequency ratios, and P(LINK) becomes N(LINKS)/N(LINKS + NONLINKS). Because each linked pair contains one record from File A and one from File B, N(LINKS) = N(A | LINK) = N(B | LINK). Also, where each record on File A is compared in succession with every record on File B, the total number of comparison pairs, regardless of their linkage status, will together equal the product of the two file sizes; that is, N(LINKS + NONLINKS) = N(File A) · N(File B). The concept is valid even where, in practice, only the pairings that occur within blocks are actually seen; but this implies that likelihood ratios for blocking identifiers will be taken into account.
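Combining the frequency ratios with the count-based prior might look like the sketch below; the ratios and file sizes are invented for illustration.

```python
# Combining per-identifier frequency ratios with the prior likelihood
# to obtain absolute ("betting") ODDS.  The ratios and file sizes are
# invented for illustration.
from math import prod

def absolute_odds(ratios, n_links, n_file_a, n_file_b):
    """R1 · R2 · … · Rn · P(LINK), with P(LINK) = N(LINKS)/(N(A) · N(B))."""
    prior = n_links / (n_file_a * n_file_b)
    return prod(ratios) * prior

# Ratios for, say, surname, given name, and birth year comparisons:
odds = absolute_odds([3000.0, 40.0, 25.0],
                     n_links=50_000, n_file_a=100_000, n_file_b=1_000_000)
# Absolute odds near 1 sit at the 50/50 crossover point.
```

Note how a large relative product (here 3,000,000) is cut down by the prior: the same evidence that looks overwhelming on the relative scale can sit right at the crossover once file size and match rate are taken into account.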
Thus by substitution we may obtain

Absolute ODDS = R1 · R2 · … · Rn · N(A | LINK)/[N(File A) · N(File B)]. (1.7)

Howe and Lindsay (1981) had felt that their prior odds, P(LINK)/P(NONLINK), could not be readily estimated. The solution came to us by observing human stratagems and through reasoning based on counts rather than on probabilities. At first, it was hard to persuade others that this practice is valid, perhaps because our way of thinking was unconventional (David Binder and Geoffrey Howe, personal communication, November 10 to December 11, 1982). A further possible reason might be the common custom of not calculating frequency ratios for blocking identifiers; but then N(A) and N(B) would represent the sizes of Files A and B within the particular block, and the prior likelihoods would differ from block to block.

Calculation (1.7) has been used over the past decade for searches of Canadian death files. The application is exceedingly flexible and allows refinement through redefinition of Files A and B to represent, separately, a multiplicity of subsets (based on age, death year, selected diagnoses, and so on) of populations that are internally heterogeneous. (For details, see Newcombe 1988, chap. 28 and apps. B and D.2.)

2. EMPIRICAL DISTRIBUTIONS OF LINKS AND NONLINKS

A feedback of empirical data from the LINKS and NONLINKS is the most basic requirement of a linkage system. For example, the expressions by which the ODDS for the individual identifiers are calculated require these data as input. Also, such data are needed when assessing errors due to assumptions that are not strictly correct. Above all, direct observation of individual record pairs often yields clues to more suitable comparison steps. These clues are most likely to become apparent to humans when resolving difficult matches manually.
An experienced person can be less bound by artificial constraints than the automated system, and he or she is still, given existing linkage systems, in a better position to be guided by memories of past encounters with similar problems. Theoretical papers on linkage make strong assumptions to get results, and linkage practice does the same to simplify

OCR for page 333
Record Linkage Techniques—1997: Proceedings of an International Workshop and Exposition

procedures. Examples include the use of artificially simplified ways of comparing names, which may not adequately exploit their true discriminating power, and the practice of simply multiplying the ODDS for individual identifiers to combine them for a whole set, which would be strictly proper only if they were independent of each other (Fellegi and Sunter 1969; Howe and Lindsay 1981). Only with better data from LINKS and NONLINKS can many of the uncertainties be resolved. Recognition of this has led, in part, to the idea of accumulating large files of LINKS and creating even larger files of NONLINKS (see, for example, Fair et al. 1990, 1991; Lalonde 1989; Newcombe et al. 1989). It has also emphasized the use of additional evidence on the true linkage status of record pairs assigned borderline absolute ODDS in an automated operation (Fair, Newcombe, and Lalonde 1988a; Fair, Newcombe, Lalonde, and Poliquin 1988b). We will deal first with the latter point.

2.1 The Assumption of Independence

Calculated overall “absolute ODDS” usually assume that the components in the identifier sets are independent of each other. Rarely is this assumption strictly correct. It can be seen to be misleading when scanning visually for record pairs that were wrongly classed as positive LINKS and positive NONLINKS. Our unpublished observations include examples of multiple agreements (e.g., of rare ethnic names and related places of birth) that have spuriously raised the ODDS to create false positives. Conversely, there are examples of multiple disagreements (especially on year, month, and day of birth—perhaps due to multiple wrong guesses by an informant at the time of a death), which have spuriously lowered the ODDS to create false negatives.
The effects of these and other such biases are best visualized in the overlap between the numbers of verified LINKS and NONLINKS, when distributed along a scale of absolute ODDS that assumes independence, as in Figure 1 (data of Fair et al. 1988a, 1988b; and Lalonde 1986). We will refer to points on this scale as “theoretical” ODDS to distinguish them from the “empirical” ODDS, which are the ratios of observed counts of LINKS/NONLINKS at various points on the same scale. (Total LINKS and NONLINKS are not shown in the Figure; but conceptually the latter vastly outnumber the former.) In practice there is no need to actually create the bulk of the possible NONLINKS, because most would fall so very low on the scale. Major misunderstanding arises, however, when the enormous preponderance of actual plus potential NONLINKS over LINKS is not kept in mind. Thus the distributions and their crossover points serve little purpose if plotted as proportions of LINKS compared with proportions of NONLINKS. Likewise, upper and lower “error bounds,” when expressed in such terms, make nonsense of the concept.

Figure 1. Overlapping Parts of the Distributions of LINKS and NONLINKS, on a Scale of Theoretical ODDS (Lalonde 1986). Note that empirical error bounds (broken lines), set at the 1% levels, are displaced upward on the theoretical scale.

(The data in Figure 1 are from searches of 1,300,000 death records, initiated by 30,000 work records, yielding 2,254 LINKS; the vital status of doubtful pairs was confirmed using taxation files. Because the number of possible pairings, i.e., actual plus potential, is the product of the two file sizes, NONLINKS outnumber LINKS by 17,000,000 to 1.) Marked discrepancies are revealed in Figure 1 between the theoretical ODDS scale, based on the assumption of independence, and the corresponding observed ratios of LINKS versus NONLINKS. For example, where the theory indicates that the ODDS in favor of linkage are 1/1, in reality they are only 1/6; and where the observed ODDS are 1/1, the theory says that they should be 16/1. Moreover, if one wants to set lower and upper thresholds to limit the number of LINKS wrongly classed as “positive nonlinks” to 1% of all LINKS, and to likewise limit the NONLINKS wrongly classed as “positive links” to a similar number (i.e., 1% of the LINKS), the correct thresholds would be represented by theoretical ODDS of approximately 1/4 and 2,000/1. Thus the true error bounds are displaced upward on a scale of ODDS that assumes independence.

There has been confusion in the past, which is best avoided by thinking in terms of numbers (i.e., counts) as distinct from proportions. One does not limit false positives to 1% of NONLINKS because, in our example, that would create 17,000,000 times as many false positives as false negatives. Indeed, the Fellegi-Sunter theory emphasizes that NONLINKS typically will greatly outnumber LINKS; for example, see slides #9 and #10 of Fellegi 1985.
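The count-based rule just described (both error bounds expressed as a fraction of the number of LINKS, never of the NONLINKS) might be sketched as follows; the score lists are hypothetical stand-ins for the verified pairs behind Figure 1:

```python
def thresholds_by_counts(link_scores, nonlink_scores, fraction=0.01):
    """Choose lower and upper ODDS thresholds so that the LINKS wrongly
    classed as positive nonlinks, and the NONLINKS wrongly classed as
    positive links, are each limited to roughly `fraction` of the NUMBER
    of LINKS (a count in both cases, never a proportion of NONLINKS)."""
    links = sorted(link_scores)
    nonlinks = sorted(nonlink_scores)
    allowed = int(fraction * len(links))      # e.g., 1% of the LINKS
    lower = links[allowed]                    # scores below: positive nonlink
    upper = nonlinks[-(allowed + 1)]          # scores above: positive link
    return lower, upper

# 100 verified links and 1,000 sampled nonlinks with made-up scores:
lower, upper = thresholds_by_counts(range(100), [s / 10 for s in range(1000)])
print(lower, upper)
```

Because `allowed` is counted against the LINKS in both directions, the upper threshold climbs far up the nonlink distribution, reproducing the upward displacement of the error bounds seen in Figure 1.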
More explicitly, where this is the case “no one could possibly conclude” that the two error bounds would be properly set at equal proportions (i.e., 1%) of the LINKS and of the NONLINKS (I. P. Fellegi, personal communication, July 8, 1987).

2.2 Data on Name Comparisons Involving Synonyms

Value-specific information to do with N(Ax · By | LINK), heretofore lacking in quantity, is contained in a composite file of 64,937 LINKED pairs of male given names derived from 26 linkage projects. All of the projects involved searches of the Canadian Mortality Data Base (File B, containing 3,397,860 male given names), initiated by records of various study cohorts, including employment records, survey responses, cancer registrations, birth records, and entries in a national radiation dose register (composite File A, containing 1,574,661 male given names). (For details, see Fair et al. 1990, 1991.)

The data used in the current study are from the LINKED pairs of names containing any of the 51 most common given names in the death file or any of the 15 most common informal variants. These names are listed in Table 1, together with their counts and percentage frequencies in the death file. The 51 common names account for more than half (1,842,327/3,397,860) of all given names in the death records of males. Among 64,937 LINKED pairs of male given names, they were present 33,183 times on the Records A (25,673 as first names and 7,510 as second names) and 33,988 times on the Records B (26,536 as first names and 7,452 as second names), for a total of 67,171 times. A name pair that partially agrees may occur in either of two configurations, e.g., as FRANK-FRANCIS or as FRANCIS-FRANK, depending on which value comes from File A and which value comes from File B. Where two or more of the 51 names get interchanged with each other (as happens with HARRY, HENRI, and HENRY), some of the same information may be duplicated in a slightly different form within the tables.

The 15 common informal variants represent less than 1% (28,164/3,397,860) of all given names in the death records of males. Among the 64,937 LINKED pairs of male given names, these were present 1,554 times on the Records A (1,426 as first names and 128 as second names) and 701 times on the Records B (633 as first names and 68 as second names), for a total of 2,255 times.

Table 1. Common Male Given Names From the Canadian Death File, 1950–1977

Formal Names
Rank  Name*          Number    Percent
  1   JOHN           187,486    5.30
  2   WILLIAM        170,669    4.83
  3   JAMES          111,513    3.16
  4   JOSEPH         104,767    2.96
  5   GEORGE          95,188    2.69
  6   CHARLES         70,040    1.98
  7   ROBERT          66,575    1.88
  8   THOMAS          64,182    1.82
  9   HENRY           55,718    1.61
 10   EDWARD          55,837    1.68
 11   ARTHUR          52,221    1.48
 12   ALBERT          47,660    1.35
 13   ALEXAND(ER)     38,343    1.09
 14   FREDERI(CK)     36,864    1.04
 15   DAVID           33,530     .95
 16   ERNEST          32,041     .91
 17   ALFRED          30,902     .87
 18   FRANK           29,376     .83
 19   PAUL            26,919     .76
 20   PETER           26,889     .76
 21   WALTER          26,718     .76
 22   HARRY           24,830     .70
 23   MICHAEL         24,645     .70
 24   RICHARD         24,070     .68
 25   LOUIS           23,860     .68
 26   JEAN (male)     22,661     .64
 27   FRANCIS         21,596     .61
 28   HAROLD          21,588     .61
 29   GORDON          19,158     .54
 30   HERBERT         19,133     .54
 31   SAMUEL          18,927     .54
 32   ANDREW          18,440     .52
 33   DONALD          17,416     .49
 34   DANIEL          16,076     .46
 35   STANLEY         14,575     .41
 36   PATRICK         13,402     .38
 37   NORMAN          13,270     .38
 38   ROY             12,943     .37
 39   RAYMOND         12,338     .35
 40   EMILE           12,261     .35
 41   HENRI           12,107     .34
 42   KENNETH         12,076     .34
 43   DOUGLAS         11,843     .34
 44   LEONARD         10,978     .31
 45   EUGENE          10,968     .31
 46   VICTOR          10,797     .31
 47   GEORGES         10,446     .30
 48   ALLAN           10,384     .29
 49   LEO             10,200     .30
 50   EDWIN           10,156     .29
 51   CLARENC(E)       9,974     .28

Informal Variants
Rank  Name           Number    Percent
  1   FRED             7,947     .23
  2   JACK             5,575     .16
  3   ALEX             3,550     .10
  4   MIKE             3,267     .10
  5   SAM              2,014     .06
  6   RAY              1,911     .056
  7   TOM                990     .029
  8   JOE                866     .025
  9   DAN                781     .023
 10   BILL               314     .009
 11   PETE               265     .008
 12   DON                240     .007
 13   ANDY               220     .006
 14   DAVE               179     .005
 15   ED                  43     .001

Table 2. Pooling of Synonyms in Value-Specific Groups: Example Based on CHARLES Compared with KARL and Related Variants

Value of name*   Number in File B**
KARL             3,002
KARLA                1
KARLDON              1
KARLE                6
KARLEY               1
KARLHEI              2
KARLIE               1
KARLIOU              1
KARLIS              82
KARLMER              1
KARLO               36
KARLOFF              1
KARLOL               1
KARLOS               1
KARLS                2
KARLSEN              2
KARLSON              2
KARLSSO              1
KARLTON              1
KARLY                2

* Truncated at seven characters in the records of the Canadian mortality data base.
** Based on an alphabetic listing from the death file. Of these names, only KARL was actually interchanged with CHARLES in the linked pairs of records. However, the others, when compared with CHARLES, cannot be classed as full disagreements.

Table 3. Examples of Partial Agreements That Are Well Represented

Rank  Values (x – y)*      Total   N(Ax · By | LINK)   N(Bx · Ay | LINK)
  1   MICHAEL – MIKE        173          12                 161
  2   FREDERI – FRED        169          12                 157
  3   ALEXAND – ALEX        152          11                 141
  4   JOHN – JACK            90          23                  67
  5   FRANCIS – FRANK        73          19                  54
  6   JOSEPH – JOE           62           2                  60
  7   FREDERI – FREDRIC      52          28                  24
  8   ALLAN – ALLEN          47          28                  19
  9   HENRY – HENRI          44          40                   4
 10   SAMUEL – SAM           37           3                  34
 11   PETER – PETE           33           3                  30
 12   THOMAS – TOM           33           7                  26
 13   WILLIAM – WILLI        20          18                   2

* Truncated at seven characters in the LINKS of Fair et al. (1991).

Table 4. Examples of Partial Agreements That Are Not Well Represented

Values (x – y)*       Total observed
ALBERT – ALBERTO       1
ARTHUR – ARTIMUS       1
DOUGLAS – DOUGLES      1
ERNEST – ERNES         1
HAROLD – HARLOD        1
LEO – LEODA            1
PETER – PEDER          1
VICTOR – VIATEUR       1
ALBERT – ALBERTS       0
ARTHUR – ARTIMON       0
DOUGLAS – DOUGLIS      0
ERNEST – ERNE          0
HAROLD – HARLOE        0
LEO – LEODAS           0
PETER – PEDAR          0
VICTOR – VIATIAR       0

* Truncated at seven characters in the LINKS of Fair et al. (1991).

Application of the linkage rationale to outcomes defined in wholly value-specific ways depends on more than just the ODDS formula for its success. The chief obstacle is created by the many value pairs that are rare in the available LINKS, plus the even more numerous possible ones that have not been observed at all. Grouping is necessary, but must be based on wholly value-specific group definitions. The roles played in the process by Files A and B and the LINKS are illustrated in Tables 2–5. Group definitions are based on selected blocks of names in alphabetic listings, chosen to bring rare synonyms into the same groups with common forms (Table 2). Comparison pairs that are common in the LINKS present no special problem (Table 3).
However, possible pairs that are rare or absent in the available LINKS need to be grouped with others that are more common (Table 4). ODDS are calculated for specific name pairs and for specific groups as a whole, using expression 1.4 (Table 5). (For details see Fair et al. 1990, 1991.) There are no rules explicitly stating how the boundaries of the groups should be determined, except that variants known to yield widely different ODDS on their own should not be put into the same group. Apart from this, the process is unavoidably subjective—but it is far from entirely arbitrary. In particular, it is greatly aided by strong impressions gained while perusing alphabetical listings of names from Files A and B.

3. APPLICATION: REFINEMENTS AND SHORTCUTS

Many choices have had to be made in the past between shortcuts in the way the ODDS are calculated versus corresponding refinements in which the shortcuts are not used. Such choices are inescapable, but only rarely have their effects on the calculated ODDS been quantified. Indeed, where data to support the more refined alternative were lacking, the comparison often was not possible. But now the extensive data from large files of LINKS accumulated at Statistics Canada make it attractive to assess the effects on discriminating power when people's names are compared in alternative ways.
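Expression 1.4 is not reproduced in this excerpt; the sketch below therefore uses the standard value-specific likelihood ratio (frequency of the outcome among LINKS divided by its expected frequency among random pairings of the two files), which is the general form such a calculation takes. Counts for a pooled synonym group would be summed before calling; all numbers in the example are hypothetical:

```python
def value_specific_odds(n_pair_links, n_links, n_x_in_a, n_a, n_y_in_b, n_b):
    """ODDS for the specific outcome 'value x from File A paired with value
    y from File B': its relative frequency among the LINKS divided by its
    expected relative frequency among random pairings of the two files
    (the NONLINK approximation)."""
    p_given_link = n_pair_links / n_links
    p_given_random = (n_x_in_a / n_a) * (n_y_in_b / n_b)
    return p_given_link / p_given_random

# Hypothetical synonym group seen 150 times in the 64,937 LINKS, with the
# grouped values carried by 25,000 of File A's 1,574,661 name entries and
# 3,000 of File B's 3,397,860 name entries.
print(value_specific_odds(150, 64_937, 25_000, 1_574_661, 3_000, 3_397_860))
```

Pooling rare synonyms into a group simply adds their counts to both the numerator and the denominator terms, which is why the group definitions in Table 2 matter so much: a group mixing values with very different individual ODDS would blur the result.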
Table 5. Comparison Outcomes for the Given Name GEORGE, With Examples of Possible Groupings*

Values (x – y)                                       Total outcomes   ODDS
Full Agreement
GEORGE – GEORGE                                          3,130        89.7/1
Partial Agreement
GEORGE – GEO                                                 6        87.9/1
GEORGE – GEOR to GEORGDZ (including GEORDIE)                11        14.9/1
GEORGE – GEORGES                                            28        12.1/1
GEORGE – GEORGET to GEORGZ (including GEORGIO)               3        21.6/1
Other (including disagreements)
GEORGE – G* (* = other; few synonyms)                       16        1/5.6
GEORGE – non-G (full disagreements)                        175        1/13.2

* Data for Ax·By and Bx·Ay are pooled.

We consider here six shortcuts (and their corresponding refinements):

1. Use of the simplified formula (see expression 1.5);
2. Pooling of first and second given names, to reduce the number of look-up tables of the value-specific frequencies, N(By)/N(B), when using the simplified formula;
3. Use of a wholly versus a partially global term in the numerator of the simplified formula when calculating ODDS for the various levels of outcome (i.e., both Ax and By being nonspecific in the LINKS, versus Ax being specified as equal, successively, to each of the 51 common names);
4. Not updating the global ODDS;
5. Recognizing the specificities of just the agreement portions of names that only partially agree;
6. Pooling complementary partial agreements (e.g., Ax·By = MICHAEL-MIKE, plus Ay·Bx = MIKE-MICHAEL).

Past and current practices with regard to these shortcuts are reviewed elsewhere (Hill 1981; Howe and Lindsay 1981; Newcombe 1988). The importance of a given refinement as compared with its corresponding shortcut is assessed by comparing the ODDS when calculated in the two ways. The ratios of the two ODDS will be termed “error factors” or “correction factors.” These factors vary for different names as represented in File A (e.g., the given name JOHN) and for different comparison outcomes (e.g., JOHN-JACK). One such type of “correction factor” is defined in the second part of expression 1.5.
Its use as part of the full expression constitutes a refinement, its omission constitutes a shortcut, and its use on its own reveals the factor difference between the ODDS as obtained in the two ways. Comparisons between different refinement/shortcut choices may be based either on the frequency distributions of the error levels, as defined earlier, or on the median and maximum error factors. Sometimes a combination of the two may be appropriate. Data from the six types of comparisons are presented in Figure 2 (parts a to f) and Table 6 (lines 1 to 6). The histograms in Figure 2 are appropriately weighted throughout; for example, in part a of Figure 2 by the frequencies of the names in File A. The magnitudes of such error factors may vary with the particular name or linkage project, forming a distribution of error factors as shown in Figure 2. The log error factor approach, with base 2, is used in this Figure. (Log error factor = 1 indicates a difference by a factor of 2, log error factor = 2 indicates a difference by a factor of 4, and so on.) Because we are dealing with a spectrum of error factors and need to divide it into discrete levels, we have recognized central values of 1, 2, 4, 8, 16, and so on (equivalent to logs to the base 2 of 0, 1, 2, 3, 4, and so on). Standard rounding of the logs is used to assign the appropriate central values.

3.1 Ranking the Choices

The effect of choosing a shortcut, or its corresponding refinement, is best seen in a listing of the associated error factors in descending order. These create in the mind a compelling picture. What they teach us is that the feedback of actual data does away with the need for guesswork. For our current purposes it is sufficient that the results of the tests be summarized (Fig. 2, Table 6) and that examples be given. Use of the simplified formula, for example, results in error factors as high as 6.4, with 13% of the 34,737 comparisons associated with the four-fold level of error.
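The binning of error factors into central values of 1, 2, 4, 8, 16, and so on can be sketched directly (the function name is ours; factors below 1 are inverted first, since only the size of the discrepancy matters):

```python
import math

def error_factor_level(refined_odds, shortcut_odds):
    """Ratio of the two ODDS, assigned to its central value 2**k by
    standard rounding of the log to the base 2, as described in the text."""
    factor = refined_odds / shortcut_odds
    if factor < 1:
        factor = 1 / factor        # only the magnitude of the bias matters
    return 2 ** round(math.log2(factor))

print(error_factor_level(3.7, 1.0))   # log2(3.7) is about 1.89, rounds to 2
print(error_factor_level(1.0, 1.0))   # no discrepancy: level 1
```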
Nine of the 51 common names and 5 of the 15 informal names are involved (i.e., DOUGLAS, ERNEST, EMILE, FRANK, HAROLD, CLARENCE, ALFRED, HERBERT, HARRY, FRED, PETE, MIKE, SAM, ALEX). Similarly modest error factors result from pooling of first plus second names, use of a wholly global numerator, and pooling complementary partial agreements. In these examples the magnitudes of the error factors vary with the values of the given names. The effects of not updating the ODDS differ in that the error factors vary with the quality of the files used to initiate the death searches and, therefore, with the particular linkage study. Error factors are greater for the partial agreements than for the full agreements and disagreements, independent of the actual values of the names; for this reason, only the partial agreements are considered here. Again, the effects of the shortcut are modest. The largest are associated with search files (Files A) in which the quality of the identifiers differed most widely from the average; that is, were either much better
Table 2. —De-identified Data that Are not Anonymous

ZIP Code   Birthdate   Gender   Ethnicity
33171      7/15/71     m        Caucasian
02657      2/18/73     f        Black
20612      3/12/75     m        Asian

Most towns and cities sell locally collected census data or voter registration lists that include the date of birth, name and address of each resident. This information can be linked to medical microdata that include a date of birth and ZIP code, even if the names, social security numbers and addresses of the patients are not present. Of course, local census data are usually not very accurate in college towns and areas that have a large transient community, but for much of the adult population in the United States, local census information can be used to re-identify de-identified microdata since other personal characteristics, such as gender, date of birth, and ZIP code, often combine uniquely to identify individuals.

The 1997 voting list for Cambridge, Massachusetts contains demographics on 54,805 voters. Of these, birth date alone can uniquely identify the name and address of 12% of the voters. We can identify 29% by just birth date and gender, 69% with only a birth date and a 5-digit ZIP code, and 97% (53,033 voters) when the full postal code and birth date are used. These values are listed in Table 3. Clearly, the risks of re-identifying data depend both on the content of the released data and on related information available to the recipient.

Table 3. —Uniqueness of Demographic Fields in Cambridge Voter List

Birth date alone                   12%
Birth date and gender              29%
Birth date and 5-digit ZIP         69%
Birth date and full postal code    97%

A second problem with producing anonymous data concerns unique and unusual information appearing within the data themselves.
Instances of uniquely occurring characteristics found within the original data can be used by reporters, private investigators and others to discredit the anonymity of the released data even when these instances are not unique in the general population. Also, unusual cases are often unusual in other sources of data as well, making them easier to identify. Consider the database shown in Table 4. It is not surprising that the social security number is uniquely identifying, or, given the size of the database, that the birth date is also unique. To a lesser degree the ZIP codes in Table 4 identify individuals, since they are almost unique for each record. Importantly, what may not have been known without close examination of the particulars of this database is that the designation of Asian as a race is uniquely identifying. During an interview, we could imagine that the janitor, for example, might recall an Asian patient whose last name was Chan and who worked as a stockbroker for ABC Investment, since the patient had given the janitor some good investing tips.
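The "close examination of the particulars" described here is mechanical to automate: flag any field value that occurs exactly once in the release. A minimal sketch (the field and function names are ours):

```python
from collections import Counter

def uniquely_identifying_values(records, fields):
    """For each field, list the values occurring exactly once in the data:
    the kind of uniquely occurring characteristic (like Asian in Table 4)
    that can discredit anonymity even when the value is common in the
    general population."""
    flags = {}
    for field in fields:
        counts = Counter(record[field] for record in records)
        flags[field] = [value for value, n in counts.items() if n == 1]
    return flags

# The ethnicity column of Table 4:
records = [{"ethnicity": e} for e in
           ("Caucasian", "Caucasian", "Black", "Asian", "Black")]
print(uniquely_identifying_values(records, ["ethnicity"]))
# -> {'ethnicity': ['Asian']}
```

Extending the same counting to combinations of fields (pairs, triples) catches records that are unique jointly even though each value alone is common.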
Table 4. —Sample Database in which Asian is a Uniquely Identifying Characteristic

SSN          Ethnicity   Birth      Sex   ZIP
819491049    Caucasian   10/23/64   m     02138
749201844    Caucasian   03/15/65   m     02139
819181496    Black       09/20/65   m     02141
859205893    Asian       10/23/65   m     02157
985820581    Black       08/24/64   m     02138

Any single uniquely occurring value or group of values can be used to identify an individual. Consider the medical records of a pediatric hospital in which only one patient is older than 45 years of age. Or, suppose a hospital's maternity records contained only one patient who gave birth to triplets. Knowledge of the uniqueness of this patient's record may appear in many places including insurance claims, personal financial records, local census information, and insurance enrollment forms. Remember that the unique characteristic may be based on diagnosis, treatment, birth year, visit date, or some other little detail or combination of details available to the memory of a patient or a doctor, or knowledge about the database from some other source.

Measuring the degree of anonymity in released data poses a third problem when producing anonymous data for practical use. The Social Security Administration (SSA) releases public-use files based on national samples with small sampling fractions (usually less than 1 in 1,000); the files contain no geographic codes, or at most regional or size-of-place designators (Alexander et al., 1978). The SSA recognizes that data containing individuals with unique combinations of characteristics can be linked or matched with other data sources. So, the SSA's general rule is that any subset of the data that can be defined in terms of combinations of characteristics must contain at least 5 individuals.
This notion of a minimal bin size, which reflects the smallest number of individuals matching the characteristics, is quite useful in providing a degree of anonymity within data. The larger the bin size, the more anonymous the data. As the bin size increases, the number of people to whom a record may refer also increases, thereby masking the identity of the actual person. In medical databases, the minimum bin size should be much larger than the SSA guidelines suggest. Consider these three reasons: most medical databases are geographically located and so one can presume, for example, the ZIP codes of a hospital's patients; the fields in a medical database provide a tremendous amount of detail and any field can be a candidate for linking to other databases in an attempt to re-identify patients; and, most releases of medical data are not randomly sampled with small sampling fractions, but instead include most if not all of the database. Determining the optimal bin size to ensure anonymity is tricky. It certainly depends on the frequencies of characteristics found within the data as well as within other sources for re-identification. In addition, the motivation and effort required to re-identify released data in cases where virtually all possible candidates can be identified must be considered. For example, if we release data that maps each record to 10 possible people and the 10 people can be identified, then all 10 candidates may even be contacted or visited in an effort to locate the actual person. Likewise, if the mapping is 1 in 100, all 100 could be phoned since visits may then be impractical, and in a mapping of 1 in 1000, a direct mail campaign could be employed. The amount of effort the recipient is willing to spend depends on their motivation. Some medical files are quite valuable, and valuable data will merit more effort. In these cases, the minimum bin size must be further increased or the sampling fraction reduced to render these efforts useless.
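The SSA-style rule (every combination of characteristics must cover at least 5 individuals) reduces to counting bin sizes over the released fields. A sketch with made-up records:

```python
from collections import Counter

def bin_sizes(records, quasi_identifiers):
    """Bin size = number of records sharing one combination of values of
    the quasi-identifier fields."""
    return Counter(tuple(r[f] for f in quasi_identifiers) for r in records)

def undersized_bins(records, quasi_identifiers, minimum=5):
    """Combinations whose bin size falls below the minimum (the SSA general
    rule uses a minimum of 5; medical data may warrant far more)."""
    return {combo: n
            for combo, n in bin_sizes(records, quasi_identifiers).items()
            if n < minimum}

# Made-up release: six records share one (ZIP, sex) bin; one record is alone.
records = [{"zip": "02138", "sex": "m"}] * 6 + [{"zip": "02157", "sex": "m"}]
print(undersized_bins(records, ["zip", "sex"]))
# -> {('02157', 'm'): 1}
```

A release would then generalize or suppress the undersized combinations until the dictionary comes back empty.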
Of course, the expression of anonymity most semantically consistent with our intention is simply the probability of identifying a person given the released data and other possible sources. This conditional probability depends on frequencies of characteristics (bin sizes) found within the data and the outside world. Unfortunately, this probability is very difficult to compute without omniscience. In extremely large databases like that of SSA, the database itself can be used to compute frequencies of characteristics found in the general population since it contains almost all the general population; small, specialized databases, however, must estimate these values. In the next section, we will present a computer program that generalizes data based on bin sizes and estimates. Following that, we will report results using the program and discuss its limitations.

Methods

Earlier this year, Sweeney presented the Datafly System (1997), whose goal is to provide the most general information useful to the recipient. Datafly maintains anonymity in medical data by automatically aggregating, substituting and removing information as appropriate. Decisions are made at the field and record level at the time of database access, so the approach can be incorporated into role-based security within an institution as well as in exporting schemes for data leaving an institution. The end result is a subset of the original database that provides minimal linking and matching of data since each record matches as many people as the user has specified. Diagram 1 provides a user-level overview of the Datafly System. The original database is shown on the left. A user requests specific fields and records, provides a profile of the person who is to receive the data, and requests a minimum level of anonymity.
Datafly produces a resulting database whose information matches the anonymity level set by the user with respect to the recipient profile. Notice how the record containing the Asian entry was removed; social security numbers were automatically replaced with made-up alternatives; and birth dates were generalized to the year, and ZIP codes to the first three digits. In the next three paragraphs we examine the overall anonymity level and the profile of the recipient, both of which the user provides when requesting data.

Diagram 1. —The Input to the Datafly System is the Original Database and Some User Specifications, and the Output is a Database Whose Fields and Records Correspond to the Anonymity Level Specified by the User, in this Example, 0.7.

User specifications: fields & records; recipient profile; anonymity 0.7

Original Medical Database
SSN          Race        Birth      Sex   ZIP
819491049    Caucasian   10/23/64   m     02138
749201844    Caucasian   03/15/65   m     02139
819181496    Black       09/20/65   m     02141
859205893    Asian       10/23/65   m     02157
985820581    Black       08/24/64   m     02138

Resulting Database, anonymity 0.7
SSN          Race        Birth   Sex   ZIP
444444444    Caucasian   1964    m     02100
555555555    Caucasian   1965    m     02100
333333333    Black       1965    m     02100
222222222    Black       1964    m     02100
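The field-level behavior shown in Diagram 1 can be approximated in a few lines. This is a much-simplified sketch, not Sweeney's implementation: it hard-codes two generalizations (birth date to year, ZIP to three digits), ignores recipient profiles, and drops any record still carrying a field value rarer than the minimum bin size:

```python
from collections import Counter

def datafly_sketch(records, fields, min_bin):
    """Generalize birth to the year and zip to its first three digits, then
    drop (as outliers) records holding any field value that still occurs
    fewer than min_bin times."""
    generalized = [dict(r,
                        birth="19" + r["birth"].split("/")[-1],  # year only
                        zip=r["zip"][:3] + "00")                 # 3-digit ZIP
                   for r in records]
    counts = {f: Counter(r[f] for r in generalized) for f in fields}
    return [r for r in generalized
            if all(counts[f][r[f]] >= min_bin for f in fields)]

# The five records of Diagram 1 (SSNs omitted; they get one-to-one
# replacements rather than bin-size generalization):
names = ["race", "birth", "sex", "zip"]
rows = [("Caucasian", "10/23/64", "m", "02138"),
        ("Caucasian", "03/15/65", "m", "02139"),
        ("Black",     "09/20/65", "m", "02141"),
        ("Asian",     "10/23/65", "m", "02157"),
        ("Black",     "08/24/64", "m", "02138")]
records = [dict(zip(names, row)) for row in rows]
result = datafly_sketch(records, names, min_bin=2)
print(len(result))                        # the unique Asian record is dropped
print(sorted(r["birth"] for r in result))
```

With a minimum bin size of 2, the generalized years and truncated ZIPs all qualify, but Asian occurs only once in the race field, so that record is suppressed, matching the four-row result of Diagram 1.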
The overall anonymity level is a number between 0 and 1 that specifies the minimum bin size for every field. An anonymity level of 0 provides the original data, and a level of 1 forces Datafly to produce the most general data possible given the profile of the recipient. All other values of the overall anonymity level between 0 and 1 determine the minimum bin size b for each field. (The institution is responsible for mapping the anonymity level to actual bin sizes, though Sweeney provides some guidelines.) Information within each field is generalized as needed to attain the minimum bin size; outliers, which are extreme values not typical of the rest of the data, may be removed. When we examine the resulting data, every value in each field will occur at least b times, with the exception of one-to-one replacement values, as is the case with social security numbers.

Table 5 shows the relationship between bin sizes and selected anonymity levels using the Cambridge voters database. As the anonymity level increased, the minimum bin size increased, and in order to achieve the minimal bin size requirement, values within the birth date field, for example, were re-coded as shown. Outliers were excluded from the released data and their corresponding percentages of N are noted. An anonymity level of 0.7, for example, required at least 383 occurrences of every value in each field. To accomplish this in the birth date field, dates were re-coded to reflect only the birth year. Even after generalizing over a 12-month window, the values of 8% of the voters still did not meet the requirement, so these voters were dropped from the released data.

Table 5. —Anonymity Generalizations for Cambridge Voters Data with Corresponding Bin Sizes*

Anonymity   Bin size   Birth date (months)   Drop %
1.0
 .9          493        24                    4%
 .8          438        24                    2%
 .7          383        12                    8%
 .6          328        12                    5%
 .5          274        12                    4%
 .4          219        12                    3%
 .3          164         6                    5%
 .2          109         4                    5%
 .1           54         2                    5%
0.0

* The birth date generalizations (in months) required to satisfy the minimum bin size are shown, and the percentages of the total database dropped due to outliers are displayed. The user sets the anonymity level as depicted above by the slide bar at the 0.7 selection. The mapping of anonymity levels to bin sizes is determined by the institution.

In addition to an overall anonymity level, the user also provides a profile of the person who receives the data by specifying for each field in the database whether the recipient could have or would use information external to the database that includes data within that field. That is, the user estimates on which fields the recipient might link outside knowledge. Thus each field has associated with it a profile value between 0 and 1, where 0 represents full trust of the recipient or no concern over the sensitivity of the information within the field, and 1 represents full distrust of the recipient or maximum concern over the sensitivity of the field's contents. The role of these profile values is to restore the effective bin size by forcing these fields to adhere to bin sizes larger than the overall anonymity level warranted. Semantically related sensitive fields, with the exception of one-to-one replacement fields, are treated as a single concatenated field which must meet the minimum bin size, thereby thwarting linking attempts that use combinations of fields.

Consider the profiles of a doctor caring for a patient, a clinical researcher studying risk factors for heart disease, and a health economist assessing the admitting patterns of physicians. Clearly, these profiles are all different. Their selection and specificity of fields are different; their sources of outside information on which they could link are different; and their uses for the data are different. From publicly available birth certificate, driver license, and local census databases, the birth dates, ZIP codes and gender of individuals are commonly available along with their corresponding names and addresses; so these fields could easily be used for re-identification. Depending on the recipient, other fields may be even more useful, but we will limit our example to profiling these fields. If the recipient is the patient's caretaker within the institution, the patient has agreed to release this information to the caretaker, so the profile for these fields should be set to 0 to give the patient's caretaker full access to the original information. When researchers and administrators make requests that do not require the most specific form of the information as found originally within sensitive fields, the corresponding profile values for these fields should warrant a number as close to 1 as possible, but not so much so that the resulting generalizations do not provide useful data to the recipient. But researchers or administrators bound by contractual and legal constraints that prohibit their linking of the data are trusted, so if they make a request that includes sensitive fields, the profile values would ensure that each sensitive field adheres only to the minimum bin size requirement.
The goal is to provide the most general data that are acceptably specific to the recipient. Since the profile values are set independently for each field, particular fields that are important to the recipient can be allowed smaller bin sizes than other requested fields, limiting how much the data in those fields are generalized; a profile for data being released for public use, however, should be 1 for all sensitive fields to ensure maximum protection. The purpose of the profile is to quantify the specificity required in each field and to identify fields that are candidates for linking; in so doing, the profile identifies the risk to patient confidentiality associated with each release of data.

Results

Numerous tests were conducted using the Datafly System to access a pediatric medical record system (Sweeney, 1997). Datafly processed all queries to the database over a spectrum of recipient profiles and anonymity levels to show that all fields in medical records can be meaningfully generalized as needed, since any field can be a candidate for linking. Of course, which fields are most important to protect depends on the recipient. Diagnosis codes are generalized using the International Classification of Disease (ICD-9) hierarchy. Geographic replacements for states or ZIP codes generalize to regions and population size. Continuous variables, such as dollar amounts and clinical measurements, can be treated as categorical values; however, their replacements must be based on meaningful ranges in which to classify the values, and this is done only where generalizing these fields is necessary.

The Group Insurance Commission in Massachusetts (GIC) is responsible for purchasing insurance for state employees. It collected encounter-level, de-identified data with more than 100 fields of information per encounter, including the fields in Table 1, for approximately 135,000 patients consisting of state employees and their families (Lasalandra, 1997).
In a public hearing, GIC reported giving a copy of the data to a researcher, who in turn stated she did not need the full date of birth, just the birth year. The average bin size based only on birth date and gender for that population is 3; had the researcher received only the year of birth in the birth date field, the average bin size based on birth year and gender would have increased to 1,125 people. It is estimated that most of these data could be re-identified, since the collected fields also included residential ZIP code and city, occupational department or agency, and provider information. Furnishing the most general information the recipient can use minimizes unnecessary risk to patient confidentiality.
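The average bin sizes quoted for the GIC data are simply the number of records divided by the number of distinct value combinations. A small sketch with fabricated records (the GIC figures themselves come from the text and are not recomputed here):

```python
def average_bin_size(records, fields):
    """Average bin size = record count / distinct combinations of `fields`."""
    combos = {tuple(r[f] for f in fields) for r in records}
    return len(records) / len(combos)

# Fabricated illustration: coarsening full birth date to birth year reduces
# the number of distinct combinations, so the average bin size grows.
people = [
    {"birth": "1964-03-12", "year": "1964", "sex": "m"},
    {"birth": "1964-07-01", "year": "1964", "sex": "m"},
    {"birth": "1964-07-01", "year": "1964", "sex": "f"},
    {"birth": "1965-01-30", "year": "1965", "sex": "f"},
    {"birth": "1965-02-14", "year": "1965", "sex": "f"},
    {"birth": "1965-02-14", "year": "1965", "sex": "m"},
]
```

By the same ratio, an average bin size of 1,125 over roughly 135,000 patients implies only about 120 distinct (birth year, gender) combinations.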

Comparison to µ-ARGUS

In 1996, the European Union began funding an effort that involves statistical offices and universities from the Netherlands, Italy, and the United Kingdom. The main objective of this project is to develop specialized software for disclosing public-use data such that the identity of any individual contained in the released data cannot be recognized. Statistics Netherlands has already produced, but has not yet released, a first version of a program named µ-Argus that seeks to accomplish this goal (Hundepool, et al., 1996). The µ-Argus program is considered by many to be the official confidentiality software of the European community, even though Statistics Netherlands admittedly considers this first version a rough draft. A presentation of the concepts on which µ-Argus is based can be found in Willenborg and De Waal (1996).

The µ-Argus program, like the Datafly System, makes decisions based on bin sizes, generalizes values within fields as needed, and removes extreme outlier information from the released data. The user provides an overall bin size and specifies which fields are sensitive by assigning a value between 0 and 3 to each field. The program then identifies rare, and therefore unsafe, combinations by testing 2- or 3-combinations across the fields noted by the user as identifying. Unsafe combinations are eliminated by generalizing fields within the combination and by local cell suppression. Rather than removing entire records when one or more fields contain outlier information, as is done in the Datafly System, the µ-Argus System simply suppresses, or blanks out, the outlier values at the cell level. The resulting data typically contain all the rows and columns of the original data, though values may be missing from some cells. In Table 6a there are many Caucasians and many females, but only one Caucasian female in the database.
Tables 6b and 6c show the resulting databases when the Datafly System and the µ-Argus System were applied to these data. We will now step through how the µ-Argus program produced the results in Table 6c.

Table 6a. —There is Only One Caucasian Female, Even Though There are Many Females and Caucasians

SSN        Ethnicity   Birth     Sex  ZIP    Problem
819181496  Black       09/20/65  m    02141  shortness of breath
195925972  Black       02/14/65  m    02141  chest pain
902750852  Black       10/23/65  f    02138  hypertension
985820581  Black       08/24/65  f    02138  hypertension
209559459  Black       11/07/64  f    02138  obesity
679392975  Black       12/01/64  f    02138  chest pain
819491049  Caucasian   10/23/64  m    02138  chest pain
749201844  Caucasian   03/15/65  f    02139  hypertension
985302952  Caucasian   08/13/64  m    02139  obesity
874593560  Caucasian   05/05/64  m    02139  shortness of breath
703872052  Caucasian   02/13/67  m    02138  chest pain
963963603  Caucasian   03/21/67  m    02138  chest pain

Table 6b. —Results from Applying the Datafly System to the Data in Table 6a*

SSN        Ethnicity   Birth  Sex  ZIP    Problem
902387250  Black       1965   m    02140  shortness of breath
197150725  Black       1965   m    02140  chest pain
486062381  Black       1965   f    02130  hypertension
235978021  Black       1965   f    02130  hypertension
214684616  Black       1964   f    02130  obesity
135434342  Black       1964   f    02130  chest pain
458762056  Caucasian   1964   m    02130  chest pain
860424429  Caucasian   1964   m    02130  obesity
259003630  Caucasian   1964   m    02130  shortness of breath
410968224  Caucasian   1967   m    02130  chest pain
664545451  Caucasian   1967   m    02130  chest pain

* The minimum bin size is 2. The given profile identifies only the demographic fields as being likely for linking. The data are being made available for semi-public use, so the Caucasian female record was dropped as an outlier.

Table 6c. —Results from Applying the Approach of the µ-Argus System to the Data in Table 6a*

SSN  Ethnicity   Birth  Sex  ZIP    Problem
     Black       1965   m    02141  shortness of breath
     Black       1965   m    02141  chest pain
     Black       1965   f    02138  hypertension
     Black       1965   f    02138  hypertension
     Black       1964   f    02138  obesity
     Black       1964   f    02138  chest pain
     Caucasian   1964   m    02138  chest pain
                        f    02139  hypertension
     Caucasian   1964   m    02139  obesity
     Caucasian   1964   m    02139  shortness of breath
     Caucasian   1967   m    02138  chest pain
     Caucasian   1967   m    02138  chest pain

* The minimum bin size is 2. SSN was marked as being most identifying; the birth, sex, and ZIP fields were marked as being more identifying; and the ethnicity field was simply marked as identifying. Combinations across these were examined; the resulting suppressions are shown. The uniqueness of the Caucasian female is suppressed, but there still remains a unique record for the Caucasian male born in 1964 that lives in the 02138 ZIP code.

The first step is to check that each identifying field adheres to the minimum bin size. Then, pairwise combinations are examined for each pair that contains the "most identifying" field (in this case, SSN) and for those that contain the "more identifying" fields (in this case, birth date, sex, and ZIP). Finally, 3-combinations that include the "most" and "more" identifying fields are examined. Obviously, there are many possible ways to rate these identifying fields, and unfortunately different identification ratings yield different results. The ratings presented in this example produced the most secure result using the µ-Argus program, though admittedly one may argue that too many specifics remain in the data for it to be released for public use. The value of each combination is basically a bin, and bins with fewer occurrences than the minimum required bin size are considered unique and termed outliers. Clearly, all combinations that include the SSN are unique. One value of each outlier combination must be suppressed. For optimal results, the µ-Argus program suppresses values that occur in multiple outliers, with precedence given to the value occurring most often. The final result is shown in Table 6c.

The responsibility for deciding when to generalize and when to suppress lies with the user. For this reason, the µ-Argus program operates in an interactive mode so the user can see the effect of generalizing and can then choose to undo the step. We will briefly compare the results of these two systems; for a more in-depth discussion, see Sweeney (1997). The µ-Argus program checks at most 2- or 3-combinations of identifying fields, but not all 2- or 3-combinations are necessarily tested. Even if they were, there may exist unique combinations across 4 or more fields that would not be detected.
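This pairwise procedure, and the multi-field uniqueness it can miss, can be rendered as a toy Python sketch. The field ratings and the precedence rule are simplified from the description above; this is not µ-Argus's actual code, and its suppression is more aggressive than what Table 6c shows.

```python
from collections import Counter
from itertools import combinations

FIELDS = ("ethnicity", "birth", "sex", "zip")

def unsafe_pairs(records, fields, k):
    """Flag (record index, field, field) for every 2-combination whose
    value pair occurs fewer than k times."""
    flagged = []
    for f1, f2 in combinations(fields, 2):
        counts = Counter((r[f1], r[f2]) for r in records)
        flagged += [(i, f1, f2) for i, r in enumerate(records)
                    if counts[(r[f1], r[f2])] < k]
    return flagged

def suppress(records, flagged):
    """Blank one cell per unsafe pair, preferring the field involved in the
    most unsafe pairs for that record (a rough precedence rule)."""
    freq = Counter()
    for i, f1, f2 in flagged:
        freq[(i, f1)] += 1
        freq[(i, f2)] += 1
    for i, f1, f2 in flagged:
        records[i][max((f1, f2), key=lambda f: freq[(i, f)])] = None
    return records

def min_bin(records, fields):
    """Smallest bin size over the complete records for the full field set."""
    full = [r for r in records if None not in r.values()]
    return min(Counter(tuple(r[f] for f in fields) for r in full).values())

# The generalized data of Table 6a (SSN already removed, birth as year).
recs = [dict(zip(FIELDS, v)) for v in [
    ("Black", "1965", "m", "02141"), ("Black", "1965", "m", "02141"),
    ("Black", "1965", "f", "02138"), ("Black", "1965", "f", "02138"),
    ("Black", "1964", "f", "02138"), ("Black", "1964", "f", "02138"),
    ("Caucasian", "1964", "m", "02138"), ("Caucasian", "1965", "f", "02139"),
    ("Caucasian", "1964", "m", "02139"), ("Caucasian", "1964", "m", "02139"),
    ("Caucasian", "1967", "m", "02138"), ("Caucasian", "1967", "m", "02138"),
]]

flagged = unsafe_pairs(recs, FIELDS, 2)   # only the Caucasian female is flagged
suppress(recs, flagged)
```

After suppression, `min_bin(recs, FIELDS)` still returns 1: the Caucasian male born in 1964 in ZIP 02138 remains unique across all four fields together, even though only the Caucasian female was flagged by the pairwise test.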
For example, Table 6c still contains a unique record for a Caucasian male born in 1964 who lives in the 02138 ZIP code, since four characteristics combine to make this record unique, not two. Treating a subset of identifying fields as a single field that must adhere to the minimum bin size, as done in the Datafly System, appears to provide more secure releases of microdata.

Discussion

The Datafly and µ-Argus systems illustrate that medical information can be generalized so that fields and combinations of fields adhere to a minimal bin size, and by so doing, confidentiality can be maintained. Using such schemes we can even provide anonymous data for public use. There are two drawbacks to these systems, but these shortcomings may be counteracted by policy. One concern with both µ-Argus and Datafly is the determination of the proper bin size and its corresponding measure of disclosure risk. There is no standard that can be applied to ensure that the final results are adequate. What is customary is to measure risk against a specific compromising technique, such as linking to known databases, that we assume the recipient is using. Several researchers have proposed mathematical measures of risk which compute the conditional probability of the linker's success (Duncan, et al., 1987). A policy could be mandated requiring the producer of data released for public use to guarantee, with a high degree of confidence, that no individual within the data can be identified using demographic or semi-public information. Of course, guaranteeing anonymity in data requires a criterion against which to check resulting data and to locate sensitive values. If this is based only on the database itself, the minimum bin sizes and sampling fractions may be far from optimal and may not reflect the general population.
Researchers have developed and tested several methods for estimating the percentage of unique values in the general population based on a smaller database (Skinner, et al., 1992). These methods are based on subsampling techniques and equivalence class structure. In the absence of these techniques, uniqueness in the population based on demographic fields can be determined using population registers that include patients from the database, such as local census data, voter registration lists, and city directories, as well as information from motor vehicle agencies, tax assessors, and real estate agencies. To produce an anonymous database, a producer could use population registers to identify sensitive demographic values within a database, and thereby obtain a measure of risk for the release of the data.

The second drawback of the µ-Argus and Datafly systems concerns the dichotomy between researcher needs and disclosure risk. If data are explicitly identifiable, the public would expect patient consent to be required. If data are released for public use, then the producer should guarantee, with a high degree of confidence, that the identity of any individual cannot be determined using standard and predictable methods and reasonably available data. But when sensitive de-identified, though not necessarily anonymous, data are to be released, the likelihood that an effort will be made to re-identify an individual increases with the needs of the recipient, so any such recipient has a trust relationship with society and with the producer of the data. The recipient should therefore be held accountable. The Datafly and µ-Argus systems quantify this trust by profiling the fields requested by the recipient. But recall that profiling requires guesswork in identifying fields on which the recipient could link. Suppose a profile is incorrect; that is, the producer misjudges which fields are sensitive for linking. In this case, these systems might release data that are less anonymous than the recipient's situation required, and as a result individuals may be more easily identified. This risk cannot be perfectly resolved by the producer of the data, since the producer cannot always know what resources the recipient holds. The obvious demographic fields, physician identifiers, and billing information fields can be consistently and reliably protected.
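The population-register check mentioned above can be illustrated with a toy linking sketch (all names and rows here are fabricated; real registers would be voter lists, census data, and the like): a released row that matches exactly one register entry on its demographic fields is flagged as re-identifiable.

```python
from collections import defaultdict

def reidentify(released, register, keys=("birth", "sex", "zip")):
    """Link released rows to a population register on demographic fields;
    rows matching exactly one register entry are re-identifiable."""
    index = defaultdict(list)
    for person in register:
        index[tuple(person[k] for k in keys)].append(person["name"])
    risky = []
    for row in released:
        names = index[tuple(row[k] for k in keys)]
        if len(names) == 1:
            risky.append((row, names[0]))
    return risky

register = [
    {"name": "A. Smith", "birth": "1964", "sex": "m", "zip": "02138"},
    {"name": "B. Jones", "birth": "1965", "sex": "f", "zip": "02138"},
    {"name": "C. Davis", "birth": "1965", "sex": "f", "zip": "02138"},
]
released = [
    {"birth": "1964", "sex": "m", "zip": "02138", "problem": "chest pain"},
    {"birth": "1965", "sex": "f", "zip": "02138", "problem": "obesity"},
]
risky = reidentify(released, register)  # only the 1964 male links uniquely
```

The second released row matches two register entries and so survives this particular register, which is the producer's measure of risk for that release.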
However, there are too many sources of semi-public and private information, such as pharmacy records, longitudinal studies, financial records, survey responses, occupational lists, and membership lists, to account a priori for all linking possibilities. What is needed is a contractual arrangement between the recipient and the producer to make the trust explicit and share the risk. Table 7 contains some guidelines that make it clear which fields need to be protected against linking, since the recipient is required to provide such a list.

Table 7. —Contractual Requirements for Restricted-Use of Data Based on Federal Guidelines and the Datafly System

- There must be a legitimate and important research or administrative purpose served by the release of the data. The recipient must identify and explain which fields in the database are needed for this purpose.
- The recipient must be strictly and legally accountable to the producer for the security of the data and must demonstrate adequate security protection.
- The data must be de-identified. They must contain no explicit individual identifiers, nor should they contain data that would be easily associated with an individual.
- Of the fields the recipient requests, the recipient must identify which of these fields, during the specified lifetime of the data, the recipient could link to other data the recipient will have access to, whether or not the recipient intends to link to such data. The recipient must also identify those fields for which the recipient will link the data.
- The provider should have the opportunity to review any publication of information from the data to ensure that no potential disclosures are published.
- At the conclusion of the project, and no later than some specified date, the recipient must destroy all copies of the data.
- The recipient must not give, sell, loan, show, or disseminate the data to any other parties.
Using this additional knowledge and the techniques presented in the Datafly System, the producer can best protect the anonymity of patients in data even when the data are more detailed than data for public use. Since the harm to individuals can be extreme and irreparable and can occur without the individual's knowledge, the penalties for abuses must be stringent. Significant sanctions or penalties for improper use or conduct should apply, since remedy against abuse lies outside the Datafly System and resides in contracts, laws, and policies.

Acknowledgments

The author acknowledges Beverly Woodward, Ph.D., for many discussions, and thanks Patrick Thompson for editorial suggestions. The author also acknowledges the continued support of Henry Leitner and Harvard University DCE. This work has been supported by a Medical Informatics Training Grant (1 T15 LM07092) from the National Library of Medicine.

References

Alexander, L. and Jabine, T. (1978). Access to Social Security Microdata Files for Research and Statistical Purposes, Social Security Bulletin, (41), 8.

Duncan, G. and Lambert, D. (1987). The Risk of Disclosure for Microdata, Proceedings of the Bureau of the Census Third Annual Research Conference, Washington, D.C.: Bureau of the Census.

Hundepool, A. and Willenborg, L. (1996). µ- and t-ARGUS: Software for Statistical Disclosure Control, Third International Seminar on Statistical Confidentiality, Bled.

Lasalandra, M. (1997). Panel Told Releases of Medical Records Hurt Privacy, Boston Herald, Boston, (35).

National Association of Health Data Organizations. (1996). A Guide to State-Level Ambulatory Care Data Collection Activities, Falls Church, VA.

Skinner, C. and Holmes, D. (1992). Modeling Population Uniqueness, Proceedings of the International Seminar on Statistical Confidentiality, International Statistical Institute, 175–199.

Sweeney, L. (1997). Guaranteeing Anonymity When Sharing Medical Data, The Datafly System, MIT Artificial Intelligence Laboratory Working Paper, Cambridge, 344.

Willenborg, L. and De Waal, T. (1996). Statistical Disclosure Control in Practice, New York: Springer-Verlag.

Woodward, B. (1996). Patient Privacy in a Computerized World, 1997 Medical and Health Annual, Chicago: Encyclopedia Britannica, Inc., 256–259.
