Chapter 12

Tutorial on Record Linkage

Authors:

Martha E. Fair and Patricia Whitridge, Statistics Canada




Record Linkage Techniques—1997: Proceedings of an International Workshop and Exposition


Tutorial on Record Linkage: Slides Presentation

Glossary of Terms

There are various terms used in record linkage. Some of these have been defined in: Newcombe, H.B. (1988). Handbook of Record Linkage Methods for Health and Statistical Studies, Administration and Business. Oxford, U.K.: Oxford University Press, pp. 103–106. The terms used in that book are as follows:

Blocking. The use of sequencing information (e.g., the phonetically coded versions of the surnames) to divide the files into “pockets.” Normally, records are only compared with each other where they are from the same “pocket,” i.e., have identical blocking information. The purpose is to avoid having to compare the enormous numbers of record pairs that would be generated if every record in the file initiating the searches were allowed to pair with every record in the file being searched.

Denominator. This usually refers to the denominator in a FREQUENCY RATIO, i.e., the frequency of a given comparison outcome among UNLINKABLE pairs of records brought together at random. It may be applied also to one of the two components of any ODDS.

Frequency Ratio. The frequency of a given comparison outcome among correctly LINKED pairs of records, divided by the corresponding frequency among UNLINKABLE pairs brought together at random. The comparison outcome may be defined in any way, for example as a full agreement, a partial agreement, a more extreme disagreement, or any combination of values from the two records that are being compared. The FREQUENCY RATIO may be specific for the particular value of an identifier when it agrees, or for the value of the agreement portion of an identifier that partially agrees, or it may be non-specific for value.

General Frequency. A weighted mean of the frequencies of the various values of an identifier among the individual (i.e., unpaired) records of the file being searched. It is non-specific for value. Value-specific frequencies are also obtained from the same source.

Global Frequency. The frequency of a comparison outcome among pairs of records, when that outcome is defined in terms that are non-specific for the value of the identifier. The outcome may be a full agreement, a partial agreement, or a more extreme disagreement. The record pairs may be those of a LINKED file, or they may be UNLINKABLE pairs brought together at random. Only in the special case of the full agreement outcomes are the global and the general frequencies numerically equal, but they always remain conceptually different. The difference is that a global frequency, although value non-specific, always reflects the full definition of the outcome, including the non-agreement portion of that definition. A general frequency cannot do this because it is based on a file of single (i.e., unpaired) records.

Global Frequency Ratio. The ratio of the global frequency for a particular comparison outcome among LINKED pairs of records, divided by the corresponding frequency among UNLINKABLE pairs. It is equivalent to the global ODDS. GLOBAL FREQUENCY RATIOS for agreement outcomes and partial agreement outcomes are often subsequently converted to their value-specific counterparts during the linkage process. The conversion is accomplished by means of an adjustment upwards where the agreement portion of the identifier has a rare value, and an adjustment downwards where the value is common.
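
The Blocking entry above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names (blocking_key, candidate_pairs), the record layout (dictionaries with a "surname" field), and the key itself are hypothetical. In particular, the key is a crude stand-in for a phonetic surname code and is not Soundex, NYSIIS, or any scheme named in the source.

```python
from collections import defaultdict


def blocking_key(surname):
    """Crude stand-in for a phonetic surname code (an assumption, not a real scheme).

    Keeps the first letter, drops vowels and the letters Y, H, W from the rest,
    and truncates the result to four characters.
    """
    letters = [c for c in surname.upper() if c.isalpha()]
    if not letters:
        return ""
    head, tail = letters[0], letters[1:]
    kept = [c for c in tail if c not in "AEIOUYHW"]
    return (head + "".join(kept))[:4]


def candidate_pairs(search_file, file_being_searched):
    """Yield only the record pairs whose blocking keys are identical."""
    # Divide the file being searched into "pockets" keyed by blocking information.
    pockets = defaultdict(list)
    for record in file_being_searched:
        pockets[blocking_key(record["surname"])].append(record)

    # Each search record is compared only with records from its own pocket.
    for record in search_file:
        for other in pockets.get(blocking_key(record["surname"]), []):
            yield record, other


if __name__ == "__main__":
    search = [{"id": 1, "surname": "Smith"}, {"id": 2, "surname": "Jones"}]
    searched = [
        {"id": "a", "surname": "Smyth"},
        {"id": "b", "surname": "Johns"},
        {"id": "c", "surname": "Brown"},
    ]
    for left, right in candidate_pairs(search, searched):
        print(left["id"], right["id"])
```

With two search records and three records being searched, an unblocked comparison would generate six pairs; here only the pairs that share a pocket are produced.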

Linkage. In its broadest sense, RECORD LINKAGE is the bringing together of information from two or more records that are believed to relate to the same “entity.” For an economic or social study, the “entities” in question might be farms or businesses. For a health study, the “entities” of special interest are usually individual people or families. It is in the latter sense that the word is used throughout this book.

Linked. In line with the above definition of “record linkage,” LINKED pairs of records are pairs believed to relate to the same individual or family (or other kind of entity). Record pairs brought together and judged not to relate to the same individual or family may be referred to as “UNLINKABLE” pairs. For short, the two sorts of pairs are sometimes called “LINKS” and “NON-LINKABLE,” respectively. As used here, the term implies that some sort of decision has been reached concerning the likely correctness of the match.

Matched. This word is variously used in the literature on record linkage. In this book, however, it is given no special technical meaning and merely implies a pairing of records on the basis of some stated similarity (or dissimilarity). For example, early in a linkage operation, records from the two files being LINKED are normally matched for agreement of the surname code. The resulting pairs may also be called “candidate pairs” for linkage, but this emphasis is most appropriate in the later stages when the numbers of competing pairs have diminished. Pairs of records will frequently be spoken of as “correctly matched,” “falsely matched,” or “randomly matched.”

Numerator. This usually refers to the numerator in a FREQUENCY RATIO, i.e., the frequency of a given comparison outcome among pairs of records believed to be correctly LINKED. It may be applied also to one of the two components of any ODDS.

Odds. This word is used in its ordinary sense but is applied in a number of situations. As relating to a particular outcome from the comparison of a given identifier, it is synonymous with the FREQUENCY RATIO for that outcome. As relating to the accumulated FREQUENCY RATIOS for a given record pair, it refers to the overall RELATIVE ODDS. It is also applied to the overall ABSOLUTE ODDS.

Outcome. This refers to any outcome or result from the comparison of a particular identifier (or concatenated identifiers) on a pair of records, or the comparison of a particular identifier on one record with a different but logically related identifier on the other. It may be defined in almost any way, for example as an AGREEMENT, a PARTIAL AGREEMENT, a more extreme DISAGREEMENT, any other SIMILARITY or DISSIMILARITY, or the absence of an identifier on one record as compared with its presence or absence on the other. An outcome may be specific for a particular value of an identifier (e.g., as it appears on the search record) or for any part of that identifier, especially where there is an agreement or partial agreement; it may be non-specific for value; or it may even be specific for a particular kind of DISAGREEMENT defined in terms of any pair of values being compared.

Value. An identifier (e.g., an initial) may be said to have a number of different “values” (e.g., initial “A,” initial “B,” and so on). Surnames, given names, and places of birth have many possible values. Other identifiers tend to have fewer values that need to be distinguished from each other.
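
The FREQUENCY RATIO and RELATIVE ODDS definitions can be made concrete with a small Python sketch. It assumes that the outcome counts among LINKED and UNLINKABLE pairs are already available, and it takes the "accumulated" frequency ratios for a record pair to mean their product, as the glossary's description of weights suggests. The function names and all counts are hypothetical and chosen only for illustration.

```python
import math


def frequency_ratio(linked_count, linked_total, unlinkable_count, unlinkable_total):
    """Frequency of an outcome among LINKED pairs divided by its frequency
    among UNLINKABLE pairs brought together at random."""
    if linked_total <= 0 or unlinkable_total <= 0:
        raise ValueError("pair totals must be positive")
    if unlinkable_count <= 0:
        raise ValueError("outcome never observed among unlinkable pairs")
    numerator = linked_count / linked_total            # NUMERATOR of the ratio
    denominator = unlinkable_count / unlinkable_total  # DENOMINATOR of the ratio
    return numerator / denominator


def relative_odds(frequency_ratios):
    """Overall RELATIVE ODDS for one record pair: the product of the
    frequency ratios for the outcomes observed on that pair."""
    return math.prod(frequency_ratios)


if __name__ == "__main__":
    # Illustrative counts only; they do not come from the source.
    surname_agrees = frequency_ratio(950, 1000, 10, 1000)
    birth_year_disagrees = frequency_ratio(50, 1000, 900, 1000)
    print(relative_odds([surname_agrees, birth_year_disagrees]))
```

A ratio above 1 argues for a link and a ratio below 1 argues against it, so an agreement on a strongly discriminating identifier can outweigh a disagreement elsewhere on the same pair.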

Weight. In the literature, this term has been widely applied to the logarithms of various entities, such as: a FREQUENCY RATIO for a specified outcome from the comparison of a given identifier; the product of all the FREQUENCY RATIOS for a given record pair; the NUMERATOR of a particular FREQUENCY RATIO; the DENOMINATOR of a particular FREQUENCY RATIO; and any estimate of such a numerator or denominator not obtained directly from a file of matched pairs of records. The use of the logarithm is merely a convenience when doing the arithmetic; it does not affect the logic except to make it appear more complicated. The term “WEIGHT” has therefore been employed sparingly in this book. Instead, reference has been made directly to the source frequency or FREQUENCY RATIO, or to the estimates of these, wherever possible.
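
The remark that the logarithm is only an arithmetic convenience can be checked with a brief Python sketch: adding the weights for a record pair gives the same result as taking the logarithm of the product of its frequency ratios. The base-2 logarithm is an assumption made here, since the glossary does not name a base, and the function names and ratio values are hypothetical.

```python
import math


def weight(frequency_ratio, base=2.0):
    """Weight of one outcome: the logarithm of its frequency ratio.

    Base 2 is an assumed default, not something the glossary specifies.
    """
    return math.log(frequency_ratio, base)


def total_weight(frequency_ratios, base=2.0):
    """Sum of the per-outcome weights for one record pair."""
    return sum(weight(ratio, base) for ratio in frequency_ratios)


if __name__ == "__main__":
    ratios = [95.0, 0.5, 8.0]  # illustrative frequency ratios only
    summed = total_weight(ratios)
    from_product = weight(math.prod(ratios))
    # Summing logarithms and taking the logarithm of the product agree.
    print(math.isclose(summed, from_product))
```

Ranking record pairs by total weight therefore gives the same order as ranking them by the product of their frequency ratios.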
