• Pass 3: first three characters of surname and first three characters of first name. This pass accounts for any errors at all in the birth date (such as mistranscribed year of birth and mistakenly exchanged day of birth and month of birth).

In practice, the ordering of these passes matters. After Pass 1 is completed, all of the matching pairs above the relevant threshold score (indicating a match) are removed from the dataset, and Pass 2 is performed on the remaining data. (More precisely, if a given pair exceeding the threshold is composed of “name-a” and “name-b,” name-a is removed from File A and name-b is removed from File B. This process is repeated for all matching pairs, and then Pass 2 is performed on the reduced File A and File B.) Pass 3 operates on a similarly reduced File A and File B, from which the names found in pairs that matched on either Pass 1 or Pass 2 have also been removed.

At the end of these multiple passes, all of the pairs exceeding the relevant threshold for one of the blocking criteria are considered matches. In general, the number of such pairs will be larger—sometimes substantially larger—than the number of pairs that would result if the matching criterion simply specified an exact match on first name, last name, middle initial, and date of birth.
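The sequential removal described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure used by any particular election office: the similarity function `score` is a stand-in built on the standard library's `difflib`, the threshold of 0.8 is arbitrary, the field names are hypothetical, and `key_dob` is a hypothetical earlier blocking key (only the Pass 3 key, `key_name3`, follows the text). For simplicity the sketch also accepts at most one match per record in each pass.

```python
from difflib import SequenceMatcher

def score(a, b):
    """Stand-in similarity in [0, 1]; an assumption, not the report's comparator."""
    fields = ("first", "last", "dob")
    return sum(SequenceMatcher(None, a[f], b[f]).ratio() for f in fields) / len(fields)

def key_dob(rec):
    """Hypothetical earlier-pass blocking key: exact date of birth."""
    return rec["dob"]

def key_name3(rec):
    """Pass 3 key: first three characters of surname and of first name."""
    return (rec["last"][:3].lower(), rec["first"][:3].lower())

def multipass_match(file_a, file_b, block_keys, threshold=0.8):
    """Run blocking passes in order, removing matched records before the next pass."""
    remaining_a, remaining_b = list(file_a), list(file_b)
    matches = []
    for key in block_keys:
        # Group the remaining File B records by blocking key.
        blocks = {}
        for rec in remaining_b:
            blocks.setdefault(key(rec), []).append(rec)
        used_a, used_b = set(), set()
        for a in remaining_a:
            # Only records sharing the blocking key are compared.
            for b in blocks.get(key(a), []):
                if id(b) not in used_b and score(a, b) >= threshold:
                    matches.append((a, b))
                    used_a.add(id(a))
                    used_b.add(id(b))
                    break
        # Matched records do not take part in later passes.
        remaining_a = [r for r in remaining_a if id(r) not in used_a]
        remaining_b = [r for r in remaining_b if id(r) not in used_b]
    return matches, remaining_a, remaining_b
```

Calling `multipass_match(file_a, file_b, [key_dob, key_name3])` runs the hypothetical date-of-birth pass first and the name-prefix pass second, so a pair that disagrees on birth date can still be found in the later pass, provided neither record was already matched.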

Other technical approaches to blocking and string comparators can be found in Cohen, Ravikumar, and Fienberg.28

MATCHING RECORDS USING THIRD-PARTY DATA

The use of blocking and string comparators is likely to generate a number of possible matches that may well be too large to investigate comprehensively through human review. In such cases, it may be possible to use third-party data (such as telephone books, credit header records, records of property ownership, and so on, discussed further in Appendix C) to resolve many of these ambiguities without human intervention, thus improving match accuracy.

For example, consider the two records R-1 and R-2 in Box B.4. A human judge faced with such a possible match might manually request signatures from the neighboring county for comparison, contact the voter, or prepare a letter to send to both addresses. However, if a search of a tertiary data source such as credit header data turned up record R-3, it would provide fairly strong evidence that records R-1 and R-2 in fact refer to the same individual. Alternatively, if the search turned up record R-4, it would provide some confidence that records R-1 and R-2 did not refer to the same person.

Note that the use of tertiary data in such a manner does not depend on a pairwise comparison between two data sources. Many list comparison systems are designed to compare one input file to another. If there is a third input file to process, the first output file is then compared to the third file (i.e., again a pairwise comparison). The approach illustrated above—a simple case of entity resolution—considers the data from all sources as their union (in the logical, set-theoretical sense).29
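A minimal Python sketch may help show how treating all sources as a union differs from pairwise file comparison. Everything in it is an illustrative assumption: the records are invented and are not the records of Box B.4, the field names are hypothetical, and the linking rule (same name and birth date plus at least one shared address) is just one plausible rule. Records from every source go into a single pool, and a union-find structure groups any records connected by a chain of links.

```python
from itertools import combinations

def linked(r, s):
    """Hypothetical rule: same name and birth date plus a shared address."""
    same_person_fields = (r["name"], r["dob"]) == (s["name"], s["dob"])
    return same_person_fields and bool(r["addresses"] & s["addresses"])

def resolve(records):
    """Cluster records from all sources at once using union-find."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if linked(records[i], records[j]):
            parent[find(i)] = find(j)

    clusters = {}
    for i, rec in enumerate(records):
        clusters.setdefault(find(i), []).append(rec["id"])
    return sorted(clusters.values())

# Invented records: two voter records at different addresses, and a
# third-party record listing both addresses for the same name and birth date.
voter_a = {"id": "voter-A", "name": "PAT DOE", "dob": "1960-04-01",
           "addresses": {"12 Oak St"}}
voter_b = {"id": "voter-B", "name": "PAT DOE", "dob": "1960-04-01",
           "addresses": {"9 Elm Ave"}}
third = {"id": "third-party", "name": "PAT DOE", "dob": "1960-04-01",
         "addresses": {"12 Oak St", "9 Elm Ave"}}

print(resolve([voter_a, voter_b]))         # two separate clusters
print(resolve([voter_a, voter_b, third]))  # one cluster of three records
```

Under this rule the two voter records are never linked directly. They end up in the same cluster only because the third-party record links to each of them, which is the kind of indirect evidence the text describes.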

MATCHING RECORDS WITH UNIQUE IDENTIFIERS

Many of the difficulties described above can be reduced or eliminated through the use of a unique identifier (UID) for every voter, such as a driver’s license number. If every voter has a single UID, records for a voter can be matched more simply.
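When every record carries a correctly recorded UID, matching reduces to an exact lookup. The short Python sketch below assumes a hypothetical `uid` field on each record and at most one File B record per UID. It makes no attempt to handle miskeyed identifiers.

```python
def match_by_uid(file_a, file_b, uid_field="uid"):
    """Pair records whose UID values are exactly equal (assumes unique UIDs in File B)."""
    index_b = {rec[uid_field]: rec for rec in file_b}
    return [(a, index_b[a[uid_field]]) for a in file_a if a[uid_field] in index_b]
```

No blocking or string comparison is needed here, because the dictionary lookup compares each File A record against only the single File B record with the same identifier.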

In practice, even UIDs are sometimes improperly keyed in transcribing from a handwritten application or improperly recorded on the application (for example, because digits were transposed or one digit

28 William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg, “A Comparison of String Metrics for Matching Names and Addresses,” pp. 73-78 in Proceedings of the Workshop on Information Integration on the Web, International Joint Conference on Artificial Intelligence, Acapulco, Mexico, August 2003; William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg, “A Comparison of String Distance Metrics for Name-Matching Tasks,” Proceedings of the ACM Workshop on Data Cleaning, Record Linkage and Object Identification, Washington, D.C., August 2003.

29 Jeff Jonas, blog entry, “Entity Resolution Systems vs. Match Merge/Merge Purge/List De-Duplication Systems” (http://jeffjonas.typepad.com/jeff_jonas/2007/09/entity-resoluti.html).



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.