BOX B.3

Blocking and String Comparators

Two methods for dealing with minor typographical variation are blocking and string comparators. The idea of blocking is to search on a few given characteristics and to use the remaining information to compute matching scores. For instance, to find candidate matches for a John Smith record from another database, a search might be performed on the initials “J” and “S” and the year of birth; all remaining information in each retrieved record is then used to compute a matching score against the John Smith record. A string comparator computes a value that reflects partial agreement between two strings. For instance, a comparison of “John” with “John” might yield a value of 1.0; a comparison of “Johm” with “John” might yield 0.90; and a comparison of “Smith” with “Smeth” might yield 0.94.
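The blocking step and the comparator described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record fields (first, last, birth_year), the function names, and the simple averaging of field scores are hypothetical, and the comparator here is a normalized edit distance, so its values will differ from the illustrative figures of 0.90 and 0.94 quoted above.

```python
from collections import defaultdict


def edit_distance(a, b):
    """Levenshtein distance: fewest single-character insertions,
    deletions, or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Partial-agreement value in [0, 1]; 1.0 means identical strings.
    Normalizing by the longer length is an assumption of this sketch."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def block_key(rec):
    """Blocking characteristics: first initial, last initial, birth year."""
    return (rec["first"][0].upper(), rec["last"][0].upper(), rec["birth_year"])


def build_index(records):
    """Group records by blocking key so a search touches only one block."""
    index = defaultdict(list)
    for rec in records:
        index[block_key(rec)].append(rec)
    return index


def candidates_with_scores(query, index):
    """Retrieve the query's block and score each candidate on the
    remaining information (here, the full first and last names)."""
    scored = []
    for rec in index.get(block_key(query), []):
        first = similarity(query["first"].lower(), rec["first"].lower())
        last = similarity(query["last"].lower(), rec["last"].lower())
        scored.append((rec, (first + last) / 2))  # unweighted mean: an assumption
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Given a query record for John Smith, only records sharing the initials “J” and “S” and the same year of birth are retrieved and scored; a record for “Johm Smeth” would stay in the candidate list with a reduced score, while a record with other initials would never be compared at all.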

To account for partial agreement, the overall matching score can be reduced from the score that would be assigned for exact character-by-character agreement on the individual fields. Widely used string comparators are edit distance and the Jaro-Winkler string comparator.1 Code for both methods is widely available on the Internet. Independent verification has consistently shown that the Jaro-Winkler comparator is 10 times as fast as edit distance and returns equally high-quality results with administrative lists similar to voter registration databases or department of motor vehicle files.
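For readers who want to see the mechanics, the following is a minimal Python sketch of the commonly published form of the Jaro-Winkler comparator. It is not the Census Bureau code cited in the footnote, which includes further adjustments, so its values need not reproduce the illustrative figures given in this box. The prefix scale of 0.1, the four-character prefix cap, and the adjusted_weight helper (a linear interpolation between hypothetical agreement and disagreement weights) are assumptions made for illustration.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1] based on matching characters and transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and close enough in position.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as
    # half a transposition each.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for strings that share a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


def adjusted_weight(agree_weight, disagree_weight, comparator_value):
    """Hypothetical helper: scale a field's contribution to the matching
    score down from the full agreement weight as the comparator value falls."""
    return disagree_weight + comparator_value * (agree_weight - disagree_weight)
```

With this helper, a comparator value of 1.0 leaves a field's full agreement weight in place, and lower values move the field's contribution toward the disagreement weight, which is one simple way to realize the reduction in the overall matching score described above.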

Other technical approaches to blocking and string comparators can be found in Cohen, Ravikumar, and Fienberg.2

  

1William E. Winkler, “String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage,” Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 354-359, 1990; William E. Winkler, “Overview of Record Linkage and Current Research Directions,” Statistical Research Division, U.S. Bureau of the Census, Washington, D.C., 2006, available at http://www.census.gov/srd/papers/pdf/rrs2006-02.pdf.

  

2William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg, “A Comparison of String Metrics for Matching Names and Addresses,” pp. 73-78 in Proceedings of the Workshop on Information Integration on the Web, International Joint Conference on Artificial Intelligence, Acapulco, Mexico, August 2003; William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg, “A Comparison of String Distance Metrics for Name-Matching Tasks,” Proceedings of the ACM Workshop on Data Cleaning, Record Linkage and Object Identification, Washington D.C., August 2003.


