Blocking and String Comparators
The two methods for dealing with minor typographical variation are blocking and string comparators. The idea of blocking is to search on a few given characteristics and then use the remaining information to compute matching scores. For instance, to find matches for a John Smith record from another database, a search might be performed on the initials “J” and “S” and the year of birth; all remaining information in each retrieved record is then used to compute a matching score against the John Smith record. A string comparator computes a value for the partial agreement between two strings. For instance, a comparison of “John” with “John” might yield a value of 1.0; a comparison of “Johm” with “John” might yield 0.90; and a comparison of “Smith” with “Smeth” might yield 0.94.
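The blocking step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (the keys first, last, yob, and street), the sample data, and the function names are all hypothetical, and the blocking key (first initial, last initial, year of birth) simply mirrors the example in the text.

```python
from collections import defaultdict

def blocking_key(record):
    """Hypothetical blocking key: first initial, last initial, year of birth."""
    return (record["first"][0].upper(), record["last"][0].upper(), record["yob"])

def build_index(records):
    """Group records by blocking key so one lookup retrieves a whole block."""
    index = defaultdict(list)
    for rec in records:
        index[blocking_key(rec)].append(rec)
    return index

def candidates(index, query):
    """Return only the records that share the query's blocking key."""
    return index.get(blocking_key(query), [])

# Made-up records standing in for the file being searched.
file_b = [
    {"first": "John", "last": "Smith", "yob": 1970, "street": "12 Oak St"},
    {"first": "Johm", "last": "Smeth", "yob": 1970, "street": "12 Oak St"},
    {"first": "Jane", "last": "Stone", "yob": 1970, "street": "4 Elm Ave"},
    {"first": "John", "last": "Smith", "yob": 1982, "street": "9 Pine Rd"},
    {"first": "Mary", "last": "Jones", "yob": 1970, "street": "12 Oak St"},
]

index = build_index(file_b)
query = {"first": "John", "last": "Smith", "yob": 1970, "street": "12 Oak St"}

# Only records with initials J and S and birth year 1970 are retrieved;
# the remaining fields (full names, street) would then be scored in detail.
block = candidates(index, query)
for rec in block:
    print(rec["first"], rec["last"], rec["street"])
```

Note that the misspelled record (“Johm Smeth”) still lands in the same block because the key uses only initials and year of birth, while records that differ on the key are never compared at all. That is the trade-off of blocking: far fewer comparisons, at the risk of missing a true match whose blocking fields contain an error.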
To account for partial agreements, the overall matching score can be reduced from the score associated with exact character-by-character agreement on individual fields. Two widely used string comparators are edit distance and the Jaro-Winkler string comparator.1 Code for both methods is widely available on the Internet. Independent verification has consistently shown that the Jaro-Winkler comparator is 10 times as fast as edit distance and returns equally high-quality results on administrative lists similar to voter registration databases or department of motor vehicle files.
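As a rough illustration of the two comparators named above, the following Python sketch implements a textbook edit distance (Levenshtein) and a basic Jaro-Winkler comparator. It is not the code referred to in the text: the prefix scale of 0.1 and the maximum prefix length of 4 are the commonly cited defaults and are assumptions here, and the function adjusted_weight is a hypothetical linear interpolation showing one simple way a field's agreement score could be reduced for partial agreement.

```python
def edit_distance(s, t):
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def edit_similarity(s, t):
    """Edit distance rescaled to [0, 1], where 1.0 means identical strings."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - edit_distance(s, t) / longest

def jaro(s, t):
    """Basic Jaro similarity in [0, 1]."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_matched = [False] * len(s)
    t_matched = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo, hi = max(0, i - window), min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_matched[j] and t[j] == ch:
                s_matched[i] = t_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as half-transpositions.
    s_chars = [c for c, m in zip(s, s_matched) if m]
    t_chars = [c for c, m in zip(t, t_matched) if m]
    transpositions = sum(a != b for a, b in zip(s_chars, t_chars)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s, t, p=0.1, max_prefix=4):
    """Jaro similarity boosted for a shared prefix (p and max_prefix are assumed defaults)."""
    j = jaro(s, t)
    prefix = 0
    for a, b in zip(s, t):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1.0 - j)

def adjusted_weight(agree_weight, disagree_weight, similarity):
    """Hypothetical rule: interpolate between the disagreement and full-agreement weights."""
    return disagree_weight + similarity * (agree_weight - disagree_weight)

for a, b in [("John", "John"), ("Johm", "John"), ("Smith", "Smeth")]:
    print(a, b, round(jaro_winkler(a, b), 4), round(edit_similarity(a, b), 4))
```

With these assumed parameters the basic variant gives roughly 0.88 for “Johm” versus “John” and roughly 0.89 for “Smith” versus “Smeth”, close to but not the same as the illustrative values quoted earlier; the exact numbers depend on the variant of the comparator and on any additional adjustments it applies.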
Other technical approaches to blocking and string comparators can be found in Fienberg et al.2