and last four digits of the Social Security number, have a high likelihood of uniquely identifying an individual.4)

Errors in record-level matching may be false positives (a match is indicated when in fact the two records refer to different individuals) or false negatives (a nonmatch is indicated when the two records refer to the same individual). What is an acceptable upper limit on a given type of error depends on the application in question. For example, if the voter registration database is being checked against a database of felons or dead people, a low rate of false positives is needed to reduce the likelihood that eligible voters are removed from the VRD. Just how low a rate is acceptable is a policy choice.
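The two error types can be made concrete with a small counting sketch. It assumes a hypothetical audited sample in which the true status of each compared record pair is known; the function name and the choice of denominators (true nonmatches for the false-positive rate, true matches for the false-negative rate) are illustrative assumptions, not definitions taken from this report.

```python
def match_error_rates(predicted, actual):
    """Count matching errors over a sample of compared record pairs.

    predicted[i] is True if the matching procedure declared pair i a match.
    actual[i] is True if the two records really refer to the same individual.
    """
    pairs = list(zip(predicted, actual))
    false_positives = sum(1 for p, a in pairs if p and not a)
    false_negatives = sum(1 for p, a in pairs if a and not p)
    true_nonmatches = sum(1 for a in actual if not a)
    true_matches = sum(1 for a in actual if a)
    # Guard against empty denominators in very small samples.
    fp_rate = false_positives / true_nonmatches if true_nonmatches else 0.0
    fn_rate = false_negatives / true_matches if true_matches else 0.0
    return false_positives, false_negatives, fp_rate, fn_rate


# Hypothetical audit of five record pairs.
predicted = [True, True, False, False, True]
actual = [True, False, True, False, True]
fp, fn, fp_rate, fn_rate = match_error_rates(predicted, actual)
```

Whether a given rate is tolerable remains the policy question described above; the code only measures it.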

In this report, the term “field-level match” denotes the process of comparing individual fields, so that the “first name” field of a record in Database 1 is compared to the “first name” field of a record in Database 2. In addition, a field-level match can be indicated on the basis of different match rules, which might include:

  • Exact match: the fields are identical, character by character.

  • Fuzzy or approximate match, which is intended to deal with typographical variation. At its simplest level, it allows comparison of fields that differ by very simple errors (“Smith” versus “Smoth”). Fuzzy matching methods can be developed intuitively, as seems to be the case in many VRD applications, or they can be based on principles that computer scientists have shown to work consistently well in practice.

  • Content equivalence: “Road” and “Rd,” or “Bill” and “William,” are treated as equal.
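The three kinds of field-level rule can be sketched in a few lines. This is a minimal illustration, not the method of any actual VRD: the function names, the tiny `EQUIVALENTS` table, and the 0.8 similarity threshold are all assumptions made for the example, and the fuzzy rule uses the similarity ratio from Python's standard `difflib` module as one of many possible approximate-matching measures.

```python
from difflib import SequenceMatcher

# Hypothetical equivalence table; a real system would need a far larger,
# carefully curated list of abbreviations and nicknames.
EQUIVALENTS = {"rd": "road", "st": "street", "bill": "william", "willie": "william"}


def exact_match(a, b):
    """Exact match: identical, character by character."""
    return a == b


def fuzzy_match(a, b, threshold=0.8):
    """Approximate match: similarity ratio at or above an assumed threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def canonical(value):
    """Map a value to its canonical form using the equivalence table."""
    key = value.strip().lower().rstrip(".")
    return EQUIVALENTS.get(key, key)


def equivalent_match(a, b):
    """Content equivalence: both values reduce to the same canonical form."""
    return canonical(a) == canonical(b)
```

For instance, `exact_match("Smith", "Smoth")` fails while `fuzzy_match("Smith", "Smoth")` succeeds, and `equivalent_match("Bill", "William")` succeeds even though the two strings share few characters. The threshold is the tuning knob: lowering it catches more typographical variants at the cost of more false positives.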

The need for such rules arises for many reasons, not the least of which is that people often unintentionally provide inconsistent information when asked for it. They use nicknames, include or omit middle initials, use abbreviations or not, and so on, and they forget what they did on previous occasions. An area code for a phone number may have changed. A street address might be recorded with transposed digits in the house number, a misspelled street name, or the wrong ZIP code.

A record-level match occurs when several field-level matches are indicated. The decision about how many field-level matches are needed to define a record-level match is an important influence on the accuracy of the match. For example, a record-level match rule that required only field-level matches on first name and last name would lead to many more false positives than a rule requiring field-level matches on first name, last name, and date of birth. If the former rule were used instead of the latter to remove voters from registration lists (for example, if the voter registration list were compared against a list of state felons), many more eligible voters would be improperly removed.5 (In principle and sometimes in practice, matching algorithms can also consider differences as well as similarities. For example, if the

4

One way to estimate how many combinations exist is to start from the population of the United States, which is currently about 300 million. The number of possible four-digit SSN endings is 10,000. A plausible estimate of the number of distinct birth dates (month, day, year) is 365 × 70 ≈ 25,000. Thus, there are around 250 million possible combinations of birth date and four-digit SSN, or roughly one such combination for every American.
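The arithmetic in this estimate can be checked directly. The sketch below uses only the figures given in the footnote (a population of 300 million, 10,000 four-digit endings, 70 years of birth dates); the variable names are illustrative.

```python
US_POPULATION = 300_000_000        # approximate figure used in the footnote
SSN_ENDINGS = 10 ** 4              # possible last-four-digit values
BIRTH_DATES = 365 * 70             # distinct birth dates over about 70 years

combinations = SSN_ENDINGS * BIRTH_DATES
people_per_combination = US_POPULATION / combinations

print(f"{BIRTH_DATES:,} birth dates, {combinations:,} combinations")
print(f"{people_per_combination:.2f} people per combination")
```

Without rounding 365 × 70 down to 25,000, the count is 255.5 million combinations, or about 1.17 people per combination, which is consistent with the footnote's order-of-magnitude conclusion.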

5

An example of such a problem arose from a record-level match conducted to identify felons in the Florida voter registration database before the 2000 election. In matching the Florida VRD to a national list of felons, the applicable rule used exact field-level matches on the first four letters of the first name, middle initial, gender, and last four digits of the Social Security number (when available) and used approximate matches for last name (matching on 80 percent of the letters in the last name) and date of birth. Certain name variations were also explicitly taken into account (Willie could match William; John Richard could match Richard John). The result of this match was that approximately 15 percent of the names removed from the VRD were improperly removed. See Gregory Palast, “The Wrong Way to Fix the Vote,” The Washington Post, Sunday, June 10, 2001, Outlook section, p. 1, available at http://www.legitgov.org/palast_wrong_way_fix_vote.html. To remediate the issues raised in this case, ChoicePoint, the firm responsible for conducting the match, agreed to a very detailed set of criteria described in Box B.1.
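The contrast drawn in the main text, between a record-level rule that requires only first and last name and one that also requires date of birth, can be sketched as follows. This is an illustration under stated assumptions, not the rule used in Florida or in any actual VRD: the field names, the sample records, and the choice to treat a missing field as a nonmatch are all hypothetical.

```python
def field_equal(a, b):
    """Case-insensitive exact comparison; a missing value never matches."""
    if not a or not b:
        return False
    return a.strip().lower() == b.strip().lower()


def record_match(rec_a, rec_b, required_fields):
    """Record-level match: every required field must match at field level."""
    return all(field_equal(rec_a.get(f), rec_b.get(f)) for f in required_fields)


NAME_ONLY = ("first_name", "last_name")
NAME_AND_DOB = ("first_name", "last_name", "date_of_birth")

# Two different (hypothetical) people who happen to share a common name.
voter = {"first_name": "John", "last_name": "Smith", "date_of_birth": "1970-03-15"}
felon = {"first_name": "John", "last_name": "Smith", "date_of_birth": "1982-11-02"}

loose_result = record_match(voter, felon, NAME_ONLY)      # flags the voter
strict_result = record_match(voter, felon, NAME_AND_DOB)  # does not flag the voter
```

Under the looser rule the eligible voter would be flagged for removal, a false positive; adding date of birth to the required fields avoids it in this case, at the price of more false negatives whenever a date is recorded incorrectly.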


