established by the American Association of Motor Vehicle Administrators, and the response from SSA is passed back to the DMV through the same network.) Individual states also have the authority to—and often do—use additional databases and criteria to verify voter registration information.2

The matching process is greatly simplified if each individual has used the same unique identifier (such as the driver’s license number or the full Social Security number) in each database.3 In the absence of a unique identifier, however, records must be matched on combinations of fields. Matches based on comparing corresponding fields such as first name, last name, address, and date of birth are inherently inferential and thus subject to higher rates of error. (Some combinations, such as first name, last name, date of birth, and last four digits of the Social Security number, have a high likelihood of uniquely identifying an individual.4)
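The back-of-the-envelope estimate in footnote 4 can be reproduced in a few lines. This is only a sketch of that arithmetic: the variable names are illustrative, and the figures (a population of about 300 million, about 70 birth years) are the footnote's own rough assumptions, not measured values.

```python
# Rough figures taken from footnote 4; all are approximations.
US_POPULATION = 300_000_000      # approximate U.S. population
SSN_LAST4_VALUES = 10 ** 4       # possible last-four-digit SSN values
BIRTH_DATES = 365 * 70           # distinct birth dates over about 70 years

# Possible (birth date, last four SSN digits) combinations.
combinations = SSN_LAST4_VALUES * BIRTH_DATES

# Average number of people per combination if people were spread evenly,
# which is itself a simplifying assumption.
people_per_combination = US_POPULATION / combinations

print(f"{combinations:,} combinations")
print(f"about {people_per_combination:.2f} people per combination")
```

Without rounding, the product is slightly above the footnote's figure of 250 million, but the conclusion is the same: on average, roughly one person per combination, before first and last name are taken into account.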

Errors in record-level matching may be false positives (a match is indicated when in fact the two records refer to different individuals) or false negatives (a nonmatch is indicated when the two records refer to the same individual). The acceptable upper limit on a given type of error depends on the application in question. For example, if the voter registration database is being checked against a database of felons or deceased individuals, a low rate of false positives is needed to reduce the likelihood that eligible voters are removed from the VRD. Just how low a rate is acceptable is a policy choice.
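The two error types can be made concrete with a short sketch. It assumes that, for some audited sample of record pairs, both the matcher's decision and the true answer are known; the function name and the sample data are hypothetical and are not drawn from any actual VRD.

```python
def error_rates(pairs):
    """Return (false positive rate, false negative rate).

    pairs: iterable of (declared_match, same_person) booleans, where
    declared_match is the matcher's decision and same_person is the truth.
    """
    pairs = list(pairs)
    # False positive: match declared, but the records are different people.
    fp = sum(1 for declared, same in pairs if declared and not same)
    # False negative: no match declared, but the records are the same person.
    fn = sum(1 for declared, same in pairs if not declared and same)
    different = sum(1 for _, same in pairs if not same)
    identical = sum(1 for _, same in pairs if same)
    fp_rate = fp / different if different else 0.0
    fn_rate = fn / identical if identical else 0.0
    return fp_rate, fn_rate


# Made-up audit outcomes, for illustration only.
sample = [(True, True), (True, False), (False, True),
          (False, False), (True, True)]
fp_rate, fn_rate = error_rates(sample)
print(fp_rate, fn_rate)
```

When the match is used to remove voters, the false positive rate is the figure of most concern, since each false positive may correspond to an eligible voter removed from the list.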

In this report, the term “field-level match” denotes the process of comparing individual fields, so that the “first name” field of a record in Database 1 is compared to the “first name” field of a record in Database 2. A field-level match can be indicated on the basis of different match rules, which might include:

  • Exact match: the fields are identical, character by character.

  • Fuzzy or approximate match: intended to deal with typographical variation. At its simplest, it allows fields that differ by a very simple error (“Smith” versus “Smoth”) to be treated as matching. Fuzzy matching methods can be developed intuitively, as seems to be the case in many VRD applications, or can be based on principles that computer scientists have shown to work consistently well in practice.

  • Content equivalence: “Road” and “Rd,” or “Bill” and “William,” are treated as equal.
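The three kinds of match rule listed above might be sketched as follows. This is a minimal illustration, not a description of any state's actual procedure: the function names, the one-edit threshold, and the small equivalence table are all assumptions made for the example, and the fuzzy rule uses edit (Levenshtein) distance as one common choice among many.

```python
# Hypothetical equivalence table; a real system would need a much larger,
# curated list of nicknames and address abbreviations.
EQUIVALENTS = {"rd": "road", "st": "street", "bill": "william"}


def normalize(value: str) -> str:
    """Trim whitespace, drop trailing periods, and lowercase."""
    return value.strip().rstrip(".").lower()


def exact_match(a: str, b: str) -> bool:
    """Fields are identical, character by character."""
    return a == b


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character insertions,
    deletions, or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def fuzzy_match(a: str, b: str, max_edits: int = 1) -> bool:
    """Fields differ by at most max_edits typographical edits."""
    return edit_distance(normalize(a), normalize(b)) <= max_edits


def content_equivalent(a: str, b: str) -> bool:
    """Fields are equal after mapping known variants to one form."""
    ca = EQUIVALENTS.get(normalize(a), normalize(a))
    cb = EQUIVALENTS.get(normalize(b), normalize(b))
    return ca == cb


print(exact_match("Smith", "Smoth"))        # False
print(fuzzy_match("Smith", "Smoth"))        # True
print(content_equivalent("Road", "Rd."))    # True
print(content_equivalent("Bill", "William"))  # True
```

A looser threshold in the fuzzy rule catches more typographical errors but also treats more distinct values as the same, so the choice of threshold bears directly on the false positive rate.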

The need for such rules arises for many reasons, not the least of which is that when asked for information, people often provide inconsistent information unintentionally. They use nicknames, include or omit middle initials, use abbreviations or not, and so on, and they forget what they have done on previous occasions. An area code for a phone number may have changed. A street address might be recorded with digits transposed in the house number, with the street name misspelled, or with the wrong ZIP code.

A record-level match occurs when several field-level matches are indicated. The decision about how many field-level matches are needed to define a record-level match is an important influence on the accuracy of the match. For example, a record-level match rule that required only field-level matches on first name and last name would lead to many more false positives than a rule requiring field-level matches on first name, last name, and date of birth. If the former rule were used instead of the latter to remove

2 See Election Assistance Commission, Impact of the National Voter Registration Act on Federal Elections 2005-2006, Table 12, “Verification of Applications,” p. 72, available at http://www.eac.gov/clearinghouse/docs/the-impact-of-the-national-voter-registration-act-on-federal-elections-2005-2006/attachment_download/file.

3 In fact, even the full SSN is flawed as a unique identifier, as the SSA has been known from time to time to issue the same SSN to different individuals. Identity theft in which an individual appropriates someone else’s SSN has also happened. Lastly, because the SSN lacks a check digit and is most often entered manually (rather than swiped as credit cards are), typographical errors often occur with no way of catching them at the point of entry.

4 One way to estimate how many combinations exist is as follows. The population of the United States is currently approximately 300 million. The number of possible values for the last four digits of the SSN is 10,000. A plausible estimate of the number of distinct birth dates (month, day, year) is perhaps 365 × 70 ≈ 25,000. Thus, there are around 250 million possible combinations of birth date and four-digit SSN, or roughly one such combination for every American.


