B
Matching Records Across Databases

As noted in Appendix A, HAVA and the NVRA direct the states to implement a variety of procedures that require the “coordination” of voter registration databases (VRDs) with other databases. The central technical issue in such coordination (known in this appendix as “matching” or, more precisely, record-level matching) is finding individuals who are represented in both the VRD and another database (or the reverse—finding an individual who is represented in only one of these databases). (In the case of removing duplicate registrations, the “coordination” occurs within the same database.)

THE BASIC PROCESS OF MATCHING RECORDS ACROSS DATABASES WITHOUT UNIQUE IDENTIFIERS1

The basic element of a VRD is a record with data contained within specific fields associated with an individual—first name, last name, street address, date of birth, and so on. Databases may differ in the number of fields that a given record contains (for example, one database may include a field for telephone number and another might not) or in definitions of the fields (for example, one database may have one field for street name and number together (123 Main Street), and another may have separate fields for street name (Main Street) and street number (123)).

Matching records across databases (that is, record-level matching) involves the comparison of corresponding fields between databases. HAVA requires states to check the information provided on a new voter registration application against the databases of the state’s motor vehicle agency if the applicant provides a driver’s license number. An applicant must provide a driver’s license number if one is available, and the election officials must verify the applicant’s information with the state department of motor vehicles. If the applicant does not have a driver’s license, he or she must provide the last four digits of his or her Social Security number (SSN4), in which case the applicant’s information is verified with the Social Security Administration (SSA). (In practice, many DMVs handle the request. The election officials submit the verification query to the DMV, which may involve a driver’s license number or an SSN4. If the query involves SSN4, the DMV passes the request to the SSA using the AAMVAnet, a private network

1

For an overall background document that covers many elementary aspects of matching records (that is, record linkage), see William E. Winkler, “Matching and Record Linkage,” pp. 355-384 in Business Survey Methods, Brenda G. Cox et al. (eds.), Wiley, New York, 1995.






established by the American Association of Motor Vehicle Administrators, and the response from SSA is passed back to the DMV through the same network.) Individual states also have the authority to use additional databases and criteria to verify voter registration information, and often do so.2

The matching process is greatly simplified if each individual has the same unique identifier (such as the driver’s license number or the full Social Security number) in each database.3 In the absence of a unique identifier, however, it is necessary to use combinations of fields in order to match records. Matches based on the comparison of corresponding fields such as first name, last name, address, and date of birth are inherently inferential, and thus subject to higher rates of error. (Some combinations, such as first name, last name, date of birth, and last four digits of the Social Security number, have a high likelihood of uniquely identifying an individual.4)

Errors in record-level matching may be false positives (a match is indicated when in fact the two records refer to different individuals) or false negatives (a nonmatch is indicated when the two records refer to the same individual). What is an acceptable upper limit on a given type of error depends on the application in question. For example, if the voter registration database is being checked against a database of felons or dead people, a low rate of false positives is needed to reduce the likelihood that eligible voters are removed from the VRD. Just how low a rate is acceptable is a policy choice.

In this report, the term “field-level match” denotes the process of comparing individual fields, so that the “first name” field of a record in Database 1 is compared to the “first name” field of a record in Database 2. In addition, a field-level match can be indicated on the basis of different match rules, which might include:

• Exact match: the fields are exactly equal, character by character for every character.
• Fuzzy or approximate match, which is intended to deal with typographical variation. At its simplest level, it allows comparison of fields with very simple errors (“Smith” versus “Smoth”). Fuzzy matching methods can be developed intuitively, as seems to be the case in many VRD applications, or based on principles that computer scientists have shown to work consistently well in practice.
• Content equivalence: “Road” and “Rd,” or “Bill” and “William,” are treated as equal.

The need for such rules arises for many reasons, not the least of which is that when asked for information, people often provide inconsistent information unintentionally. They use nicknames, include or omit middle initials, use abbreviations or not, and so on, and they forget what they have done on previous occasions. An area code for a phone number may have changed. A street address might be recorded with digits transposed in the house number, with a street name spelled incorrectly, or with the wrong zip code.

A record-level match occurs when several field-level matches are indicated. The decision about how many field-level matches are needed to define a record-level match is an important influence on the accuracy of the match. For example, a record-level match rule that required only field-level matches on first name and last name would lead to many more false positives than a rule requiring field-level matches on first name, last name, and date of birth. If the former rule were used instead of the latter to remove voters from registration lists (for example, if the voter registration list were compared against a list of state felons), many more eligible voters would be improperly removed.5 (In principle and sometimes in practice, matching algorithms can also consider differences as well as similarities. For example, if the name and date of birth are the same but the Social Security number and gender values are inconsistent between the records, a nonmatch might be indicated under some circumstances.)

States have considerable discretion to decide for themselves the criteria to be used for matching, although these criteria cannot be used to disenfranchise legitimate voters.6 Some states will use fuzzy matching and others exact matching for checking any given data field. States also vary in the fields that they check; for example, some will compare addresses and others will not. In general, some election offices may be using match criteria without sufficient consideration of the false-positive and false-negative error rates associated with different variants of the methods.

The details of matching algorithms and the parameters used to drive them may have a substantial impact on the output of any matching process. For this reason, what appears to be a technical decision can have enormous policy significance. Box B.1 illustrates the level of detail with which match criteria must be specified.

Finally, a manual review of matches is sometimes performed. That is, under some circumstances, a voter registrar will review a match (or a nonmatch) indicated by automated processes.

Box B.1
The Detailed Nature of Match Criteria: An Illustration

As an illustration of the detail with which match criteria must be specified, consider the following criteria taken from the consent decree in National Association for the Advancement of Colored People v. Katherine Harris, Secretary of State of Florida et al. (Case No. 0-20-CIV-Gold/Simonton, United States District Court for the Southern District of Florida).

Notice of Filing Fully Executed Copy of June 28, 2002, Choicepoint Settlement Agreement . . .

9. The matching criteria described in Paragraph A.8 . . . [are] as follows:

ChoicePoint will identify all matches on the comprehensive list resulting from the processing described in Paragraphs A.2-A.7 that do not match based on all of the following data fields:

• Validated 9 digit Social Security Number
• Non-normalized (i.e., as name appears in original source data) Last Name
• Non-normalized (i.e., as name appears in original source data) First Name
• Non-normalized (i.e., as name appears in original source data) Middle Name
• Suffix
• Race
• Gender
• Date of Birth

ChoicePoint will perform Social Security Number validation in accordance with guidelines established by the Social Security Administration.

Records will be deemed to match under the criteria listed above if a middle name in one record begins with the same letter as a middle initial shown in the match record assuming all other fields listed above match.

Records will be deemed not to match under the criteria listed above if they share common blank data fields among the fields listed above, except for cases in which the middle name field or suffix field is blank in both records.

Records will be deemed not to match under the criteria listed above if one of the fields being compared contains data and the same field in the match record contains no data.

COMPLICATIONS IN MATCHING

Apart from the issues involved in the matching criteria, a variety of data issues also complicate matching. Data quality (addressed in more detail in Appendix C) is impaired by many different sources of error, including illegible handwriting, incomplete or lost forms, and keypunching errors.

Another problem occurs because certain names are quite common. For example, it is known that the name “John Smith” occurs between 30,000 and 60,000 times in national lists. This means that there are, on average, between 1.5 and 3.0 John Smiths for each date of birth. Assuming there are 500 individuals named John Smith in a given state, a certain (low) proportion of them will have the same date of birth. With certain other commonly occurring names, some chance agreements on dates of birth would be expected as well.7

This point suggests that more accurate record-level matching will take into account the possibility of chance agreement on date of birth for certain commonly occurring combinations of first and last name, which will in turn require knowledge of the most common names in any given state. Such information can easily be computed from either state-held databases (such as the department of motor vehicles (DMV) or voter registration databases, whichever is of higher quality as indicated by fewer typographical errors, more current entries, and so on) or commercially available databases (such as credit header records8).

Matches involving common names may require additional processing (perhaps manual) and involve the use of additional information not contained in databases. For instance, a prior address may confirm a match on a name when date of birth is missing. An e-mail address, phone number, or other corroborating information may confirm a match when there is a typographical error in any of the first name, last name, or date of birth.

At the same time, using other fields may entail other complications. For example, addresses may be represented differently in different databases: in one database, “123 Main Street” is held in a single address field, whereas in another database, addresses are represented in three fields (house number (“123”), street name (“Main”), and suffix (“Street”)). Address standardization is often required to fix this problem.

2

See Election Assistance Commission, Impact of the National Voter Registration Act on Federal Elections 2005-2006, Table 12, “Verification of Applications,” p. 72, available at http://www.eac.gov/clearinghouse/docs/the-impact-of-the-national-voter-registration-act-on-federal-elections-2005-2006/attachment_download/file.

3

In fact, even the full SSN is flawed as a unique identifier, as the SSA has been known from time to time to issue the same SSN to different individuals. Identity theft in which an individual appropriates someone else’s SSN has also happened. Lastly, because the SSN lacks a check digit and is most often entered manually (rather than swiped as credit cards are), typographical errors often occur with no way of catching them at the point of entry.

4

One way to estimate how many combinations exist is to consider that the population of the United States is currently approximately 300 million. The number of possible four-digit SSNs is 10,000. A plausible estimate of the number of distinct birth dates (month, day, year) is perhaps 365 x 70 = ~25,000. Thus, there are around 250 million possible combinations of birth date and four-digit SSN, which corresponds to about one such combination for every American.

5

An example of such a problem was a record-level match conducted to identify felons in the voter registration database in Florida before the 2000 election. In matching the Florida VRD to a national list of felons, the applicable rule used exact field-level matches on the first four letters of the first name, middle initial, gender, and last four digits of the Social Security number (when available) and used approximate matches for last name (matching on 80 percent of the letters in the last name) and date of birth. Certain name variations were also explicitly taken into account (Willie could match William; John Richard could match Richard John). The result of this match was that approximately 15 percent of the names removed from the VRD were improperly removed. See Gregory Palast, “The Wrong Way to Fix the Vote,” The Washington Post, Sunday, June 10, 2001, Outlook section, p. 1, available at http://www.legitgov.org/palast_wrong_way_fix_vote.html. To remediate the issues raised in this case, ChoicePoint, the firm responsible for conducting the match, agreed to a very detailed set of criteria described in Box B.1.

6

A description of the various practices employed by the various states in late 2005 can be found in Justin Levitt, Wendy R. Weiser, and Ana Muñoz, Making the List: Database Matching and Verification Processes for Voter Registration, Brennan Center, New York University, 2006, available at http://www.brennancenter.org/dynamic/subpages/download_file_49479.pdf.

7

See, for example, Michael P. McDonald, “The True Electorate: A Cross-Validation of Voter File and Election Poll Demographics,” Public Opinion Quarterly 71(4):588-602, 2007; Michael P. McDonald and Justin Levitt, “Seeing Double Voting: An Extension of the Birthday Problem,” Election Law Journal 7(2):111-122, 2008.

8

Credit headers refer to information in the credit report such as name, address, and phone number, not the credit history portion of the report.
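The field-level match rules (exact, approximate, and content equivalence) and the way they combine into a record-level rule can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not the method of any state or vendor: the function names, the tiny `EQUIVALENTS` table, the 0.75 similarity threshold, and the sample records are all hypothetical, and the approximate rule borrows the similarity ratio from Python’s standard `difflib` module as a stand-in for a purpose-built string comparator.

```python
from difflib import SequenceMatcher

# Hypothetical equivalence table; a production table would be far larger.
EQUIVALENTS = {"bill": "william", "rd": "road", "st": "street"}


def normalize(value):
    """Trim, lowercase, and drop a trailing period."""
    return value.strip().lower().rstrip(".")


def exact_match(a, b):
    """Exact rule: equal character by character."""
    return a == b


def approximate_match(a, b, threshold=0.75):
    """Approximate rule: similarity ratio at or above an assumed threshold."""
    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return ratio >= threshold


def equivalent_match(a, b):
    """Content-equivalence rule: compare after mapping known variants."""
    def canonical(value):
        words = normalize(value).split()
        return " ".join(EQUIVALENTS.get(w.rstrip("."), w.rstrip(".")) for w in words)
    return canonical(a) == canonical(b)


def record_match(rec1, rec2, rules):
    """Record-level rule: every listed field must match under its own rule."""
    return all(rule(rec1[field], rec2[field]) for field, rule in rules.items())


# Hypothetical records for illustration.
voter = {"first": "Bill", "last": "Smith", "dob": "1977-11-04"}
dmv = {"first": "William", "last": "Smoth", "dob": "1977-11-04"}
other = {"first": "Bill", "last": "Smith", "dob": "1950-01-15"}

names_only = {"first": exact_match, "last": exact_match}
names_and_dob = {"first": exact_match, "last": exact_match, "dob": exact_match}
lenient = {"first": equivalent_match, "last": approximate_match, "dob": exact_match}

links_other_person = record_match(voter, other, names_only)    # false-positive risk
rejects_other_person = not record_match(voter, other, names_and_dob)
misses_same_person = not record_match(voter, dmv, names_and_dob)  # false-negative risk
finds_same_person = record_match(voter, dmv, lenient)
```

Under these assumptions, the names-only rule links two different people who happen to share a name, the stricter exact rule misses a record that plausibly refers to the same person, and the lenient rule recovers it. Which behavior is acceptable depends on the application, as discussed above.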

Finally, the above technically oriented comments presume that the databases to be matched against the VRD are in fact available. But in the real world of state voter registration databases, fragmented state control over state social service agencies and departments of motor vehicles, and state/county tensions regarding authority over voter registration, the politics of database availability are at least as challenging as the technology for matching. Achieving integration or interoperability of the information systems of election officials and of other state and/or local agencies may be deeply problematic if strong political leadership is not available to demand cooperation. Database-providing agencies not under the authority of state election officials (whether state or county) may well give low priority to meeting the election needs of the state, resulting in difficulties for state election officials in gaining access without undue delay or difficulty.

For example, a database-providing agency may demand that election officials provide a voter registration list in a particular format that is hard or time-consuming to generate before the agency is willing to perform a match between the two databases. A more serious problem occurs when the database-providing agency is made responsible for matching the voter registration data against its own data: the agency may be unable to devote serious resources to doing so, or may lack the inclination or skills to do the matching properly. An agency may also be unmotivated to resolve or address possible interoperability problems.

THE POSSIBLE IMPACT OF INADEQUATE RECORD-LEVEL MATCHING

According to the EAC report Impact of the National Voter Registration Act on Federal Elections 2005-2006,9 there were 36,277,749 voter applications received by 45 reporting states. Among those received, there were 10,938,385 changes of address or party; 2,196,608 duplicate applications; and 1,138,955 invalid or rejected applications, resulting in a total of 17,281,234 new registrants.10 The percentage of applications not entered into the database because they were “invalid or rejected” or “duplicate applications” was about 9 percent, a total of 3,335,563 in the 45 reporting states. For comparison purposes, Table 4b from page 50 of the EAC report indicates that 333,663 people from 34 reporting states were removed from voter registration lists due to presumed felony convictions.

Once it is known that an application is not a duplicate, and not just a change of address or party, the application needs to be checked against the relevant databases. Table 12, “Verification of Applications,” on page 72 in the EAC report11 shows that each state has its own set of criteria for verifying the applications, ranging from states like Pennsylvania, which verifies only through the DMV and the SSA, to Montana, which verifies against the DMV, the SSA, Vital Records, “Match Against Voter Registration Databases,” “Tracking Returned Voter ID Cards,” “Tracking Returned Disposition Notices,” and “Verify Through Other Agency.” According to Table 13, “Data Fields for Comparison to Identify Duplications,” in the EAC report, 15 states verify using the address; 48 verify the date of birth; 38 verify the driver’s license number; 46 verify the names provided by the registrant; 40 verify “Social Security number” (although surely that is just the last four digits in most cases, since according to Table 11, pages 68-69, in the EAC report, only 7 states use the full SSN); and 10 verify “other” data.

Consider two points. First, the state with the highest rate of “invalid or rejected” applications (Table 3, p. 38, in the EAC report) also reported in this survey that it verifies application information only through the DMV and the SSA (Table 12). Second, the state reporting in this survey the highest percentage of applications rejected because they were duplicates also reports in this survey that it uses only date of birth and names provided by the applicant to identify duplications (Table 13 in the EAC report). These points do not prove a causal relationship between use of a small number of non-VRD databases or a small number of fields in verification and a high percentage of rejected applications, but presuming that the data reported are valid and accurately reported, these points raise the question of how a broader set of criteria would have changed the percentage of applications rejected.12

AN IMPORTANT EXAMPLE OF MATCHING IN PRACTICE

To illustrate the issues described above, consider a record-level match based on exact matches for an individual’s first and last name, the month and year of birth, and the last four digits of an SSN. This example is significant because HAVA requires the Social Security Administration to check the name, date of birth, and the last four digits of the SSN (“the applicable information”) in support of the federal voting process (usually to verify information for first-time voter applicants who do not provide a driver’s license number to be checked against state DMV records), and to notify the voter registrar if the person so identified is deceased. (This requirement does not mean that the SSA mechanism is the only means through which voter information can be verified; states with other mechanisms available to them can select another method. According to the Brennan Center, 24 states in late 2005 planned to use the process described above.13) The requirement of using only the last four digits of the SSN increases the number of false positives, even though the absolute number of false positives is still quite low. The limitation to the use of the last four digits of the SSN reflects a balancing between more effective matching of records and concerns about privacy.

9

Available at http://www.eac.gov/clearinghouse/docs/the-impact-of-the-national-voter-registration-act-on-federal-elections-2005-2006/attachment_download/file.

10

The EAC report also notes that it “may also have under-reported various voter registration activities because several States were in the middle of converting their local voter registration files into a statewide system in 2005. As a result, some States indicated that their local jurisdictions stopped keeping track of various registration functions and activities because they understood the State would be compiling this information” (p. 10).

11

In this and the next paragraph, the tables (and page numbers) referred to are in the EAC report Impact of the National Voter Registration Act on Federal Elections 2005-2006, available at http://www.eac.gov/clearinghouse/docs/the-impact-of-the-national-voter-registration-act-on-federal-elections-2005-2006/attachment_download/file.
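The claim that matching on only the last four digits of the SSN raises the number of false positives while keeping the absolute number low can be gauged with a back-of-the-envelope calculation in the spirit of the birthday problem. The Python sketch below is illustrative only and rests on assumptions that are not established by this appendix: birth month and year are taken as uniformly spread over 70 years (the rough span used in the footnote estimate of distinct birth dates), SSN digits are treated as uniform and independent of birth date, and the group sizes reuse the “John Smith” figures quoted earlier (500 in a state, 60,000 nationally). The function names are hypothetical.

```python
from math import comb


def expected_colliding_pairs(n_people, n_keys):
    """Expected number of pairs of people who share a key, assuming uniform keys."""
    return comb(n_people, 2) / n_keys


def prob_any_collision(n_people, n_keys):
    """Birthday-problem probability that at least two people share a key."""
    p_none = 1.0
    for i in range(n_people):
        p_none *= (n_keys - i) / n_keys
    return 1.0 - p_none


BIRTH_YEARS = 70  # assumption: same rough span as the footnote estimate

# Keys available to people who already share a first and last name.
ssn4_keys = 12 * BIRTH_YEARS * 10_000             # month/year of birth x last 4 digits
full_ssn_keys = 12 * BIRTH_YEARS * 1_000_000_000  # month/year of birth x 9 digits

state_pairs_ssn4 = expected_colliding_pairs(500, ssn4_keys)
state_prob_ssn4 = prob_any_collision(500, ssn4_keys)
national_pairs_ssn4 = expected_colliding_pairs(60_000, ssn4_keys)
national_pairs_full = expected_colliding_pairs(60_000, full_ssn_keys)
```

Under these assumptions, a state with 500 John Smiths would expect far fewer than one pair sharing month and year of birth and SSN4, whereas a national list of 60,000 would expect a couple of hundred such pairs; with the full SSN the expected number falls by a factor of 100,000. Real birth dates and names are not uniformly distributed, so these figures indicate only orders of magnitude.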
Upon receipt of the applicable information, the SSA queries its database and returns one of five responses: no match found; one unique match, death indicator absent; one unique match, death indicator present; multiple matches found with at least one lacking a death indicator; or multiple matches found but all with death indicator. As noted above, the query is based on searching for exact matches on the applicable information. At its November 2007 workshop, the committee heard testimony that this particular strategy for matching was developed by the SSA through the efforts of a working group involving the National Association of Secretaries of State, National Association of State Election Directors, American Association of Motor Vehicle Administrators, and five states. However, to the best of the committee’s knowledge, no testing of match criteria was conducted in advance of deployment, and the error rates that such a strategy would entail were unknown at the time of deployment.

This strategy has a number of limitations that would prevent records from being matched when they should be matched. For example, the search query does not account for content equivalence of names (so that Bill and William are regarded as completely different names). Using only the first and last name causes difficulty, because the number of multiple and compound names is increasing rapidly in the population. In addition, a full legal name was not originally required to obtain an SSN, and thus many SSA records do not contain the full legal names of individuals. Changes in last name (for example, of women who change their last names through marriage) are also problematic, as someone may not report a change of last name to the SSA until it is needed to determine Social Security benefits. In addition, individuals were not required until 1972 to provide the SSA with proof of identity when applying for an SSN. Finally, individuals may still have been assigned SSNs even if their applications did not contain birth date information.

12

The committee recognizes that the issue of data validity is an important one. For example, states may have reported their figures using definitions or criteria that were not uniform across all reporting jurisdictions. Issues with terminology are also known to cause difficulties for survey design. Until such matters are resolved, these data can only be regarded as providing tentative indications of possible relationships.

13

Justin Levitt, Wendy R. Weiser, and Ana Muñoz, Making the List: Database Matching and Verification Processes for Voter Registration, Brennan Center, New York University, 2006, available at http://www.brennancenter.org/dynamic/subpages/download_file_49479.pdf.
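The five-way response described above can be mimicked with a short Python sketch. It is a toy model under stated assumptions and says nothing about how the SSA system is actually built: the `SsaRecord` type, the `verify` function, and the sample records are hypothetical, and the only behavior taken from the text is the exact match on first name, last name, month and year of birth, and last four SSN digits, together with the wording of the five responses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SsaRecord:
    first: str
    last: str
    birth_month: int
    birth_year: int
    ssn4: str
    deceased: bool = False


def verify(query, records):
    """Return one of the five responses for an exact-match query.

    `query` is a tuple (first, last, birth_month, birth_year, ssn4).
    """
    hits = [
        r for r in records
        if (r.first, r.last, r.birth_month, r.birth_year, r.ssn4) == query
    ]
    if not hits:
        return "no match found"
    if len(hits) == 1:
        if hits[0].deceased:
            return "one unique match, death indicator present"
        return "one unique match, death indicator absent"
    if all(r.deceased for r in hits):
        return "multiple matches found but all with death indicator"
    return "multiple matches found with at least one lacking a death indicator"


# Hypothetical records for illustration.
records = [
    SsaRecord("William", "Smith", 11, 1977, "1087"),
    SsaRecord("Mary", "Jones", 3, 1950, "4321", deceased=True),
    SsaRecord("Mary", "Jones", 3, 1950, "4321"),
]

william = verify(("William", "Smith", 11, 1977, "1087"), records)
bill = verify(("Bill", "Smith", 11, 1977, "1087"), records)
mary = verify(("Mary", "Jones", 3, 1950, "4321"), records)
```

Because the comparison is exact, the query for “Bill” returns “no match found” even though the record for “William” agrees on every other element, which is the content-equivalence limitation noted above.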

 APPENDIX B Data provided by the SSA to the committee’s second workshop in November 2007 indicate that 55 percent of queries result in at least one match being indicated; queries using the full SSN result in a match rate of about 88 percent. The cost per query is at less than one cent ($0.0062), which is low enough to allow election officials to vary the queries themselves in the event that a nonmatch response is received (for example, querying on “Bill” if “William” did not return a match). As an example of a matching procedure in action, consider the elements of a new voter registration application card as shown on the left below and the SSA record on the right (presume these records are, in fact, supposed to refer to the same person): New Registration Card SSA Record Tom T Bowden Taylor T Bowden 3121 Escondido Way 11/04/77 11/04/77 SSN 000001087 SSN 000001087 In this case, the SSA would return a response of “no match found.” However, if the voter registrar could determine that either Tom has a middle name of Taylor or Taylor has a middle name of Tom or Thomas, then this registrar could associate these records with some degree of confidence if he or she concluded that the first and middle names have been transposed. But in the absence of other informa - tion, the registrar has no way to make such a determination. States vary in their treatment of what happens in the event that an applicant’s information cannot be matched against the SSA or DMV databases. In some cases, a state may grant the applicant a con - ditional registration that requires the voter to present an ID at the polls before voting (indeed, in some states, all first-time voters are required to present an ID at the polls, regardless of whether a match is found); others may provide a provisional ballot to the voter on election day. As of June 2009, a Florida law that requires a nonmatch to result in an applicant not being registered was being challenged. 
14 In still other cases, states register the voter without a provisional status (though they may flag first-time voters who have registered by mail). IMPROVING RECORD-LEVEL MATCHING In general, three approaches can be used to improve record-level matching: allowing more data (that is, using more data fields or more complete data fields in performing the match), improving the quality of the data contained in the relevant databases (including the use of tertiary/external data), and introducing systematic field-level matching algorithms to augment certain locally developed matching techniques. The first approach often runs afoul of privacy concerns, and it requires policy makers to be willing to make a tradeoff between less privacy and better record-level matching. In this case, experiments with using more data fields or more complete data fields are necessary to determine the incremental benefit in record-level matching (for example, using an additional field in the match or using the last six digits of the SSN for matching instead of only the last four). The second approach, improving data quality, is addressed in more detail in Appendix C. For purposes of this report, “ad hoc matching” is used to mean matching developed on the basis of intuitive reasoning that is not further validated systematically or analyzed with mathematical rigor. By contrast, systematic matching is based on a formal mathematical approach that develops metrics to measure match efficacy. With metrics in hand, policy makers can set scales for three relevant areas—what 14 See Florida State Conference of the NAACP . Browning, available at http://moritzlaw.osu.edu/electionlaw/litigation/Florida - NAACPv.Browning.php.

OCR for page 65
determines a match, what determines a nonmatch, and what is indeterminate. In addition to good techniques for dealing with typographical error (discussed in the next section), implementation of systematic techniques for matching can use some or all of the following elements:

• Use of modern matching techniques (also known in the statistical literature as techniques for record linkage). For example, a model introduced by Fellegi and Sunter15 formalizes ideas of Howard Newcombe based on likelihood ratios, which makes it somewhat easier to estimate record linkage parameters (even without training data). Training data are a large representative “truth” set of truly matching and nonmatching pairs of records. In the Fellegi-Sunter model each pair is given a score (or weight). The higher the score, the more likely a pair is to be a match.

• Use of preprocessing to standardize data elements. Preprocessing involves breaking fields into components and standardizing those components. A common application is address standardization software, in which a house-number-and-street-name type of address may be broken into house number, street name, direction words (such as East, Southwest, and so on), and street type (Drive, Avenue), each given a standard spelling or abbreviation. Other methods can facilitate use of name information.16 Although some of the methods described in this appendix are a good starting point, individual states may need specific methods for the types of idiosyncrasies and errors relevant to their own needs.

• Accounting for the relative frequency of occurrence of values of strings such as first and last names. A relatively rare name such as “Zabrinsky” has more distinguishing power than a common name such as “Smith.” The primary purpose of frequency-based (or value-specific) matching is to downweight pairs having the more commonly occurring values of strings. If one has a large file representing an entire state, then one can compute the frequency-based scores associated with different strings by comparing the entire file against itself. The entire file becomes the surrogate training data. These ideas were introduced by Newcombe and extended by Fellegi and Sunter17 and by Winkler18 (Box B.2) in demonstrating how to implement frequency-based matching. In production matching software for the Decennial Censuses (1990 and beyond), Winkler had methods that automatically created the frequency-based weights. The distinguishing power of a particular name may vary considerably by geography. In Minnesota, for example, names such as “Garcia” and “Martinez” were relatively rare and were given more distinguishing power; in California the names are much more common and are given less distinguishing power.

• Estimation of optimal matching parameters (probabilities in the Fellegi-Sunter model) for classifying pairs as matches or nonmatches. The probabilities can be computed by comparing an entire state file against itself, using a simple unsupervised learning method such as a properly applied expectation-maximization algorithm,19 or an alternative method.20 The optimal parameters have the effect of better separating matches from nonmatches. Although this improves matching, it does not yield estimates of error rates.

• Providing methods for estimating false match rates. Estimates of false match rates vary according to the matching scores (or weights). A certain false match rate will be associated with the designation of all

15 Ivan P. Fellegi and Alan B. Sunter, “A Theory for Record Linkage,” Journal of the American Statistical Association 64(328):1183-1210, December 1969.
16 See William E. Winkler, “Business Name Parsing and Standardization Software,” unpublished report, Statistical Research Division, U.S. Bureau of the Census, Washington, D.C., 1993; and William E. Winkler, “Advanced Methods for Record Linkage,” Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 467-472, 1994.
17 Ivan P. Fellegi and Alan B. Sunter, “A Theory for Record Linkage,” Journal of the American Statistical Association 64(328):1183-1210, December 1969.
18 William E. Winkler, “Frequency-based Matching in the Fellegi-Sunter Model of Record Linkage,” Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 778-783, 1989.
19 William E. Winkler, “Using the EM Algorithm for Weight Computation in the Fellegi-Sunter Model of Record Linkage,” Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 667-671, 1988.
20 William E. Winkler, “String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage,” Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 354-359, 1990.

pairs above a value U1 as matches. If all pairs above a value U2 are designated as matches where U2 > U1, then the typical result is a lower false match rate and fewer pairs designated as matches. Belin and Rubin21 and Winkler22 have given unsupervised learning methods for estimating false match rates in situations for which there are no training data.

• Providing methods for estimating false nonmatch rates. Estimates of false nonmatches may partially be accomplished via methods of Winkler,23 although these techniques may need to be modified if they are to be used on state DMV and VRD files.

• Use of indexes and keyed search strategies to speed up the matching process when necessary. Although most changes to VRDs are incremental, an operation involving entire database-to-database comparisons may sometimes be necessary. If two databases each have 5 million records, the number of possible pairs that must be compared is 25 × 10^12, a number that is much too large to search with most computer systems available to states. Optimized candidate selection strategies may be needed to reduce significantly the number of pairs that must be compared if the databases involved are large.

• Use of automated name-matching logic that is guided and enhanced by culturally sensitive syntactic and semantic knowledge that accounts for different naming conventions. As discussed in Appendix C (“Data Capture and Quality”), different cultures have different conventions for how names are formed. For example,

- Common American naming conventions regard certain names as equivalent (for example, Bill, Billy, and Will for William). Automated name rooting and name equivalency tables could be used to automatically generate common variants of a given name. Such tables would greatly reduce the need for multiple manual queries using name variants.

- Hispanic and Asian naming conventions for what parts of a name should be considered a surname do not fit easily into the common American pattern of “first name, middle name, last name.” Automated name ordering could be used to automatically generate permutations of all types of ethnic surnames from the text string that makes up the complete name.

Implementation of name rooting and name ordering at the SSA would benefit all states that verify voter registration information using the SSA. Notably, name rooting and name ordering could be used as a component of any intrastate query mechanism as well.

MATCHING IN THE PRESENCE OF TYPOGRAPHICAL ERROR

One of the most difficult problems in matching is finding appropriate matches in the presence of typographical errors in the data. If the amount of typographical error in the files to be compared is small, then it is relatively easy to find pairs that agree on name and date-of-birth characteristics (for example).24 However, if there is significant typographical error, then it is not possible to bring together pairs using straightforward character-by-character matching on name and date of birth. For instance if first name,

21 Thomas R. Belin and Donald B. Rubin, “A Method for Calibrating False-Match Rates in Record Linkage,” Journal of the American Statistical Association 90(430):694-707, 1995.
22 William E. Winkler, “Automatic Estimation Record Linkage False Match Rates,” Proceedings of the Section on Survey Research Methods, American Statistical Association, CD-ROM. Also available at http://www.census.gov/srd/papers/pdf/rrs2007-05.pdf.
23 William E. Winkler, “Matching and Record Linkage,” pp. 355-384 in Business Survey Methods, Brenda G. Cox et al. (eds.), Wiley, New York, 1995; William E. Winkler, “Approximate String Comparator Search Strategies for Very Large Administrative Lists,” Proceedings of the Section on Survey Research Methods, American Statistical Association, 2004.
24 Whether these pairs in fact refer to the same person is an entirely separate question, because name and date of birth do not uniquely identify an individual. For instance, a given large state may have 1,000 individuals with the name “John Smith,” and it is likely that some of the “John Smith” pairs will agree on date of birth. It may well be necessary to conduct other follow-up (such as manual examination of other data fields such as street address and zip code) or to use data from third parties to help delineate the true match status of the pair.

Box B.2
Accounting for Commonly Occurring Names

The earliest computerized record linkage methods do effectively account for the commonly occurring name plus “chance” date-of-birth phenomenon. Newcombe’s matching classification rule was to use the fields in pairs of records to compute a matching score. The idea was that agreement on individual fields was more likely to occur among “truly matching” pairs. Pairs above a certain upper bound were designated as matches; pairs below a certain lower bound were designated as nonmatches; and pairs with in-between scores were held for clerical review (when auxiliary information might be used to fill in missing information or “correct” contradictory information). If the upper bound is raised, then the false positive (false match) rate decreases. If the lower bound is decreased, then the false negative (false nonmatch) rate decreases.

The frequencies (probabilities) used in computing the scores can be estimated a priori using the frequencies in the large administrative lists, recognizing that matters such as “the list of most common names” will change slowly over time (which requires periodic adjustment of that set and the probabilities that those names will occur). Efficiently computed frequencies (conditional probabilities) are optimal in the sense that they can minimize the size of the clerical review region. Further, in many situations such as with voter registration databases or department of motor vehicles files, it is possible to estimate or give reasonable approximations of the error rates even without training data.2 The earliest matching parameter and error-rate estimation procedures are the easiest to implement and most likely appropriate for VRD

1 Howard B. Newcombe et al., “Automatic Linkage of Vital Records,” Science 130(3381):954-959, October 1959; Howard B. Newcombe and James M. Kennedy, “Record Linkage: Making Maximum Use of the Discriminating Power of Identifying Information,” Communications of the Association for Computing Machinery 5(11):563-566, November 1962.
2 William E. Winkler, “Comparative Analysis of Record Linkage Decision Rules,” Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 829-834, 1992; William E. Winkler, “Improved Decision Rules in the Fellegi-Sunter Model of Record Linkage,” Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 274-279, 1993; William E. Winkler, “Automatic Estimation Record Linkage False Match Rates,” Proceedings of the Section on Survey Research Methods, American Statistical Association, CD-ROM, 2006, also at http://www.census.gov/srd/papers/pdf/rrs2007-05.pdf; Thomas R. Belin and Donald B. Rubin, “A Method for Calibrating False-Match Rates in Record Linkage,” Journal of the American Statistical Association 90(430):694-707, 1995.

last name, and year of birth each have 3 percent typographical error, then about 9 percent (3 fields times 3 percent error in each field) of truly matching pairs may be missed with exact character-by-character matching. An example of typographical error is provided in Box B.3.

To overcome some of the difficulties caused by typographical error, modern techniques for matching are based on the computation of a score that indicates the degree of match rather than the generation of a yes-no result for any given comparison. Comparisons can be made at the level of individual fields or at the record level.

String comparators compare text strings within individual fields; the Jaro-Winkler (JW) and edit-distance string comparators have been described elsewhere,25 and code (C, C++, Java) is readily available on the Internet. The text strings to be compared are arbitrary, and in particular can represent names (or parts of names) or dates of birth (in some standardized format). These techniques provide

25 William E. Winkler, “String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage,” Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 354-359, 1990; William E. Winkler, “Overview of Record Linkage and Current Research Directions,” Statistical Research Division, U.S. Bureau of the Census, Washington, D.C., 2006, available at http://www.census.gov/srd/papers/pdf/rrs2006-02.pdf.

files. The most general version of the parameter estimation procedures3 generalizes the iterative scaling procedures of Della Pietra et al.4

The frequency-based methods5 automatically adjust match scores downward for the most frequently occurring first and last names. The effect of the downward adjustment is that pairs of records that are associated with commonly occurring names such as “James Smith” fall into an indeterminate region in which additional information (possibly via clerical review and contacting the voter) is required to determine matching status. In many situations, it is straightforward to obtain the extra matching information for the indeterminate pairs. Most other (much less commonly occurring) names can be matched effectively because the false positive rate is much less than 0.004 percent when using the combination of name, date of birth, and last four digits of the SSN (that is, typically they uniquely identify).

If the state VRD files can be examined a priori, then for each common first-name-last-name combination, we can find the most frequent dates of birth and lower the matching score of the associated pairs of records. We first lower the matching score for the common name combination and then again for the common dates of birth. To match the pairs with the lowered matching scores, we would need additional corroborating information such as telephone number or middle initial. If driver’s license number or the last four digits of the SSN are available, then string comparators can be used to check whether the pairs of corresponding numbers are almost the same. The corroborating information might vary somewhat in differing states. In particular, some states request telephone numbers and/or e-mail addresses. In this situation, it is possible to repeat analogous procedures to raise the worst-case false positive probabilities for certain specific name-date-of-birth combinations while significantly reducing the false match probabilities associated with the same name but different dates-of-birth combinations. This approach has the effect of significantly increasing the number of pairs of records for which match status can effectively be computed.

3 William E. Winkler, “On Dykstra’s Iterative Fitting Procedure,” The Annals of Probability 18(1):1410-1415, July 1990; William E. Winkler, “Improved Decision Rules in the Fellegi-Sunter Model of Record Linkage,” Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 274-279, 1993.
4 Stephen Della Pietra et al., “Inducing Features of Random Fields,” IEEE Transactions on Pattern Analysis and Machine Intelligence 19(4):380-393, April 1997.
5 Howard B. Newcombe et al., “Automatic Linkage of Vital Records,” Science 130(3381):954-959, October 1959; Howard B. Newcombe and James M. Kennedy, “Record Linkage: Making Maximum Use of the Discriminating Power of Identifying Information,” Communications of the Association for Computing Machinery 5(11):563-566, November 1962.

an automated mechanism for reducing the overall matching score from the score associated with exact character-by-character agreements on individual fields to account for partial agreement, thus accounting for very minor typographical error between two strings that are nearly the same. For instance, a comparison of “John” with “John” might yield a value of 1.0; a comparison of “Johm” with “John” might yield 0.90; and a comparison of “Smith” with “Smeth” might yield 0.94. These techniques often outperform ad hoc methods of “fuzzy matching.” The Jaro-Winkler comparator is a fast alternative (as much as 10 times faster) to “edit distance,” which measures the minimum number of insertions, deletions, and substitutions needed to get from one string to another. Jaro-Winkler returns equally high-quality results with administrative lists of the types that are similar to voter registration databases or department of motor vehicles files.26

26 William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg, “A Comparison of String Metrics for Matching Names and Addresses,” Proceedings of the Workshop on Information Integration on the Web, International Joint Conference on Artificial Intelligence, Acapulco, Mexico, pp. 73-78, August 2003; William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg, “A Comparison of String Distance Metrics for Name-Matching Tasks,” Proceedings of the ACM Workshop on Data Cleaning, Record Linkage and Object Identification, Washington D.C., August 2003.
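The two string comparators discussed above can be illustrated with a minimal Python sketch. It assumes the commonly published basic form of the Jaro-Winkler comparator (a prefix scale of 0.1 and a prefix length capped at 4) and a textbook dynamic-programming edit distance; the function names are illustrative and are not taken from any Census Bureau software. Production comparators add further adjustments, so the scores from this sketch need not reproduce the illustrative values quoted in the text.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or keep) ca
        prev = curr
    return prev[-1]


def jaro(a: str, b: str) -> float:
    """Basic Jaro similarity in [0, 1]."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    # Characters count as matching only if they are close in position.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit = [False] * len(a)
    b_hit = [False] * len(b)
    matches = 0
    for i, ca in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_hit[j] and b[j] == ca:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are transpositions.
    a_seq = [c for c, hit in zip(a, a_hit) if hit]
    b_seq = [c for c, hit in zip(b, b_hit) if hit]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(a: str, b: str, prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro score boosted for agreement on the first few characters."""
    score = jaro(a, b)
    prefix = 0
    for ca, cb in zip(a[:max_prefix], b[:max_prefix]):
        if ca != cb:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1.0 - score)


if __name__ == "__main__":
    for x, y in [("John", "John"), ("Johm", "John"), ("Smith", "Smeth")]:
        print(x, y, edit_distance(x, y), round(jaro_winkler(x, y), 3))
```

Both functions return a graded result rather than a yes-no answer, which is the property the text relies on: a single mistyped character lowers the field score slightly instead of turning agreement into outright disagreement.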

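The frequency-based scoring and three-way classification described in Box B.2 can be sketched as follows. This is a simplified illustration, not the Newcombe or Winkler implementation: it assumes a single hypothetical agreement probability among true matches (`m_prob`), uses each value's relative frequency in the file as its chance-agreement probability, and applies a fixed hypothetical penalty for disagreement. The thresholds and the toy file are invented for the example.

```python
import math
from collections import Counter


def value_weights(values, m_prob=0.95):
    """Agreement weight log2(m / u_v) for each value v.

    u_v is approximated by the relative frequency of v in the file, so the
    file itself serves as surrogate training data. m_prob is an assumed value.
    """
    counts = Counter(values)
    total = sum(counts.values())
    return {v: math.log2(m_prob / (c / total)) for v, c in counts.items()}


def pair_score(rec_a, rec_b, weights, disagreement_weight=-3.0):
    """Sum value-specific weights over agreeing fields; penalize disagreements."""
    score = 0.0
    for field, table in weights.items():
        if rec_a[field] == rec_b[field]:
            score += table[rec_a[field]]
        else:
            score += disagreement_weight  # hypothetical fixed penalty
    return score


def classify(score, lower, upper):
    """Upper and lower bounds split pairs into three regions."""
    if score >= upper:
        return "match"
    if score <= lower:
        return "nonmatch"
    return "clerical review"


# Toy file: one very common name, one moderately common, one rare.
voter_file = (
    [{"first": "John", "last": "Smith"}] * 40
    + [{"first": "Maria", "last": "Garcia"}] * 9
    + [{"first": "Ada", "last": "Zabrinsky"}] * 1
)
weights = {f: value_weights([r[f] for r in voter_file]) for f in ("first", "last")}

common = pair_score(voter_file[0], voter_file[1], weights)   # two "John Smith" records
rare = pair_score(voter_file[-1], voter_file[-1], weights)   # "Ada Zabrinsky" with itself

if __name__ == "__main__":
    print(round(common, 2), classify(common, lower=0.0, upper=8.0))
    print(round(rare, 2), classify(rare, lower=0.0, upper=8.0))
```

With these assumed numbers, agreement on the common name earns too little weight to clear the upper bound and the pair is held for clerical review, while agreement on the rare name is enough to designate a match. Raising `upper` shrinks the match region, which corresponds to the lower false match rate described in the box.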
Box B.3
Example of Typographical Error

      First name    Last name    Date of birth
      __________    _________    _____________
1a.   Robert        Smith        042964
1b.   Rovert        Snith        0422963
2a.   Susan         Janes        bbbb977
2b.   Sue           Jones        067976

NOTE: Date format is mmddyyyy; “b” represents missing.

Comparisons at the record level are often based on a multiple pass strategy (sometimes called blocking or binning) in which pairs are brought together via characteristics that are believed to contain less typographical error and the remaining (or all) information in pairs is used in computing a matching score.27 For instance, a search might be performed on first initials “J” and “S” and year of birth to retrieve records for which all remaining information is considered to compute a matching score against a record in another database for John Smith.

Blocking increases the number of possible pairs to be considered over what would be obtained if perfect agreement between fields were required. For example, a given blocking pass may bring together pairs that agree exactly on the date-of-birth field and also on the first character of the surname field. Because the first character of the surname typically is less likely to be in error (or is assumed to be so), this criterion is insensitive to some basic kinds of typographical error, e.g., “Smith” versus “Smoth.” For each of these pairs, a matching score is computed using the rest of the information in the available data fields. For example, the first-name field and the entire surname field are compared using string comparators, and the match score between the pair may be defined as the sum of the two field-level scores.

When record-level match scores are available for individual pairs, a threshold can be established (on the basis of experience) for the minimum score necessary for a pair to be considered a match.

It is common to use multiple passes through the data using different criteria. For example, a set of blocking criteria might be as follows:

• Pass 1: date of birth and first character of surname. As indicated above, this pass accounts for typographical errors in any part of the first name and in any part of the surname except the first character. Thus, it captures Bob and Rubert for Robert, and Smoth for Smith.

• Pass 2: day of birth, month of birth, and first three characters of surname. This pass accounts for errors in the year of birth, which is known to be less accurate in many computer files than the day of birth and month of birth. However, using only day of birth and month of birth would usually result in too many pairs for efficient computation, and so a part of the surname is used to reduce that number. Thus, this pass accounts for first names and last names with typographical error in the last portions of these fields and for reporting/transcribing variations in year of birth.

27 Howard B. Newcombe et al., “Automatic Linkage of Vital Records,” Science 130(3381):954-959, October 1959; Howard B. Newcombe and James M. Kennedy, “Record Linkage: Making Maximum Use of the Discriminating Power of Identifying Information,” Communications of the Association for Computing Machinery 5(11):563-566, November 1962; Ivan P. Fellegi and Alan B. Sunter, “A Theory for Record Linkage,” Journal of the American Statistical Association 64(328):1183-1210, December 1969.

• Pass 3: first three characters of surname and first three characters of first name. This pass accounts for any errors at all in the birth date (such as mistranscribed year of birth and mistakenly exchanged day of birth and month of birth).

In practice, the ordering of these passes matters. After Pass 1 is completed, all of the matching pairs above the relevant threshold score (indicating a match) are removed from the dataset and Pass 2 is performed on the remaining data. (More precisely, if a given pair exceeding threshold is composed of “name-a” and “name-b,” name-a is removed from File A and name-b is removed from File B. This process is repeated for all matching pairs, and then Pass 2 is performed on the reduced File A and File B.) Pass 3 operates on a similarly reduced File A and File B, except that these reduced files do not include names found in pairs that matched on Pass 2 and Pass 1.

At the end of these multiple passes, all of the pairs exceeding the relevant threshold for one of the blocking criteria are considered matches. In general, the number of such pairs will be larger (sometimes substantially larger) than the number of pairs that would result if the matching criterion simply specified an exact match on first name, last name, middle initial, and date of birth.

Other technical approaches to blocking and string comparators can be found in Fienberg et al.28

MATCHING RECORDS USING THIRD-PARTY DATA

The use of blocking and string comparators is likely to generate a number of possible matches that may well be too large to investigate comprehensively through human review. In such cases, it may be possible to use third-party data (such as telephone books, credit header records, records of property ownership, and so on, discussed further in Appendix C) to resolve many of these ambiguities without human intervention, thus improving match accuracy.

For example, consider the two records R-1 and R-2 in Box B.4. If a human judge were faced with such a possible match, he or she might make a manual request from the neighboring county to compare signatures, or contact the voter, or prepare a letter to send to both addresses. However, if a search of a tertiary data source such as credit header data turned up record R-3, it would provide fairly strong evidence that records R-1 and R-2 in fact refer to the same individual. Alternatively, if the search turned up record R-4, it would provide some confidence that records R-1 and R-2 did not refer to the same person.

Note that the use of tertiary data in such a manner does not depend on a pairwise comparison between two data sources. Many list comparison systems are designed to compare one input file to another. If there is a third input file to process, the first output file is then compared to the third file (i.e., again a pairwise comparison). The approach illustrated above, a simple case of entity resolution, considers the data from all sources as their union (in the logical, set-theoretical sense).29

MATCHING RECORDS WITH UNIQUE IDENTIFIERS

Many of the difficulties described above can be reduced or eliminated through the use of a unique identifier (UID) for every voter, such as a driver’s license number. If every voter has a single UID, records for a voter can be matched more simply.

In practice, even UIDs are sometimes improperly keyed in transcribing from a handwritten application or improperly recorded on the application (for example, because digits were transposed or one digit

28 William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg, “A Comparison of String Metrics for Matching Names and Addresses,” pp. 73-78 in Proceedings of the Workshop on Information Integration on the Web, International Joint Conference on Artificial Intelligence, Acapulco, Mexico, August 2003; William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg, “A Comparison of String Distance Metrics for Name-Matching Tasks,” Proceedings of the ACM Workshop on Data Cleaning, Record Linkage and Object Identification, Washington D.C., August 2003.
29 Jeff Jonas blog entry, Entity Resolution Systems vs. Match Merge/Merge Purge/List De-Duplication Systems (http://jeffjonas.typepad.com/jeff_jonas/2007/09/entity-resoluti.html).

Box B.4
Illustrative Records

Record R-1: As written on the registration form (County A)
  Daniel R Smith
  23 Post Street
  My City
  DLN 0873457345
  DOB 6/944

Record R-2: As captured by the Social Security Administration (County B)
  Dan Randal Smith
  456 Adele Lane
  Your City
  SSN4 5657
  DOB 6/944

Record R-3: As provided by credit header data (version 1 of Record R)
  Daniel Randal Smith
  DOB 6/944
  Current address: 23 Post Street, My City
  Previous address: 456 Adele Lane, Your City
  SSN4 5657

Record R-4: As recorded by credit header data (version 2 of Record R)
  Daniel Richard Smith
  DOB 6/944
  Current address: 23 Post Street, My City
  Previous address: 789 Temple Hills, Some Other City
  SSN4 1212

is illegible). If there is an error in the UID, a search could be performed using the name and the date of birth to find all possible UIDs associated with those names and dates, and then to find the UID that is most similar to the one recorded in error; that UID would likely be the “correct” UID for the person in question.

A more general strategy would be needed when there is a possibility of typographical error in every field. The matching strategy is to search the entire file and apply suitable proximity metrics that indicate that the UID, first name, last name, and date of birth are sufficiently close to the query record. The feasibility of this strategy depends on the frequency with which invalid UIDs are encountered, because it is not practical to sequentially read every record in the database and perform substantial computation on every record in the file for every query.

The most general strategy involves substantial restructuring of the database to facilitate fast searches. Keys such as first character of first name plus last name plus date of birth, telephone number, or house number plus street name are defined and added to the database to allow fast searches. Using all appropriate fields, only records with proximity scores sufficiently close to the query record are retrieved for review. Defining the keys and the order in which they are applied requires a certain amount of experience and skill.
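The keyed search described above, and the three-pass blocking scheme it supports, can be sketched in Python. This is an illustration under stated assumptions rather than a production design: records are plain dictionaries with hypothetical `first`, `last`, and `dob` (mmddyyyy) fields, the standard library's `difflib.SequenceMatcher` stands in for a Jaro-Winkler comparator, and the threshold of 1.5 is arbitrary. The three key functions mirror Pass 1, Pass 2, and Pass 3 as described in the text.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# One key function per pass; dob is assumed to be an mmddyyyy string.
PASSES = [
    lambda r: (r["dob"], r["last"][:1]),        # Pass 1: full DOB + surname initial
    lambda r: (r["dob"][:4], r["last"][:3]),    # Pass 2: month/day + 3 chars of surname
    lambda r: (r["last"][:3], r["first"][:3]),  # Pass 3: 3 chars of surname and first name
]


def similarity(a, b):
    """Stand-in string comparator returning a value in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def score(rec_a, rec_b):
    """Record-level score: sum of the first-name and surname field scores."""
    return (similarity(rec_a["first"], rec_b["first"])
            + similarity(rec_a["last"], rec_b["last"]))


def multipass_match(file_a, file_b, threshold=1.5):
    """Return (index_a, index_b, pass_number) for every pair accepted as a match."""
    remaining_a = dict(enumerate(file_a))
    remaining_b = dict(enumerate(file_b))
    matches = []
    for pass_no, block_key in enumerate(PASSES, start=1):
        # Build a keyed index over File B so that only records sharing the
        # blocking key are compared, instead of every possible pair.
        index = defaultdict(list)
        for j, rec_b in remaining_b.items():
            index[block_key(rec_b)].append(j)
        for i, rec_a in list(remaining_a.items()):
            candidates = [j for j in index.get(block_key(rec_a), []) if j in remaining_b]
            if not candidates:
                continue
            best = max(candidates, key=lambda j: score(rec_a, remaining_b[j]))
            if score(rec_a, remaining_b[best]) >= threshold:
                matches.append((i, best, pass_no))
                # Matched records are removed from both files before later passes.
                del remaining_a[i]
                del remaining_b[best]
    return matches


file_a = [
    {"first": "Robert", "last": "Smith", "dob": "04211964"},
    {"first": "Susan", "last": "Jones", "dob": "06171976"},
    {"first": "John", "last": "Garcia", "dob": "01011980"},
]
file_b = [
    {"first": "Rovert", "last": "Smith", "dob": "04211964"},   # typo in first name
    {"first": "Susan", "last": "Jones", "dob": "06171967"},    # year digits transposed
    {"first": "Johnny", "last": "Garcia", "dob": "02021980"},  # different month and day
]

if __name__ == "__main__":
    print(multipass_match(file_a, file_b))
```

In this toy data each pair is recovered by a different pass, which is the point of ordering the passes from most to least restrictive. In a real database the dictionary index would typically be replaced by indexed key columns added to the tables, as the text suggests.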