B Matching Records Across Databases

As noted in Appendix A, HAVA and the NVRA direct the states to implement a variety of procedures that require the "coordination" of voter registration databases (VRDs) with other databases. The central technical issue in such coordination (known in this appendix as "matching" or, more precisely, record-level matching) is finding individuals who are represented in both the VRD and another database (or the reverse: finding an individual who is represented in only one of these databases). (In the case of removing duplicate registrations, the "coordination" occurs within the same database.)

THE BASIC PROCESS OF MATCHING RECORDS ACROSS DATABASES WITHOUT UNIQUE IDENTIFIERS [1]

The basic element of a VRD is a record with data contained within specific fields associated with an individual: first name, last name, street address, date of birth, and so on. Databases may differ in the number of fields that a given record contains (for example, one database may include a field for telephone number and another might not) or in the definitions of the fields (for example, one database may have one field for street name and number together (123 Main Street), and another may have separate fields for street name (Main Street) and street number (123)). Matching records across databases (that is, record-level matching) involves the comparison of corresponding fields between databases. HAVA requires states to verify the information provided on a new voter registration application by checking the applicable information against the records of the state's motor vehicle agency, in the case of a driver's license number, or against those of the Social Security Administration (SSA), in the case of the last four digits of the Social Security number.
Individual states also have the authority to use, and often do use, additional databases and criteria to verify voter registration information.[2] The matching process is greatly simplified if each individual has used the same unique identifier (such as the driver's license number or the full Social Security number) in each database.[3] However, in the absence of a unique identifier, it is necessary to use combinations of fields in order to match records. Matches based on the comparison of corresponding fields such as first name, last name, address, and date of birth are inherently inferential, and thus subject to higher rates of error. (Some combinations, such as first name, last name, date of birth, and last four digits of the Social Security number, have a high likelihood of uniquely identifying an individual.[4])

[1] For an overall background document that covers many elementary aspects of matching records (that is, record linkage), see William E. Winkler, "Matching and Record Linkage," pp. 355-384 in Business Survey Methods, Brenda G. Cox et al. (eds.), Wiley, New York, 1995.
[2] See Election Assistance Commission, Impact of the National Voter Registration Act on Federal Elections 2005-2006, Table 12, "Verification of Applications," p. 72, available at http://www.eac.gov/clearinghouse/docs/the-impact-of-the-national-voter-registration-act-on-federal-elections-2005-2006/attachment_download/file.
[3] In fact, even the full SSN is flawed as a unique identifier, as the SSA has been known from time to time to issue the same SSN to different individuals. Identity theft in which an individual appropriates someone else's SSN has also happened. Lastly, because the SSN lacks a check digit and is most often entered manually (rather than swiped as credit cards are), typographical errors often occur with no way of catching them at the point of entry.

STATE VOTER REGISTRATION DATABASES: IMMEDIATE ACTIONS AND FUTURE IMPROVEMENTS

Errors in record-level matching may be false positives (a match is indicated when in fact the two records refer to different individuals) or false negatives (a nonmatch is indicated when the two records refer to the same individual). What is an acceptable upper limit on a given type of error depends on the application in question. For example, if the voter registration database is being checked against a database of felons or dead people, a low rate of false positives is needed to reduce the likelihood that eligible voters are removed from the VRD. Just how low a rate is acceptable is a policy choice.

In this report, the term "field-level match" denotes the process of comparing individual fields, so that the "first name" field of a record in Database 1 is compared to the "first name" field of a record in Database 2. In addition, a field-level match can be indicated on the basis of different match rules, which might include:

• Exact match: the fields are exactly equal, character by character for every character.
• Fuzzy or approximate match, which is intended to deal with typographical variation. At its simplest level, it allows comparison of fields with very simple errors ("Smith" versus "Smoth"). Fuzzy matching methods can be developed intuitively, as seems to be the case in many VRD applications, or based on principles that computer scientists have shown to work consistently well in practice.
• Content equivalence: "Road" and "Rd," or "Bill" and "William," are treated as equal.

The need for such rules arises for many reasons, not the least of which is that when asked for information, people often provide inconsistent information unintentionally.
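The three field-level rules listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the one-edit threshold for the fuzzy rule, and the tiny equivalence table are hypothetical choices made for the example, not rules taken from any state system.

```python
def exact_match(a: str, b: str) -> bool:
    """Exact rule: equal character by character."""
    return a == b


def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions (Levenshtein)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or keep
        prev = cur
    return prev[-1]


def fuzzy_match(a: str, b: str, max_edits: int = 1) -> bool:
    """Fuzzy rule (assumed threshold): at most `max_edits` single-character edits."""
    return edit_distance(a.lower(), b.lower()) <= max_edits


# Hypothetical equivalence table; a real one would be far larger.
EQUIVALENTS = {"rd": "road", "bill": "william"}


def canonical(value: str) -> str:
    """Map a value to its canonical form using the equivalence table."""
    key = value.strip().lower().rstrip(".")
    return EQUIVALENTS.get(key, key)


def content_equivalent(a: str, b: str) -> bool:
    """Content-equivalence rule: both values share a canonical form."""
    return canonical(a) == canonical(b)
```

For instance, `fuzzy_match("Smith", "Smoth")` and `content_equivalent("Bill", "William")` both hold, while `exact_match` rejects both pairs. Which rule is appropriate for a given field remains a policy and data-quality question.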
People use nicknames, include or omit middle initials, use abbreviations or not, and so on, and forget what they have done on previous occasions. An area code for a phone number may have changed. A street address might be recorded with digits transposed in the house number, or a street name spelled incorrectly, or with the wrong Zip code.

A record-level match occurs when several field-level matches are indicated. The decision about how many field-level matches are needed to define a record-level match is an important influence on the accuracy of the match. For example, a record-level match rule that required only field-level matches on first name and last name would lead to many more false positives than a rule requiring field-level matches on first name, last name, and date of birth. If the former rule were used instead of the latter to remove voters from registration lists (for example, if the voter registration list were compared against a list of state felons), many more eligible voters would be improperly removed.[5] (In principle and sometimes in practice, matching algorithms can also consider differences as well as similarities. For example, if the name and date of birth are the same but the Social Security number and gender values are inconsistent between the records, a nonmatch might be indicated under some circumstances.)

States have considerable discretion to decide for themselves the criteria to be used for matching, although these criteria cannot be used to disenfranchise legitimate voters.[6] Some states will use fuzzy matching and others exact matching for checking any given data field. States also vary in the fields that they check; for example, some will compare addresses and others will not. In general, some election offices may be using match criteria without sufficient consideration of possible false-positive and false-negative error rates with any variants of the methods.

[4] One way to estimate how many combinations exist is to consider that the population of the United States is currently approximately 300 million. The number of possible four-digit SSNs is 10,000. A plausible estimate of the number of distinct birth dates (month, day, year) is perhaps 365 × 70 ≈ 25,000. Thus, there are around 250 million possible combinations of birth date and four-digit SSN, which corresponds to about one such combination for every American.
[5] An example of such a problem was a case with a record-level match conducted to identify felons in the voter registration database in Florida before the 2000 election. In matching the Florida VRD to a national list of felons, the applicable rule used exact field-level matches on the first four letters of the first name, middle initial, gender, and last four digits of the Social Security number (when available) and used approximate matches for last name (matching on 80 percent of the letters in the last name) and date of birth. Certain name variations were also explicitly taken into account (Willie could match William; John Richard could match Richard John). The result of this match was that approximately 15 percent of the names removed from the VRD were improperly removed. See Gregory Palast, "The Wrong Way to Fix the Vote," The Washington Post, Sunday, June 10, 2001, Outlook section, p. 1, available at http://www.legitgov.org/palast_wrong_way_fix_vote.html. To remediate the issues raised in this case, ChoicePoint (the firm responsible for conducting the match) agreed to a very detailed set of criteria described in Box B.1.
[6] A description of the various practices employed by the various states in late 2005 can be found in Wendy Weiser, Justin Levitt, and Ana Munoz, Making the List: Database Matching and Verification Processes for Voter Registration, Brennan Center, New York University, 2006, available at http://www.brennancenter.org/dynamic/subpages/download_file_49479.pdf.

APPENDIX B

BOX B.1 The Detailed Nature of Match Criteria: An Illustration

As an illustration of the detail with which match criteria must be specified, consider the following criteria taken from the consent decree in National Association for the Advancement of Colored People v. Katherine Harris, Secretary of State of Florida et al. (Case No. 01-120-CIV-Gold/Simonton, United States District Court, for the Southern District of Florida).

Notice of Filing Fully Executed Copy of June 28, 2002, Choicepoint Settlement Agreement

. . . 9. The matching criteria described in Paragraph A.8 . . . [are] as follows: ChoicePoint will identify all matches on the comprehensive list resulting from the processing described in Paragraphs A.2-A.7 that do not match based on all of the following data fields:

• Validated 9 digit Social Security Number
• Non-normalized (i.e., as name appears in original source data) Last Name
• Non-normalized (i.e., as name appears in original source data) First Name
• Non-normalized (i.e., as name appears in original source data) Middle Name
• Suffix
• Race
• Gender
• Date of Birth

ChoicePoint will perform Social Security Number validation in accordance with guidelines established by the Social Security Administration. Records will be deemed to match under the criteria listed above if a middle name in one record begins with the same letter as a middle initial shown in the match record assuming all other fields listed above match. Records will be deemed not to match under the criteria listed above if they share common blank data fields among the fields listed above, except for cases in which the middle name field or suffix field is blank in both records. Records will be deemed not to match under the criteria listed above if one of the fields being compared contains data and the same field in the match record contains no data.
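To show how much detail a record-level rule of the kind quoted in Box B.1 carries once it is written down precisely, the following Python sketch encodes one reading of those criteria. It is an interpretation for illustration only, not the procedure actually used under the consent decree: the field names and the dictionary representation are hypothetical, SSN validation is assumed to have happened already, and treating a one-letter middle name (with an optional trailing period) as an initial is an assumption of this sketch.

```python
# Hypothetical field names for the eight data fields listed in Box B.1.
FIELDS = ("ssn", "last_name", "first_name", "middle_name",
          "suffix", "race", "gender", "dob")

# Fields that may be blank in BOTH records without forcing a nonmatch.
BLANK_OK_IF_BOTH = {"middle_name", "suffix"}


def middle_names_agree(a: str, b: str) -> bool:
    """Equal names agree; an initial agrees with a name starting with that letter."""
    if a == b:
        return True
    short, full = sorted((a.rstrip("."), b.rstrip(".")), key=len)
    return len(short) == 1 and full.startswith(short)


def records_match(r1: dict, r2: dict) -> bool:
    """Return True only if every listed field agrees under the box's rules."""
    for field in FIELDS:
        a, b = r1.get(field, ""), r2.get(field, "")
        if not a and not b:
            if field in BLANK_OK_IF_BOTH:
                continue
            return False      # shared blank in a required field
        if not a or not b:
            return False      # data in one record, blank in the other
        if field == "middle_name":
            if not middle_names_agree(a, b):
                return False
        elif a != b:          # non-normalized, character-for-character comparison
            return False
    return True
```

Under this reading, two records that differ only in having "T" versus "Thomas" as the middle name match, while two records that both lack a race value do not. A looser rule, such as one comparing only first and last name, would accept many more pairs and therefore produce more false positives.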
Finally, a manual review of matches is sometimes performed. That is, under some circumstances, a voter registrar will review a match (or a nonmatch) indicated by automated processes.

COMPLICATIONS IN MATCHING

Apart from the issues involved in the matching criteria, a variety of data issues also complicate matching. Data quality (addressed in more detail in Appendix C) is impaired by many different sources of error, including illegible handwriting, incomplete or lost forms, and keypunching errors.

Another problem occurs because certain names are quite common. For example, it is known that the name "John Smith" occurs between 30,000 and 60,000 times in national lists. This means that there are between 1.5 and 3.0 John Smiths for each date of birth. Assuming there are 500 individuals named John Smith in a given state, then a certain (low) proportion of them will have the same date of birth. With certain other commonly occurring names, some chance agreements on dates of birth would be expected as well.[7] This point suggests that more accurate record-level matching will take into account the possibility of chance agreement on date of birth for certain commonly occurring combinations of first and last name, which will in turn require knowledge of the most common names in any given state. Such information can easily be computed from either state-held databases (such as the department of motor vehicles (DMV) or voter registration databases, whichever is of higher quality as indicated by fewer typographical errors, more current entries, and so on) or commercially available databases (such as credit header records[8]).

Matches involving common names may require additional processing (perhaps manual) and involve the use of additional information not contained in databases. For instance, a prior address may confirm a match on a name when date of birth is missing.
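The chance-agreement point made above can be checked numerically. The Python sketch below is a back-of-the-envelope model, not an estimate from real data: it assumes birth dates are spread uniformly over 365 × 70 possible dates (the rough figure used in footnote 4) and takes the 500 same-named individuals from the example in the text. The function names are hypothetical.

```python
def prob_shared_birth_date(n_people: int, n_dates: int) -> float:
    """Probability that at least two of n_people share a birth date,
    assuming dates are independent and uniform over n_dates values."""
    p_all_distinct = 1.0
    for i in range(n_people):
        p_all_distinct *= max(0.0, 1 - i / n_dates)
    return 1 - p_all_distinct


def expected_sharing_pairs(n_people: int, n_dates: int) -> float:
    """Expected number of pairs of people with the same birth date."""
    return n_people * (n_people - 1) / 2 / n_dates


N_DATES = 365 * 70       # assumed number of distinct birth dates
N_SAME_NAME = 500        # John Smiths in one state, as in the text

print(prob_shared_birth_date(N_SAME_NAME, N_DATES))
print(expected_sharing_pairs(N_SAME_NAME, N_DATES))
```

Under these simplifying assumptions, any one pair is very unlikely to share a date of birth, yet the group as a whole is expected to contain several such pairs. Real birth dates are not uniform, so the true figure would differ.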
An e-mail address, phone number, or other corroborating information may confirm a match when there is a typographical error in any of the first name, last name, or date of birth. At the same time, using other fields may entail other complications. For example, addresses may be represented differently in different databases: in one database, "123 Main Street" represents an address, whereas in another database, addresses are represented in three fields (house number ("123"), street name ("Main"), and suffix ("Street")). Address standardization is often required to fix this problem.

Finally, the above technically oriented comments presume that the databases to be matched against the VRD are in fact available. But in the real world of state voter registration databases, fragmented state control over state social service agencies and departments of motor vehicles, and state/county tensions regarding authority over voter registration, the politics of database availability are at least as challenging as the technology for matching. Achieving integration or interoperability of the information systems of election officials and of other state and/or local agencies may be deeply problematic if strong political leadership is not available to demand cooperation. Database-providing agencies not under the authority of state election officials (whether state or county) may well give low priority to meeting the election needs of the state, resulting in difficulties for state election officials in gaining access without undue delay or difficulty. For example, a database-providing agency may demand that election officials provide a voter registration list in a particular format that is hard or time-consuming to generate before the agency is willing to perform a match between the two databases.
A more serious problem occurs when the database-providing agency is made responsible for matching the voter registration data against its own data: the agency may be unable to devote serious resources to doing so, or lack the inclination or skills to do the matching properly. An agency may be unmotivated to resolve or address possible interoperability problems.

[7] See, for example, Michael P. McDonald, "The True Electorate: A Cross-Validation of Voter File and Election Poll Demographics," Public Opinion Quarterly 71(4):588-602, 2007.
[8] Credit headers refer to information in the credit report such as name, address, and phone number, not the credit history portion of the report.

THE POSSIBLE IMPACT OF INADEQUATE RECORD-LEVEL MATCHING

According to the EAC report Impact of the National Voter Registration Act on Federal Elections 2005-2006,[9] there were 36,277,749 voter applications received by 45 reporting states. Among those received, there were 10,938,385 changes of address or party; 2,196,608 duplicate applications; and 1,138,955 invalid or rejected applications, resulting in a total of 17,281,234 new registrants.[10] The percentage of applications not entered into the database because they were "invalid or rejected" or "duplicate applications" was about 9 percent, a total of 3,335,563 in the 45 reporting states. For comparison purposes, Table 4b from page 50 of the EAC report indicates that 333,663 people from 34 reporting states were removed from voter registration lists due to presumed felony convictions.

Once it is known that an application is not a duplicate, and not just a change of address or party, the application needs to be verified.
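The "about 9 percent" figure quoted above follows directly from the counts given in the same paragraph. A short Python check, using only those counts (the variable names are arbitrary):

```python
# Counts quoted in the text from the EAC report (45 reporting states).
applications_received = 36_277_749
duplicate_applications = 2_196_608
invalid_or_rejected = 1_138_955

# Applications not entered into the database for either reason.
not_entered = duplicate_applications + invalid_or_rejected
share_not_entered = not_entered / applications_received

print(f"{not_entered:,} applications, {share_not_entered:.1%} of those received")
```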
Table 12, "Verification of Applications," on page 72 in the EAC report[11] shows that each state has its own unique set of criteria for verifying the applications, ranging from states like Pennsylvania, which verifies only through the DMV and the SSA, to Montana, which verifies against the DMV, the SSA, Vital Records, "Match Against Voter Registration Databases," "Tracking Returned Voter ID Cards," "Tracking Returned Disposition Notices," and "Verify Through Other Agency." According to Table 13, "Data Fields for Comparison to Identify Duplications," in the EAC report, 15 states verify using the address; 48 verify the date of birth; 38 verify the driver's license number; 46 verify the names provided by the registrant; 40 verify "Social Security number" (although surely that is just the last four digits in most cases, since according to Table 11, pages 68-69, in the EAC report, only 7 states use the full SSN); and 10 verify "other" data.

Consider two points. First, the state with the highest rate of "invalid or rejected" applications (Table 3, p. 38, in the EAC report) also reported in this survey that it verifies application information only through the DMV and the SSA (Table 12). Second, the state reporting in this survey the highest percentage of applications rejected because they were duplicates also reports in this survey that it uses only date of birth and names provided by the applicant to identify duplications (Table 13 in the EAC report).
These points do not prove a causal relationship between use of a small number of non-VRD databases or a small number of fields in verification and a high percentage of rejected applications, but presuming that the data reported are valid and accurately reported, these points raise the question of how a broader set of criteria would have changed the percentage of applications rejected.[12]

[9] Available at http://www.eac.gov/clearinghouse/docs/the-impact-of-the-national-voter-registration-act-on-federal-elections-2005-2006/attachment_download/file.
[10] The EAC report also notes that it "may also have under-reported various voter registration activities because several States were in the middle of converting their local voter registration files into a statewide system in 2005. As a result, some States indicated that their local jurisdictions stopped keeping track of various registration functions and activities because they understood the State would be compiling this information" (p. 10).
[11] In this and the next paragraph, the tables (and page numbers) referred to are in the EAC report Impact of the National Voter Registration Act on Federal Elections 2005-2006, available at http://www.eac.gov/clearinghouse/docs/the-impact-of-the-national-voter-registration-act-on-federal-elections-2005-2006/attachment_download/file.
[12] The committee recognizes that the issue of data validity is an important one. For example, states may have reported their figures using definitions or criteria that were not uniform across all reporting jurisdictions. Issues with terminology are also known to cause difficulties for survey design. Until such matters are resolved, these data can only be regarded as providing tentative indications of possible relationships.
AN IMPORTANT EXAMPLE OF MATCHING IN PRACTICE

To illustrate the issues described above, consider a record-level match based on exact matches for an individual's first and last name, the month and year of birth, and the last four digits of an SSN. This example is significant because HAVA requires the Social Security Administration to verify the name, date of birth, and the last four digits of the SSN ("the applicable information") in support of the federal voting process (usually to verify information for first-time voter applicants who do not provide a driver's license number to be checked against state DMV records), and to notify the voter registrar if the person so identified is deceased. (This requirement does not mean that the SSA mechanism is the only means through which voter information can be verified; states with other mechanisms available to them can select another method. According to the Brennan Center, 24 states in late 2005 planned to use the process described above.[13])

The requirement of using only the last four digits of the SSN increases the number of false positives, even though the absolute number of false positives is still quite low. The limitation to the use of the last four digits of the SSN reflects a balancing between a more effective matching of records and concerns about privacy.

Upon receipt of the applicable information, the SSA queries its database and returns one of five responses: no match found; one unique match, death indicator absent; one unique match, death indicator present; multiple matches found with at least one lacking a death indicator; multiple matches found but all with death indicator. As noted above, the query is based on searching for exact matches on the applicable information.
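The query logic just described (an exact match on the applicable information, followed by one of five responses) can be sketched as follows. This is a hypothetical illustration of the decision logic only; the class, field, and function names are invented for the example and do not describe the SSA's actual system or interface.

```python
from dataclasses import dataclass
from enum import Enum


class Response(Enum):
    NO_MATCH = "no match found"
    UNIQUE_ALIVE = "one unique match, death indicator absent"
    UNIQUE_DECEASED = "one unique match, death indicator present"
    MULTIPLE_SOME_ALIVE = "multiple matches, at least one lacking a death indicator"
    MULTIPLE_ALL_DECEASED = "multiple matches, all with death indicator"


@dataclass(frozen=True)
class Person:
    first_name: str
    last_name: str
    birth_month: int
    birth_year: int
    ssn_last4: str
    deceased: bool = False

    def key(self) -> tuple:
        """The applicable information, compared exactly."""
        return (self.first_name, self.last_name,
                self.birth_month, self.birth_year, self.ssn_last4)


def verify(applicant: Person, records: list) -> Response:
    """Classify an exact-match lookup into one of the five responses."""
    hits = [r for r in records if r.key() == applicant.key()]
    if not hits:
        return Response.NO_MATCH
    if len(hits) == 1:
        return Response.UNIQUE_DECEASED if hits[0].deceased else Response.UNIQUE_ALIVE
    if all(r.deceased for r in hits):
        return Response.MULTIPLE_ALL_DECEASED
    return Response.MULTIPLE_SOME_ALIVE
```

Because the comparison is exact, an applicant recorded as "Bill" will not match a stored "William"; the sketch returns `Response.NO_MATCH` in that case.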
At its November 2007 workshop, the committee heard testimony that this particular strategy for matching was developed by the SSA through the efforts of a working group involving the National Association of Secretaries of State, National Association of State Election Directors, American Association of Motor Vehicle Administrators, and five states. However, to the best of the committee's knowledge, no testing of match criteria was conducted in advance of deployment, and the error rates that such a strategy would entail were unknown at the time of deployment.

This strategy has a number of limitations that would prevent records from being matched when they should be matched. For example, the search query does not account for content equivalence of names (so that Bill and William are regarded as completely different names). Using only the first and last name causes difficulty, because the number of multiple and compound names is increasing rapidly in the population. In addition, a full legal name was not originally required to obtain an SSN, and thus many SSA records do not contain the full legal names of individuals. Changes in last name (for example, of women who change their last names through marriage) are also problematic, as someone may not report a change of last name to the SSA until it is needed to determine Social Security benefits. In addition, individuals were not required until 1972 to provide SSA proof of identity when applying for an SSN. Finally, individuals may still have been assigned SSNs even if their applications did not contain birth date information.

Data provided by the SSA to the committee's second workshop in November 2007 indicate that 55 percent of queries result in at least one match being indicated; queries using the full SSN result in a match rate of about 88 percent.
The cost per query is less than one cent ($0.0062), which is low enough to allow election officials to vary the queries themselves in the event that a nonmatch response is received (for example, querying on "Bill" if "William" did not return a match).

[13] Wendy Weiser, Justin Levitt, and Ana Munoz, Making the List: Database Matching and Verification Processes for Voter Registration, Brennan Center, New York University, 2006, available at http://www.brennancenter.org/dynamic/subpages/download_file_49479.pdf.
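The idea of varying a query after a nonmatch, and what that costs at the quoted per-query price, can be sketched as below. This is an illustration only: the variant table is a hypothetical one-entry example, and `is_match` is a stand-in callable for whatever verification query an election office actually issues.

```python
COST_PER_QUERY = 0.0062  # dollars per query, the figure quoted in the text

# Hypothetical table of interchangeable first names.
NAME_GROUPS = [{"william", "bill"}]


def variants(first_name: str) -> list:
    """The name as given, followed by any known variants."""
    key = first_name.lower()
    for group in NAME_GROUPS:
        if key in group:
            return [key] + sorted(group - {key})
    return [key]


def query_with_variants(first_name: str, is_match) -> tuple:
    """Try the given name, then its variants, until one matches.

    Returns (matching name or None, queries issued, total cost in dollars).
    """
    tried = 0
    for candidate in variants(first_name):
        tried += 1
        if is_match(candidate):
            return candidate, tried, tried * COST_PER_QUERY
    return None, tried, tried * COST_PER_QUERY
```

For example, if the stored record says "Bill", a registration under "William" is resolved on the second query at a total cost of a little over one cent. Relaxing the query in this way also increases the chance of a false positive, so any match found through a variant may deserve closer review.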
As an example of a matching procedure in action, consider the elements of a new voter registration application card as shown on the left below and the SSA record on the right (presume these records are, in fact, supposed to refer to the same person):

New Registration Card      SSA Record
Tom T Bowden               Taylor T Bowden
3121 Escondido Way
11/04/77                   11/04/77
SSN 000001087              SSN 000001087

In this case, the SSA would return a response of "no match found." However, if the voter registrar could determine that either Tom has a middle name of Taylor or Taylor has a middle name of Tom or Thomas, then this registrar could associate these records with some degree of confidence if he or she concluded that the first and middle names have been transposed. But in the absence of other information, the registrar has no way to make such a determination.

States vary in their treatment of what happens in the event that an applicant's information cannot be matched against the SSA or DMV databases. In some cases, a state may grant the applicant a conditional registration that requires the voter to present an ID at the polls before voting (indeed, in some states, all first-time voters are required to present an ID at the polls, regardless of whether a match is found); others may provide a provisional ballot to the voter on election day. At the time of this writing, a Washington state law that requires a nonmatch to result in an applicant not being registered is being challenged.[14]

IMPROVING RECORD-LEVEL MATCHING

In general, three approaches can be used to improve record-level matching: allowing more data (that is, using more data fields or more complete data fields in performing the match), improving the quality of the data contained in the relevant databases (including the use of tertiary/external data), and introducing systematic field-level matching algorithms to augment certain locally developed matching techniques.
The first approach often runs afoul of privacy concerns, and it requires policy makers to be willing to make a tradeoff between less privacy and better record-level matching. In this case, experiments with using more data fields or more complete data fields are necessary to determine the incremental benefit in record-level matching (for example, adding another field or using the last six digits of the SSN instead of only the last four). The second approach, improving data quality, is addressed in more detail in Appendix C.

[14] See Washington Association of Churches v. Reed, No. C06-0726RSM, 2006 WL 4604854, available at http://projectvote.org/fileadmin/ProjectVote/Legal_Documents/WAC__PI_Decision.pdf.

For purposes of this report, "ad hoc matching" is used to mean matching developed on the basis of intuitive reasoning that is not further validated systematically or analyzed with mathematical rigor. By contrast, systematic matching is based on a formal mathematical approach that develops metrics to measure match efficacy. With metrics in hand, policy makers can set scales for three relevant areas: what determines a match, what determines a nonmatch, and what is indeterminate. Implementation of systematic techniques for matching can use some or all of the following elements:
• Use of modern matching techniques (also known in the statistical literature as techniques for record linkage). For example, a model introduced by Fellegi and Sunter[15] formalizes ideas of Howard Newcombe based on likelihood ratios in which it becomes somewhat easier to estimate record linkage parameters (even without training data). Training data is a large representative "truth" set of truly matching and nonmatching pairs of records. In the Fellegi-Sunter model each pair is given a score (or weight). The higher the score, the more likely a pair is to be a match.

• Use of preprocessing to standardize data elements. Preprocessing involves breaking fields into components and standardizing components, and a common preprocessing application is the use of address standardization software in which a house-number-and-street-name type of address may be broken into house number, street name, direction words (such as East, Southwest, and so on), and street type (Drive, Avenue) that are given standard spellings or abbreviations. Other methods can facilitate use of name information.[16] Although some of the methods described in this appendix are a good starting point, individual states may need to have specific methods for the types of idiosyncrasies and errors relevant to their individual needs.

• Accounting for the relative frequency of occurrence of values of strings such as first and last names. A relatively rare name such as "Zabrinsky" may have more distinguishing power than a common name such as "Smith." The primary purpose of the frequency-based (or value-specific) matching is to downweight pairs having the more commonly occurring values of strings. If one has a large file representing an entire state, then one can compute the frequency-based scores associated with different strings by comparing the entire file against itself. The entire file becomes the surrogate training data.
These ideas were introduced by Newcombe and extended by Fellegi and Sunter[17] and by Winkler[18] (Box B.2) in demonstrating how to implement frequency-based matching. In production matching software for the Decennial Censuses (1990 and beyond), Winkler had methods that automatically created the frequency-based weights. The distinguishing power of a particular name may vary considerably by geography. In Minnesota, for example, names such as "Garcia" and "Martinez" were relatively rarer and given more distinguishing power; in California the names are much more common and given less distinguishing power.

• Accounting for minor typographical error (such as "Smith" versus "Smoth") and having an automatic mechanism for downweighting the matching scores for pairs of strings that do not agree exactly. Winkler[19] provided such a mechanism (Box B.3), which yields significantly improved matching results in comparison to exact character-by-character matching and often outperforms ad hoc methods of "fuzzy matching." The Jaro-Winkler string comparator is widely used by computer scientists. It is a fast alternative to "edit distance" (which measures the minimum number of insertions, deletions, and substitutions needed to get from one string to another) and has been extensively vetted using data that are highly similar to DMV and VRD data.

[15] Ivan P. Fellegi and Alan B. Sunter, "A Theory for Record Linkage," Journal of the American Statistical Association 64(328):1183-1210, December 1969.
[16] See William E. Winkler, "Business Name Parsing and Standardization Software," unpublished report, Statistical Research Division, U.S. Bureau of the Census, Washington, D.C., 1993; and William E. Winkler, "Advanced Methods for Record Linkage," Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 467-472, 1994.
[17] Ivan P. Fellegi and Alan B. Sunter, "A Theory for Record Linkage," Journal of the American Statistical Association 64(328):1183-1210, December 1969.
[18] William E. Winkler, "Frequency-based Matching in the Fellegi-Sunter Model of Record Linkage," Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 778-783, 1989.
[19] William E. Winkler, "String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage," Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 354-359, 1990.

• Estimation of optimal matching parameters (probabilities in the Fellegi-Sunter model) for classifying pairs as matches or nonmatches. The probabilities can be computed by comparing an entire state file against itself, using a simple unsupervised learning method such as a properly applied expectation-maximization algorithm,[20] or an alternative method.[21] The optimal parameters have the effect of better separating matches from nonmatches. Although this improves matching, it does not yield estimates of error rates.

• Providing methods for estimating false match rates. Estimates of false match rates vary according to the matching scores (or weights). A certain false match rate will be associated with the designation of all pairs above a value U1 as matches. If all pairs above a value U2 are designated as matches where U2 > U1, then the typical result is a lower false match rate and fewer pairs designated as matches. Belin and Rubin[22] and Winkler[23] have given unsupervised learning methods for estimating false match rates in situations for which there are no training data.

• Providing methods for estimating false nonmatch rates. Estimates of false nonmatches may partially be accomplished via methods of Winkler,[24] although these techniques may need to be modified if they are to be used on state DMV and VRD files.

• Use of heuristic search strategies to speed up the matching process when necessary. Although most changes to VRDs are incremental, an operation involving entire database-to-database comparisons may sometimes be necessary. If two databases each have 5 million records, the number of possible pairs that must be compared is 25 × 10^12, a number that is much too large to search with most computer systems available to states.
Heuristic strategies may be needed to reduce significantly the number of pairs that must be compared if the databases involved are large.
• Use of name rooting equivalency tables that automatically generate common variants of a given name (for example, Bill, Billy, and Will for William). Such tables greatly reduce the need for multiple manual queries using name variants. Implementation of name rooting at the SSA would benefit all states that verify voter registration information using the SSA. Notably, name rooting could be used as a component of any intrastate query mechanism as well.
20 William E. Winkler, “Using the EM Algorithm for Weight Computation in the Fellegi-Sunter Model of Record Linkage,” Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 667-671, 1988.
21 William E. Winkler, “String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage,” Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 354-359, 1990.
22 Thomas R. Belin and Donald B. Rubin, “A Method for Calibrating False-Match Rates in Record Linkage,” Journal of the American Statistical Association 90(430):694-707, 1995.
23 William E. Winkler, “Automatically Estimating Record Linkage False Match Rates,” Proceedings of the Section on Survey Research Methods, American Statistical Association, CD-ROM. Also available at http://www.census.gov/srd/papers/pdf/rrs2007-05.pdf.
24 William E. Winkler, “Matching and Record Linkage,” pp. 355-384 in Business Survey Methods, Brenda G. Cox et al. (eds.), Wiley, New York, 1995; William E. Winkler, “Approximate String Comparator Search Strategies for Very Large Administrative Lists,” Proceedings of the Section on Survey Research Methods, American Statistical Association, 2004.
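The name rooting idea in the last item of the list above can be sketched as a simple lookup table. This is a minimal illustration only: the table contents, the names NAME_ROOTS, root_name, and same_root, and the choice to lowercase before lookup are all assumptions made for this example and do not describe any system used by the SSA or a state.

```python
# Hypothetical name-rooting table: each known variant maps to one root form.
# A production table would be far larger; these entries follow the example
# in the text (Bill, Billy, and Will for William).
NAME_ROOTS = {
    "bill": "william",
    "billy": "william",
    "will": "william",
}


def root_name(first_name: str) -> str:
    """Return the root form of a first name, or the cleaned name if unknown."""
    key = first_name.strip().lower()
    return NAME_ROOTS.get(key, key)


def same_root(name_a: str, name_b: str) -> bool:
    """True when two first names reduce to the same root form."""
    return root_name(name_a) == root_name(name_b)


if __name__ == "__main__":
    print(same_root("Bill", "William"))
    print(same_root("Bill", "Robert"))
```

Reducing both the query name and the stored name to a root before comparison means a single query can cover the common variants, rather than one manual query per variant.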
STATE VOTER REGISTRATION DATABASES: IMMEDIATE ACTIONS AND FUTURE IMPROVEMENTS
BOX B.2 Accounting for Commonly Occurring Names
The earliest computerized record linkage methods1 do effectively account for the commonly occurring name plus “chance” date-of-birth phenomenon. Newcombe’s matching classification rule was to use the fields in pairs of records to compute a matching score. The idea was that agreement on individual fields was more likely to occur among “truly matching” pairs. Pairs above a certain upper bound were designated as matches; pairs below a certain lower bound were designated as nonmatches; and pairs with in-between scores were held for clerical review (when auxiliary information might be used to fill in missing information or “correct” contradictory information). If the upper bound is raised, then the false positive (false match) rate decreases. If the lower bound is decreased, then the false negative (false nonmatch) rate decreases. The frequencies (probabilities) used in computing the scores can be estimated a priori using the frequencies in the large administrative lists, recognizing that matters such as “the list of most common names” will change slowly over time (which requires periodic adjustment of that set and the probabilities that those names will occur). Efficiently computed frequencies (conditional probabilities) are optimal in the sense that they can minimize the size of the clerical review region. Further, in many situations such as with voter registration databases or department of motor vehicle files, it is possible to estimate or give reasonable approximations of the error rates even without training data.2 The earliest matching parameter and error-rate estimation procedures are the easiest to implement and most likely appropriate for VRD files.
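The scoring and three-way classification rule described above can be sketched as follows. This is an illustration under stated assumptions, not Newcombe's exact rule: the log-ratio weights follow the general style of the Fellegi-Sunter model, and every number here (the M_PROB values, DEFAULT_U, the name frequencies, and the lower and upper bounds) is hypothetical.

```python
import math

# Hypothetical m-probabilities: chance that a field agrees when two records
# truly belong to the same person.
M_PROB = {"first_name": 0.95, "last_name": 0.95, "dob": 0.97}

# Hypothetical chance-agreement probability, used when a field value has no
# entry in the frequency table.
DEFAULT_U = 0.001


def field_weight(field, agrees, value_freq=None):
    """Log-ratio weight for one field comparison.

    On agreement, a frequent value (large value_freq) earns a smaller weight,
    which is the downward adjustment for common names.
    """
    m = M_PROB[field]
    if agrees:
        u = value_freq if value_freq is not None else DEFAULT_U
        return math.log2(m / u)
    return math.log2((1.0 - m) / (1.0 - DEFAULT_U))


def match_score(rec_a, rec_b, freq):
    """Sum the field weights over all compared fields."""
    score = 0.0
    for field in M_PROB:
        agrees = rec_a[field] == rec_b[field]
        value_freq = freq.get(field, {}).get(rec_a[field])
        score += field_weight(field, agrees, value_freq)
    return score


def classify(score, lower, upper):
    """Three-way rule: match, nonmatch, or hold for clerical review."""
    if score > upper:
        return "match"
    if score < lower:
        return "nonmatch"
    return "review"
```

With a frequency table in which a common name has a much higher relative frequency than a rare one, a pair that agrees on a common name and date of birth scores lower than a pair that agrees on a rare name and date of birth, so the common-name pair is more likely to fall between the two bounds and be held for review.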
The most general version of the parameter estimation procedures3 generalizes the iterative scaling procedures of Della Pietra et al.4 The frequency-based methods5 automatically adjust match scores downward for the most frequently occurring first and last names. The effect of the downward adjustment is that pairs of records that are associated with commonly occurring names such as “James Smith” fall into an indeterminate region in which additional information (possibly via clerical review and callbacks) is required to determine matching status. In many situations, it is straightforward to obtain the extra matching information for the indeterminate pairs. Most other (much less commonly occurring) names can be matched effectively because the false positive rate is much less than 0.004 percent when using the combination of name, date of birth, and last four digits of the SSN (that is, together these fields typically identify a person uniquely). If the state VRD files can be examined a priori, then for each common first-name-last-name combination, we can find the most frequent dates of birth and lower the matching score of the associated pairs of records. We first lower the matching score for the common name combination and then again for the common dates of birth. To match the pairs with the lowered matching scores, we would need additional corroborating information such as telephone number or middle initial. If driver’s license number or the last four digits of the SSN are available, then we can use the string comparators to check whether the pairs of corresponding numbers are almost the same. The corroborating information might vary somewhat in differing states. In particular, some states request e-mail address. In this situation, it is possible to repeat analogous procedures to raise the worst-case false positive probabilities for certain specific name-date-of-birth combinations while significantly reducing the false match probabilities associated with the same name but different dates-of-birth combinations. This approach has the effect of significantly increasing the number of pairs of records for which match status can effectively be computed.
1 Howard B. Newcombe et al., “Automatic Linkage of Vital Records,” Science 130(3381):954-959, October 1959; Howard B. Newcombe and James M. Kennedy, “Record Linkage: Making Maximum Use of the Discriminating Power of Identifying Information,” Communications of the Association for Computing Machinery 5(11):563-566, November 1962.
2 William E. Winkler, “Comparative Analysis of Record Linkage Decision Rules,” Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 829-834, 1992; William E. Winkler, “Improved Decision Rules in the Fellegi-Sunter Model of Record Linkage,” Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 274-279, 1993; William E. Winkler, “Automatically Estimating Record Linkage False Match Rates,” Proceedings of the Section on Survey Research Methods, American Statistical Association, CD-ROM, 2006, also at http://www.census.gov/srd/papers/pdf/rrs2007-05.pdf; Thomas R. Belin and Donald B. Rubin, “A Method for Calibrating False-Match Rates in Record Linkage,” Journal of the American Statistical Association 90(430):694-707, 1995.
3 William E. Winkler, “On Dykstra’s Iterative Fitting Procedure,” The Annals of Probability 18(1):1410-1415, July 1990; William E. Winkler, “Improved Decision Rules in the Fellegi-Sunter Model of Record Linkage,” Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 274-279, 1993.
4 Stephen Della Pietra et al., “Inducing Features of Random Fields,” IEEE Transactions on Pattern Analysis and Machine Intelligence 19(4):380-393, April 1997.
5 Howard B. Newcombe et al., “Automatic Linkage of Vital Records,” Science 130(3381):954-959, October 1959; Howard B. Newcombe and James M. Kennedy, “Record Linkage: Making Maximum Use of the Discriminating Power of Identifying Information,” Communications of the Association for Computing Machinery 5(11):563-566, November 1962.

MATCHING RECORDS WITH UNIQUE IDENTIFIERS
Many of the difficulties described above can be reduced or eliminated through the use of a unique identifier (UID) for every voter, such as a driver’s license number. If every voter has a single UID, records for a voter can be matched more simply. In practice, even UIDs are sometimes improperly keyed in transcribing from a handwritten application or improperly recorded on the application (for example, because digits were transposed or one digit is illegible). If there is an error in the UID, a search could be performed using the name and the date of birth to find all possible UIDs associated with those names and dates and then to identify the UID that is most similar to the one recorded in error; that UID would likely be the “correct” UID for the person in question.
A more general strategy would be needed when there is a possibility of typographical error in every field. The matching strategy is to search the entire file and apply suitable proximity metrics that indicate that the UID, first name, last name, and date of birth are sufficiently close to the query record. The feasibility of this strategy depends on the frequency with which invalid UIDs are encountered, because it is not practical to sequentially read every record in the database and perform substantial computation on every record in the file for every query. The most general strategy involves substantial restructuring of the database to facilitate fast searches.
Keys such as the first character of the first name plus last name plus date of birth, telephone number, or house number plus street name are defined and added to the database to allow fast searches. Using all appropriate fields, only records with proximity scores sufficiently close to the query record are retrieved for review. Defining the keys and choosing the order in which they are applied require experience and skill.
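The key-based retrieval described above can be sketched with an in-memory index. This is a minimal illustration, assuming records are dictionaries with hypothetical field names (first_name, last_name, dob, phone, house_number, street); a real VRD would define such keys as database indexes, and the key definitions below simply follow the three examples given in the text.

```python
from collections import defaultdict


def blocking_keys(rec):
    """Build the search keys for one record; missing fields yield no key."""
    keys = []
    if rec.get("first_name") and rec.get("last_name") and rec.get("dob"):
        # First character of first name plus last name plus date of birth.
        keys.append(("name_dob", rec["first_name"][0].lower(),
                     rec["last_name"].lower(), rec["dob"]))
    if rec.get("phone"):
        keys.append(("phone", rec["phone"]))
    if rec.get("house_number") and rec.get("street"):
        # House number plus street name.
        keys.append(("address", rec["house_number"], rec["street"].lower()))
    return keys


def build_index(records):
    """Map every key to the positions of the records that carry it."""
    index = defaultdict(set)
    for pos, rec in enumerate(records):
        for key in blocking_keys(rec):
            index[key].add(pos)
    return index


def candidates(query, index):
    """Positions of records sharing at least one key with the query."""
    found = set()
    for key in blocking_keys(query):
        found |= index.get(key, set())
    return sorted(found)
```

Because a record is retrieved when any one key agrees, a typographical error in the name does not hide a record that still agrees on telephone number or address. Only the retrieved candidates then need the more expensive proximity scoring.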
BOX B.3 Blocking and String Comparators
The two methods for dealing with minor typographical variation are blocking and string comparators. The idea of blocking is to search on given characteristics and use the remaining information to compute matching scores. For instance, a search might be performed on first initials “J” and “S” and year of birth to retrieve records for which all remaining information is considered to compute a matching score against a record in another database for John Smith. A string comparator allows computation of a value for partial agreement for two strings. For instance, a comparison of “John” with “John” might yield a value of 1.0; a comparison of “Johm” with “John” might yield 0.90; and a comparison of “Smith” with “Smeth” might yield 0.94. The overall matching score can be reduced from the score associated with exact character-by-character agreements on individual fields to account for the partial agreements. Widely used string comparators are edit distance and the Jaro-Winkler string comparator.1 Code for both methods is widely available on the Internet. Independent verification has consistently shown that the Jaro-Winkler comparator is 10 times as fast as edit distance and returns equally high-quality results with administrative lists of the types that are similar to voter registration databases or department of motor vehicle files. Other technical approaches to blocking and string comparators can be found in Cohen, Ravikumar, and Fienberg.2
1 William E. Winkler, “String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage,” Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 354-359, 1990; William E. Winkler, “Overview of Record Linkage and Current Research Directions,” Statistical Research Division, U.S. Bureau of the Census, Washington, D.C., 2006, available at http://www.census.gov/srd/papers/pdf/rrs2006-02.pdf.
2 William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg, “A Comparison of String Metrics for Matching Names and Addresses,” pp. 73-78 in Proceedings of the Workshop on Information Integration on the Web, International Joint Conference on Artificial Intelligence, Acapulco, Mexico, August 2003; William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg, “A Comparison of String Distance Metrics for Name-Matching Tasks,” Proceedings of the ACM Workshop on Data Cleaning, Record Linkage and Object Identification, Washington D.C., August 2003.
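As an illustration of a string comparator, the sketch below computes edit distance (the minimum number of insertions, deletions, and substitutions) and rescales it to a partial-agreement value between 0 and 1. The rescaling by the longer string's length and the function names are assumptions made for this example. The values it produces are not the illustrative figures quoted in the box, which need not come from this comparator, and the Jaro-Winkler comparator is not implemented here.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string a into string b (dynamic programming)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete ch_a
                            curr[j - 1] + 1,      # insert ch_b
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Partial-agreement value in [0, 1]; 1.0 means identical strings
    (case-insensitive). The normalization is an illustrative choice."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


if __name__ == "__main__":
    for left, right in [("John", "John"), ("Johm", "John"), ("Smith", "Smeth")]:
        print(left, right, similarity(left, right))
```

A partial-agreement value of this kind can replace the all-or-nothing agreement test on a field, so that a single keying error lowers the field's contribution to the matching score rather than eliminating it.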