provide or a lack of recall about what he or she entered on a previous occasion. In other cases, the information requested may have changed (names sometimes change upon marriage, for example).

  • Data entry errors. Typographical errors occur when one key is hit in place of another. Transposition errors swap two letters within a field, or even two entire fields. Even with carefully handwritten registration forms, the transcription/keying error rate may approach or exceed 5 percent in fields such as first name, last name, and date of birth if the data entry clerks lack adequate training and monitoring.3

  • Systematic errors stemming from different data representation conventions. Among the most important are those associated with dates and names.

    • In many countries (including most of Europe), 01/03/2007 means March 1, 2007, whereas in the United States it means January 3, 2007. A naturalized U.S. citizen is perhaps more likely to make such a mistake than an individual raised in the United States.

    • In many Asian nations, the family name is always stated first. Kim Jong-il is a Korean name; the family name is Kim, and the given name is Jong-il. However, it would be easy for an American to mistake Kim for a first name, perhaps as an abbreviation for Kimberly, and Jong-il for a last name.

    • Names normally rendered in an alphabet other than a Roman alphabet may well be spelled inconsistently when transcribed into a Roman alphabet. This problem is of particular concern to those of Russian, Asian, Israeli, and Arabic descent.

    • Hispanic naming conventions are complex and very difficult to fit into a conventional “first name, middle name, last name” structure. The complexities include:4

      • Name changes for women related to marriage (Apellido de Casada) and/or widowhood (viuda de, v. de);

      • Incomplete collection of all surnames, due to bearer preference or to data collection constraints;

      • Inconsistent white-space placement, causing merger of phrasal prefixes (DE LA, DELA) and/or merger of prefix and surname stem (DE LA FUENTE, DELAFUENTE);

      • Use of initials in surnames, especially for high-frequency matronymic elements (RODRIGUEZ DE G.);

      • Use of familiar/nickname forms of given names (FRANCISCO-PACO);

      • Use of orthographic shortened forms of given names (FRANCISCO-FCO, MARIA-MA);

      • Presence of a surname from a non-Hispanic culture in an otherwise Hispanic name which continues to follow Hispanic nomenclature patterns.
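Several of the name variations listed above (white-space placement, accents, and shortened or familiar given names) can be reduced to a common comparison key. The following Python sketch is illustrative only: the function names are hypothetical, and the variant table contains just the examples given in the list above, where a real system would need a far larger, curated table. Keys that agree suggest candidate matches for human review; they do not establish that two records refer to the same person.

```python
import re
import unicodedata

# Illustrative variant table built only from the examples in the text;
# a production table would be far larger (assumption, not from the source).
GIVEN_NAME_VARIANTS = {"PACO": "FRANCISCO", "FCO": "FRANCISCO", "MA": "MARIA"}

def letters_key(name: str) -> str:
    """Return an uppercase, letters-only key that ignores accents and white space."""
    # Decompose accented letters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Remove spaces, periods, and hyphens so "DE LA FUENTE" equals "DELAFUENTE".
    return re.sub(r"[^A-Z]", "", unaccented.upper())

def given_name_key(name: str) -> str:
    """Map familiar or shortened given names onto one canonical form."""
    key = letters_key(name)
    return GIVEN_NAME_VARIANTS.get(key, key)

if __name__ == "__main__":
    print(letters_key("DE LA FUENTE") == letters_key("DELAFUENTE"))
    print(given_name_key("Paco"), given_name_key("FCO"), given_name_key("María"))
```

Collapsing names this way is deliberately lossy: it also erases distinctions that matter (for example, ñ versus n), which is one reason such keys should narrow the search rather than decide a match.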

These factors generate a wide range of errors. Table C.1 summarizes a variety of error types that may also exist in name fields; Table C.2 describes some possible errors in date-of-birth fields. Voter registrars are left with the problem of managing an environment in which such errors are common.
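The date ambiguity noted above can be made concrete with a short Python sketch. It assumes dates arrive as text in a two-number/two-number/four-digit-year form; the function name `possible_dates` is hypothetical and does not come from any registration system. The function returns every valid calendar date the text could mean under the month-first and day-first conventions.

```python
from datetime import date

def possible_dates(text: str) -> set:
    """Return each valid date that an NN/NN/YYYY string could represent."""
    first, second, year = (int(part) for part in text.split("/"))
    candidates = set()
    # Try month/day/year (U.S.) and then day/month/year (much of Europe).
    for month, day in ((first, second), (second, first)):
        try:
            candidates.add(date(year, month, day))
        except ValueError:
            pass  # this reading is not a valid calendar date
    return candidates

if __name__ == "__main__":
    for text in ("01/03/2007", "13/03/2007"):
        readings = possible_dates(text)
        label = "ambiguous" if len(readings) > 1 else "unambiguous"
        print(text, label, sorted(readings))
```

A value such as 01/03/2007 yields two readings and could be flagged for follow-up, whereas 13/03/2007 has only one valid reading because there is no thirteenth month.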

Problems with data capture and errors in the voter registration database can have an important effect on the individuals whose data are involved. A voter may believe that he or she is properly registered when the registration has in fact been rejected because of inaccurate, incomplete, or illegible information on the form, or the voter may not know to bring to the polls on Election Day the additional identification required because of a problem with his or her form. In some cases, the voter may be entirely absent from the voter registration rolls.

3 See Joseph J. Pollock and Antonio Zamora, “Automatic Spelling Correction in Scientific and Scholarly Text,” Communications of the ACM 27(4):358-368, April 1984. In a highly controlled situation, keying error rates were in excess of 2 percent (in keystrokes). A 1-2 percent error rate in keystrokes could easily yield a 5 percent error rate in fields.
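The step from keystroke errors to field errors can be illustrated with a short calculation. This sketch assumes that keystroke errors are independent and equally likely at every position, which is a simplification introduced here and not a claim made in the cited study; the function name and the example field lengths are likewise hypothetical. Under that assumption, a field of n keystrokes with per-keystroke error rate p contains at least one error with probability 1 - (1 - p)^n.

```python
def field_error_rate(keystroke_error_rate: float, keystrokes: int) -> float:
    """Probability that a field of the given length has at least one keying error.

    Assumes independent, identically likely keystroke errors (a simplification).
    """
    return 1.0 - (1.0 - keystroke_error_rate) ** keystrokes

if __name__ == "__main__":
    # Example field lengths are illustrative, not taken from the source.
    for rate, length in ((0.01, 5), (0.02, 3), (0.02, 8)):
        print(f"p={rate:.2f}, n={length}: {field_error_rate(rate, length):.3f}")
```

With a 1 percent keystroke error rate, a five-character field already has roughly a 4.9 percent chance of containing an error, and longer fields or higher keystroke rates push that figure higher.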

4 Leonard Shaefer, Chief Scientist, IBM Global Name Recognition, personal communication to the committee, e-mail to Jeff Jonas of August 31, 2009.



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.