Unduplication in the 2000 Census
Aside from precensus editing of the MAF, the only unduplication program explicitly planned as part of the conduct of the 2000 census was the application of the Census Bureau’s primary selection algorithm (PSA). The census provided multiple chances for inclusion (among them, the regular census mailout, Internet response, nonresponse follow-up, unaddressed “Be Counted” forms, and foreign language forms), and the PSA’s function was to determine which persons and information to retain from the set of records bearing the same MAF ID number.
In all, 8,960,245 MAF IDs had more than one eligible return (representing just less than 8 percent of the IDs on the Decennial Response File, the rawest compilation of collected census data); more than 95 percent of these IDs had exactly two returns associated with them, and 55 percent of those had two enumerator returns associated with them. The exact mechanics of the PSA are confidential, so only a brief executive summary of the Bureau’s evaluation of the PSA’s performance is publicly available (Baumgardner, 2002), with additional results presented by Alberti (2003). What is known is that the algorithm groups the people on a set of records into interim PSA households, with some checking of duplicates using person matching. It is also known that the census residence rules are not used in analyzing the person records possibly associated with a household, since Baumgardner (2003:iii) comments that the “[PSA] itself cannot take those rules into account when making decisions.”
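Because the PSA itself is confidential, the following Python sketch illustrates only the preliminary step described above: grouping returns by MAF ID and isolating the IDs with more than one eligible return. The record layout, the field names (`maf_id`, `source`), and the sample values are hypothetical and are not drawn from any Bureau file; the sketch makes no attempt to reproduce how the PSA chooses among returns.

```python
from collections import defaultdict

def group_returns_by_maf_id(returns):
    """Group return records by their MAF ID (hypothetical 'maf_id' field)."""
    groups = defaultdict(list)
    for ret in returns:
        groups[ret["maf_id"]].append(ret)
    return dict(groups)

def multi_return_ids(groups):
    """Keep only the MAF IDs that have more than one return."""
    return {maf_id: rets for maf_id, rets in groups.items() if len(rets) > 1}

# Hypothetical toy data: three returns for two addresses.
toy_returns = [
    {"maf_id": "A001", "source": "mailout"},
    {"maf_id": "A001", "source": "enumerator"},
    {"maf_id": "B002", "source": "mailout"},
]

toy_groups = group_returns_by_maf_id(toy_returns)
toy_multi = multi_return_ids(toy_groups)
```

In this toy input only the address `A001` would be passed on for resolution, since it is the only ID carrying two returns.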
The Bureau carried out an ad hoc operation to identify duplicate housing units in June 2000. Internal evaluations from the first few months of the year compared the count of housing units on the MAF to estimates generated by using building permits and other sources; those analyses suggested sizable duplication in the MAF records. The operation flagged 2.4 million housing units (containing 6 million person records) as potential duplicates; these were temporarily removed from the census file. After further review, 1 million housing units (2.4 million people) were reinstated to the census file, and the rest (roughly 1.4 million housing units and 3.6 million person records) were permanently deleted.
Estimation of erroneous enumerations, including duplicate records, was a major focus of the Bureau’s work in the postcensus Accuracy and Coverage Evaluation (A.C.E.) Program. Bureau staff performed person-record matches based on name and birthdate in two waves. The Person Duplication Studies (summer 2001) matched the A.C.E. samples (two samples of approximately 700,000 records each, one of which is a direct extract from the census for selected blocks) to the nationwide census results. The Further Study of Person Duplication (summer 2002) performed the same level of matching, but with revised methodology.
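The general idea of matching person records on name and birthdate can be illustrated with a minimal Python sketch. It uses exact matching on a normalized name plus birthdate, which is an assumption made for simplicity; the Bureau's actual matching methodology is not described here and is surely more elaborate. The field names (`record_id`, `name`, `birthdate`) and the sample records are hypothetical.

```python
from collections import defaultdict

def match_key(person):
    """Build a simple key: lowercased, whitespace-collapsed name plus birthdate."""
    name = " ".join(person["name"].lower().split())
    return (name, person["birthdate"])

def find_candidate_duplicates(sample, census):
    """Map each sample record ID to the IDs of other census records sharing its key."""
    index = defaultdict(list)
    for rec in census:
        index[match_key(rec)].append(rec["record_id"])
    matches = {}
    for rec in sample:
        hits = [rid for rid in index.get(match_key(rec), [])
                if rid != rec["record_id"]]  # ignore the record matching itself
        if hits:
            matches[rec["record_id"]] = hits
    return matches

# Hypothetical toy data: records 1 and 2 differ only in capitalization and spacing.
toy_census = [
    {"record_id": 1, "name": "Ann Lee", "birthdate": "1960-02-03"},
    {"record_id": 2, "name": "ann  lee", "birthdate": "1960-02-03"},
    {"record_id": 3, "name": "Bo Ray", "birthdate": "1971-07-09"},
]
toy_sample = [toy_census[0], toy_census[2]]

toy_matches = find_candidate_duplicates(toy_sample, toy_census)
```

Passing the same list as both `sample` and `census` matches a file to itself. Matches found this way are only candidates: two distinct people can share a name and birthdate, so any flagged pair would still need review.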
Subsequently, Bureau researchers have extended the scope and methodology of the work, matching the entire census person-level file to itself to identify potential duplicates. This work has raised the possibility of incorporating real-time unduplication into the census process in 2010, performing the same type of nationwide matching for batches of records to identify candidates for field follow-up. The 2006 and 2008 operational tests are intended, in part, to help resolve some remaining questions about the operation, such as the ideal timing of the operation and the sequencing of a coverage follow-up interview process (meant to consolidate multiple operations from 2000, as well as provide input to unduplication) with the coverage evaluation interviews.