Preparing for the 2000 Census: Interim Report II

7
Administrative Records: Looking to the Future

Despite formidable obstacles to their use, administrative records offer a tantalizing potential for reducing the costs and respondent burden associated with many of the operations involved in conducting a decennial census. Some have envisioned the possibility of an "administrative records census," perhaps as soon as 2010 (Edmonston and Schultze, 1995).

In "The Plan for Census 2000," the Census Bureau (1996) announced its intention to use administrative records in the following ways:

  •   to update the Master Address File;

  •   to assist with the enumeration of special population groups (such as American Indians on reservations and in large cities, Alaskan Natives, people in group quarters, and people in remote areas);

  •   to provide indirect responses for about 5 percent of the nonresponding households prior to nonresponse follow-up;

  •   to augment the household rosters used in integrated coverage measurement interviews; and

  •   to impute missing items on the long form.

In parallel with the 2000 census activities, the Census Bureau planned to experiment with the development of a complete administrative records census in several sites. Finally, in addition to these announced uses, the Census Bureau has supported research on the use of administrative records as a component of the estimation procedure that will be used to impute census records for the 9-10 percent of households that do not return questionnaires and are not enumerated during nonresponse follow-up.

In December 1996 the Census Bureau announced that administrative records would not be used to derive the census count for nonresponding households--the third item on the list above. It cited a number of reasons for this decision, including the need for much additional research. All of the other planned applications would proceed, however, and research on the quality and coverage of administrative records would continue. Previous National Research Council panels have recommended that the Census Bureau make greater use of administrative records in preparing for and conducting the census (National Research Council, 1993; Steffey and Bradburn, 1994; Edmonston and Schultze, 1995), and we, too, endorse the Bureau's efforts in that direction.

Efforts by the Census Bureau in its administrative records research for the 2000 census have included acquiring numerous files of administrative records data from federal, state, local, and commercial sources; using these data to build an administrative records database for each of the three sites in the 1995 census test; evaluating the quality of these administrative data; and assessing the usefulness of administrative records in applications related to addressing census nonresponse and improving coverage. This chapter reviews those efforts in terms of the planned uses of administrative records for the 2000 census. The first section discusses research on the development of an administrative records database, which is a prerequisite to most of the planned uses; the second section discusses each of the prospective applications.

BUILDING AN ADMINISTRATIVE RECORDS DATABASE

Using administrative records for census activities requires that the files obtained from different national, state, and local sources be combined into a single database. The challenge in building such a database is to reduce multiple files that overlap in their coverage and content to a single record with one name, one address, and one set of demographic characteristics for each covered person. This involves considerably more than just "unduplicating" records. It requires addressing the uncertainty created by the possible presence of multiple addresses, alternative variants of names, and perhaps even different data on age, sex, race, and Hispanic origin for the same individual.

The construction of an administrative records database involves several distinct steps, which include the identification, requisition, and acquisition of files from their respective sources; the reformatting, standardization, geocoding, and unduplication of each source file; the pooling of the resulting files into a single database; and the reduction of this database to a single record per unit, whether that be housing units, people, or both. How well this is done will affect the overall quality of the database. Ideally, the database construction will improve on the accuracy of even the best source files, since no file is perfect. This outcome is not automatic, however, and it may be very difficult to achieve.

Privacy and Confidentiality

It is important to note that all information collected by the Bureau of the Census is confidential. It is protected by Title 13 of the U.S. Code, which both requires people to respond to the census and protects their privacy. No one outside the Census Bureau can see data collection instruments or use them to link individual data with names and addresses. This protection covers any and all information gathered from administrative records, as well as that collected in the field. All of the information collected for research purposes for the census is also protected. The Census Bureau may use the data only for statistical purposes and may release it only in a format that protects privacy and confidentiality.

The Census Bureau enforces Title 13 very strictly, and it is the overriding reason that agencies that do not usually share information (such as the Internal Revenue Service and the Social Security Administration) have agreed to do so for the purposes of the census. Neither the Census Bureau nor any outside group, including this panel, has ever proposed that any unique identifier--such as Social Security numbers--be collected in any U.S. census. The Census Bureau has been and remains dedicated to keeping individual data confidential and private, and we believe it has diligently enforced Title 13.

File Acquisition

In preparation for the 1995 census test, the Census Bureau requested numerous administrative files for the geographic areas covered by the three test sites. These requests were made to government agencies at the federal, state, and local levels. The mere acquisition of these files proved to be a significant undertaking. Requests for some files--most significantly, those from the Aid to Families with Dependent Children (AFDC) program--were denied. Other files that were supplied had to be recreated or documented more adequately before they could be used. One of the more promising files--the Information Returns Master File from the Internal Revenue Service (IRS)--could not be used within the time frame of the test. Nevertheless, the Census Bureau obtained and used most of the files that it requested, and it also purchased a number of commercial files (Neugebauer, 1996).

Perhaps the magnitude of the operational difficulties that would be encountered in trying to use so many files from so many diverse sources should have been anticipated. The lessons learned were valuable, nonetheless. Despite the potential coverage improvement to be gained by adding more and more files, the effort expended in obtaining and working with so many files made clear that "more data" are not necessarily better data. For the potential coverage improvement that they provide, additional files multiply the problems associated with basic processing and unduplication, and they risk creating false additions. The prospect of confronting these same difficulties many times over was indeed a daunting one.

In response to these lessons and the mixed findings from the initial ("stage I") evaluation, the Census Bureau identified a small number of files that were judged to provide nearly as much value as the entire set of files requested for the 1995 test but that would require substantially fewer resources to acquire and process.
Using only these nine administrative files in each test site, the Census Bureau conducted a second ("stage II") evaluation of the data obtained from administrative records (Wurdeman, 1996).11 The construction of this second database and the findings from the evaluation are what we address below.

11   For stage II, the subset of files obtained included: the Social Security Administration NUMIDENT file; the Internal Revenue Service Tax Year 1993 Master File; two files from the Department of Housing and Urban Development--Multi-Family Tenant Certification System (MTCS) and Tenant Rental Assistance Certification System (TRACS); food stamps; Medicare; drivers' license; school enrollment; voter registration; and parolees/probationers.

Processing of Source Files

Each of the administrative files was reformatted, as necessary, to provide person-level records in a common layout, and the data were subjected to name and address standardization, geocoding, and Social Security number verification (Wurdeman, 1996). Name and address standardization were needed to facilitate the unduplication of records, both within and across files (see below). Geocoding included attaching census geographic codes to each record and assigning a housing unit identification number, to be used in aggregating the person-level records into households. Records with Social Security numbers were matched to the Social Security Administration (SSA) NUMIDENT file to verify that the numbers had indeed been issued to the individuals named in the records.

Geocoding was limited to street addresses. A post office box could not be geocoded. A record with a post office box listed as an address was retained in the file, however, in the event that the record could be matched to another record that included the missing street address but lacked complete or correct information on other characteristics.

Social Security number verification provided an opportunity to obtain age, sex, race, and Hispanic origin data if they were not present on the source file. All of these variables are reported on the NUMIDENT file, albeit with an important limitation: Hispanic origin is not recorded separately from race but, rather, as one of the possible values of race. For people identifying themselves as Hispanic, then, the race variable must be treated as missing or assigned the value "other." It appears that the Census Bureau may have done the latter, which would contribute to the measured disagreement between the constructed database and the census data collected at the test sites.

Combining Files

The multiple files were merged and sorted by household unit identifications to create a preliminary household file. Person-level unduplication, which was done first within and then across households, reduced the database to one record per person. In the course of this unduplication, if a variable was found to be reported with two or more different values for the same person (addresses and spelling of names having the most discrepancies), only one value was selected to represent that variable.
Except when data were missing, incomplete, or could not be coded, the Census Bureau assigned the value from the source file that carried the highest priority, according to a scheme established on the basis of empirical findings (see below) and other considerations. Other than assigning a "conflict code" when discrepant values of certain variables were observed, the Census Bureau did not preserve in the final database any of the alternative values recorded in the source files. One consequence of person-level unduplication was that some reported addresses were eliminated from the database.

Quite obviously, the way in which an administrative records database is constructed can have major effects on the quality of the information it contains. For this reason, we contend that an administrative database that is built for research purposes should be designed and constructed in such a way that it can be used to evaluate alternative approaches to combining the information from the source files. While the Census Bureau's strategy of obtaining a single value for each variable is probably the only realistic approach for the 2000 census, the discrepant values represent information about uncertainty, and there may be ways to capitalize on this feature of administrative records in the future. For current analytic tasks, the file obtained by the strategy of assigning one value rather than multiple values is very useful. However, not preserving the multiple values in the final database limits its usefulness for research on improving its construction. To test another variant of the choice algorithm, or to assess the consequences of not using one or more of the source files, a researcher must create an entirely new database.

Recommendation: The Census Bureau should build an administrative database that preserves the values of important variables from more than one source file to allow flexibility in evaluating alternative matching rules.

Evaluation of the Administrative Data

The Census Bureau's evaluation of the administrative records data included separate assessments of the quality of data reported in the individual source files and in the merged database. The evaluation relied on a comparison of the administrative records data with the census test data collected at the three sites. However, differences in coverage between the census test and the administrative records assembled for the evaluation make it difficult to draw simple conclusions from these comparisons. Before reviewing the findings, we outline some of these differences and their implications.

The 1995 census test was not a complete enumeration of any of the three test sites. The Census Bureau enumerated only a sample of the households that did not self-report by mail or telephone (i.e., the nonresponse follow-up households). The housing units that were enumerated by mail or in the field or identified as vacant constituted 59 percent of the total units in Paterson, 71 percent in Oakland, and 55 percent in Louisiana.12

Other factors complicate the evaluation of administrative records even further. The requests for administrative files failed to identify four zip codes from the Oakland test site and one from Paterson. The omitted Paterson zip code contained only four housing units, but the omitted Oakland zip codes contained 8,000 units. Having no administrative records for these areas, the Census Bureau excluded them from its matching so that they would not be counted as unmatched cases.
However, the Census Bureau did not exclude the administrative records that could not be matched because the households to which they referred were not enumerated. In addition, one or more of the sites included only parts of some zip codes. It does not appear that the Census Bureau attempted to remove from the administrative records database any addresses that fell into those portions of zip codes that were not included in the test sites. We presume, however, that if an agency submitted a file that included any zip codes that were entirely outside the test sites, the Bureau would have first removed those zip codes before processing and analyzing these records.

Finally, some of the administrative files--particularly the local ones--appear to be incomplete. For example, the voter registration file in Oakland contained fewer than 8 percent as many records as the IRS file, while it was 50-75 percent as large as the IRS files at the other two sites.

12   These fractions were calculated from tabulations reported by Wurdeman (1996). To the panel's knowledge, neither in this nor in any other report has the Census Bureau reported the percentages that were ultimately enumerated or identified as vacant, so their accuracy is uncertain.

For some purposes, one would like to use the total number of households or people in the administrative records database or individual files as denominators for rates. In some cases, the Census Bureau calculated such rates. Because of the broader coverage of the administrative files, however, such calculations understate the relative frequency with which administrative records could be matched to census records. Without an accompanying explanation, these match rates may be misinterpreted as evidence of poor quality in the administrative records.

Quality of the Individual Source Files

To evaluate the quality of the short-form items on different source files, the Census Bureau matched person records from each source file to the census database in each of the test sites. This was done prior to combining the source files, and the results contributed to the development of the priority scheme used in combining the files. Person records were matched on the basis of name, age, and sex. Using only complete matches, so as not to bias the evaluation with even a small number of mismatches, the Census Bureau then compared the address, race, and Hispanic origin on the administrative files with the values collected during the census test.

The proportion of people in the administrative files who could be matched to census records is lowered not only by the restriction to complete matches on name, age, and sex, but also by the fact that the census test included only a sample enumeration of the nonresponse follow-up cases. The match rates, which were rather consistent across the three sites, ranged from 40 to 60 percent for most of the administrative source files: Medicare records had the highest match rate in every site, ranging from 53 to 61 percent; IRS records were generally second, with match rates between 40 and 49 percent.
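
The person-level matching described above can be illustrated with a minimal Python sketch. It is not the Census Bureau's procedure: the field names, the normalization rule, and the sample records are assumptions made for illustration. It treats a match as "complete" only when name, age, and sex all agree, and it uses the number of administrative records as the denominator, so people who were never enumerated lower the rate even when their administrative records are accurate.

```python
def match_key(record):
    """Build an exact-match key from name, age, and sex.

    The normalization (upper-casing, collapsing whitespace) is an
    illustrative assumption, not the Bureau's standardization rule.
    """
    name = " ".join(record["name"].upper().split())
    return (name, record["age"], record["sex"].upper())


def complete_match_rate(admin_records, census_records):
    """Return (rate, matched) where rate uses admin records as denominator."""
    census_keys = {match_key(r) for r in census_records}
    matched = [r for r in admin_records if match_key(r) in census_keys]
    rate = len(matched) / len(admin_records) if admin_records else 0.0
    return rate, matched


# Hypothetical records for one test site.
census = [
    {"name": "Ana Ruiz", "age": 34, "sex": "F"},
    {"name": "John  Smith", "age": 61, "sex": "M"},
]
admin = [
    {"name": "ANA RUIZ", "age": 34, "sex": "f"},    # complete match
    {"name": "John Smith", "age": 62, "sex": "M"},  # age differs: excluded
    {"name": "Lee Wong", "age": 45, "sex": "M"},    # person not enumerated
    {"name": "john smith", "age": 61, "sex": "M"},  # complete match
]

rate, matched = complete_match_rate(admin, census)
print(f"complete-match rate: {rate:.0%}")
```

In this toy data one of the two unmatched records fails because of a one-year age discrepancy and the other because the person was not in the census file at all; the rate alone cannot distinguish these two causes.
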
Given the exclusion of much of the nonresponse follow-up caseload, comparisons of the source files with respect to match rates may be misleading. There is ample evidence that nonresponse follow-up households are quite different from households that respond to the questionnaire or otherwise self-report. Therefore, source files with large numbers of nonresponse follow-up households or people would have artificially conservative match rates.

Among matched person records, the Census Bureau found large differences across the sites and across the source files with respect to the quality of the data on address, race, and Hispanic origin.

Address

Because of their purposes, administrative records may include addresses that represent any number of things other than residences as defined by the Census Bureau, such as post office boxes or other types of mailing addresses, business addresses, or even addresses of tax preparers or accountants. They may include misspellings and other errors or lack unit numbers, which are often not needed for mail delivery. They may correspond to where a person wants to appear to live rather than where the person actually lives. Finally, they may be out of date. The relative frequency of these problems differs by source file, as administrative agencies use addresses in different ways and have different mechanisms and schedules for updating them.

In evaluating the addresses on administrative records, one can ask the following questions about the addresses per se. Are they street addresses as opposed to post office boxes or delivery addresses? Are they residential? Do they correctly identify a street and number? Do they include unit numbers if required? If the addresses are good by all of these standards, one can then ask if they correctly identify the census day residences of the people with whom they are identified.

The Census Bureau analyzed address data for the six major source files (Medicare, IRS tax returns, food stamps, voter registration, driver's license, and school enrollment). In Paterson, the percentage of matched administrative and census people with the same basic street address (that is, they could be located within the same building although not necessarily the same housing unit) ranged between 38 and 79 percent, with school enrollment the lowest and Medicare the highest. In Oakland, the frequency of agreement on basic street address varied between 14 and 53 percent; in Louisiana, the range was only 8 to 19 percent.

The very low rates of agreement in Louisiana reflect the preponderance of rural addresses in that test site. In rural Louisiana the mailing addresses used in the administrative files--and by the postal service, for that matter--differ from the street addresses used by the census. In the other two sites, the rates of agreement indicate that there are problems with the addresses in addition to their timeliness. Nationally, about one in five housing units turns over during a 12-month period.
Except for Medicare data in Paterson, however, all of the files disagreed with the census address for substantially more than 20 percent of households.

Two records may agree on a basic street address, but if the address happens to be a multi-unit building, the two records may differ in their unit numbers, or one may not have a unit number. In Paterson, for matched people with street addresses, when addresses included unit numbers in one or the other file (census or administrative), the unit numbers agreed only 28 percent of the time between the census data and the best administrative file. At the opposite extreme, the food stamp file in Paterson appears to have no unit numbers, for there were no matches to the census unit numbers. In Oakland, the addresses in multi-unit buildings agreed on the unit number between 67 and 85 percent of the time, depending on the administrative file. In Louisiana, agreement on unit number ranged from 17 to 70 percent across the administrative files.

Clearly, there is enormous variability across files and sites in the quality of the unit numbers. In Oakland and Louisiana some of the files have good if not perfect reporting of unit numbers, but in Paterson none of the files contains very good unit identification. To explain the low match rates on unit number in Paterson, where multi-unit buildings are the dominant mode of housing (greatly compounding the problem), the Census Bureau noted that nearly two-thirds of the multi-unit buildings in Paterson contained only two to four units, while only one-third of the multi-unit buildings in Oakland were this small (Wurdeman, 1996). Such small buildings are more likely to have nonstandard unit identifications than larger buildings. In addition, residents may be less likely to report unit designations if the mail is delivered without them.
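
The two levels of address agreement discussed above, basic street address and unit number, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the Bureau's method: addresses are assumed to be already standardized into a "street, unit" form, unit agreement is assessed only among pairs that agree on the basic street address, and a pair enters the unit comparison when either file carries a unit number. The function names and sample addresses are hypothetical.

```python
def split_address(address):
    """Split 'street, unit' into (basic_street, unit or None).

    Assumes a comma separates the unit designation, if any; real
    administrative addresses would need far more standardization.
    """
    street, _, unit = address.partition(",")
    basic = " ".join(street.upper().split())
    unit = " ".join(unit.upper().split()) or None
    return basic, unit


def address_agreement(pairs):
    """pairs: (admin_address, census_address) for matched people."""
    same_street = unit_compared = unit_agree = 0
    for admin_addr, census_addr in pairs:
        a_street, a_unit = split_address(admin_addr)
        c_street, c_unit = split_address(census_addr)
        if a_street != c_street:
            continue
        same_street += 1
        # Compare units only when at least one file reports a unit number.
        if a_unit is not None or c_unit is not None:
            unit_compared += 1
            if a_unit == c_unit:
                unit_agree += 1
    return {
        "street_rate": same_street / len(pairs) if pairs else 0.0,
        "unit_rate": unit_agree / unit_compared if unit_compared else 0.0,
    }


# Hypothetical matched pairs (administrative address, census address).
pairs = [
    ("12 Main St, Apt 2", "12 Main St, Apt 2"),  # street and unit agree
    ("12 main st", "12 Main St, Apt 3"),         # unit missing on admin file
    ("40 Oak Ave", "40 Oak Ave"),                # single-unit: no unit test
    ("PO Box 9", "7 Elm Rd"),                    # mailing vs. street address
]
result = address_agreement(pairs)
print(result)
```

A missing unit number counts as a disagreement here, which mirrors the observation that a file with no unit numbers at all, such as the Paterson food stamp file, produces no unit-level matches.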

Understanding the reason for the relatively low match rates in Paterson does not provide a solution, however. What it does provide is a basis for predicting comparably low match rates between administrative addresses and census addresses in other communities with similar housing stock.

The problems encountered with administrative addresses in Louisiana attest to a different type of limitation of administrative record addresses, namely, that census-type addresses are not useful for most administrative purposes when the Postal Service uses an entirely different set of addresses for mail delivery. The only administrative databases that are likely to help with this problem are those that require identification of housing units per se. Utility companies would need such identification and would very likely maintain both property and mailing addresses--the latter for billing purposes. Whether addresses from this source are ultimately usable remains to be established.

What is clear is that national and even state files are not going to provide census-type addresses of uniform quality across all geographic areas and types of communities. This implies a need to supplement the major national and state files if the use of property addresses for census enumeration is to be maintained. We suggest that the Census Bureau conduct research to explore such supplementation--perhaps even starting with some of the databases that were acquired for the 1995 census test but excluded from the stage II evaluation.

Race and Hispanic Origin

The Census Bureau examined both the presence of a race or Hispanic origin entry and how well it compared with what was reported in the census. Medicare files were the only files on which race and Hispanic origin were virtually always present (99 percent of the time at all three sites).
For the IRS files, for which race and Hispanic origin were obtained by matching Social Security numbers to SSA records, race entries were present on 75 and 79 percent of the records in Oakland and Louisiana, respectively, but only 56 percent of the records in Paterson. The figures for Hispanic origin were comparable, except that Paterson had entries for 75 percent of the records. The other large files had race and Hispanic origin entries that ranged from 0 to 100 percent of their records, with the state and local files varying widely across the three sites.

When race was reported on both an administrative record and a matched census record, the rate of agreement generally varied between 80 and 93 percent, with the files in Louisiana being at the high end and those in Oakland at the low end. Clearly, the racial and Hispanic composition of the local population was a factor, with match rates being lower when Asians and Hispanic people were relatively numerous. This relationship was attributed to how respondents used the "other" race category in both the administrative records and the census. In addition, Social Security card applications prior to 1980 included only "white," "black," and "other" as options. To the extent that the administrative measure of race and Hispanic origin was obtained from SSA records, the agreement between census and administrative files would be expected to decline as the proportion of the population that was Hispanic or neither black nor white increased. Agreement between the census and administrative files with respect to Hispanic origin was comparable to what was observed for race, for site and source file differences as well as overall magnitudes.

Data on race and Hispanic origin are sometimes missing from SSA records. An earlier panel reported that race was missing from about 15 percent of the Social Security numbers issued between 1980 and 1991 and between 1 and 3 percent of the numbers issued earlier (Edmonston and Schultze, 1995). Race is now being obtained for infants who are issued numbers on the basis of information provided on the birth record, a mechanism that was established in the early 1990s and that now encompasses most newborns. However, deficiencies of the SSA records can account for no more than one-half of the missing race entries on IRS records in Oakland and perhaps only one-quarter of the missing entries in Paterson. Problems in matching taxpayers' Social Security numbers to SSA records must account for the balance of the missing race data, but this is inconsistent with the fact that fewer than 1 percent of the numbers that taxpayers report on returns fail IRS validation. Because of the importance that will be assigned to SSA records as a demographic supplement to other administrative records, the match rates obtained in the test sites bear further review. We believe the Census Bureau should include this on its administrative records research agenda.

Recommendation: The Census Bureau should give further review to the Social Security number match rates from the 1995 census test. Discovering the reasons for apparent failure to match Social Security Administration (SSA) records will provide valuable guidance and ensure the utility of the SSA data in the administrative records database.

Quality of the Database

The findings reported by the Census Bureau do not address the quality of the administrative records--only the extent of their agreement with census values.
To assess the comparative quality of administrative and census data would require revisiting a sample of households for which both were reported. In our view, this is not a short-term need, but something that should be considered for the evaluation of administrative records that is conducted in conjunction with the 2000 census.

In combining the multiple source files, eliminating duplicates, and selecting among alternative reports of addresses and other characteristics, the Census Bureau could obtain data that are of higher quality than those in most, if not all, of the individual source files. For example, missing unit numbers could be filled in from addresses obtained from lower priority files. Likewise, post office box numbers could be replaced by actual street addresses. But if an old or otherwise inaccurate basic street address is reported on the highest priority file, this address would be retained. In the absence of additional information, there is no mechanism for determining that an address reported on a lower priority file is more correct than an address reported on a higher priority file. Similarly, incomplete reports of race or Hispanic origin can be replaced by more complete reports found on other files, but procedures for determining the most accurate information have not been developed.
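
A priority-based combination of this kind can be sketched in Python. The sketch is hypothetical throughout: the priority order, the source names, the variable names, and the record layout are assumptions, since the report does not publish the actual scheme. It selects the value from the highest-priority file that reports one, sets a conflict flag when files disagree, and, unlike the database described above, retains the alternative values so that a different choice rule could be tested without rebuilding the database.

```python
# Hypothetical priority order; the actual scheme is not given in the text.
PRIORITY = ["medicare", "irs", "voter"]


def combine(reports, priority=PRIORITY):
    """Combine one person's reports from several source files.

    reports: {source_name: {variable: value or None}}
    Sources absent from `priority` are ignored in this sketch.
    """
    variables = sorted({var for rec in reports.values() for var in rec})
    combined = {}
    for var in variables:
        # Non-missing values, listed in priority order.
        candidates = [
            (src, reports[src].get(var))
            for src in priority
            if src in reports and reports[src].get(var) not in (None, "")
        ]
        distinct = {value for _, value in candidates}
        combined[var] = {
            "value": candidates[0][1] if candidates else None,
            "source": candidates[0][0] if candidates else None,
            "conflict": len(distinct) > 1,   # analogous to a conflict code
            "alternatives": candidates,      # preserved for later research
        }
    return combined


# Hypothetical reports for a single person.
reports = {
    "irs": {"street": "12 MAIN ST", "unit": None, "race": "white"},
    "medicare": {"street": "98 OLD RD", "unit": None, "race": "white"},
    "voter": {"street": "12 MAIN ST", "unit": "APT 2"},
}
combined = combine(reports)
for var, info in combined.items():
    print(var, info["value"], info["source"], info["conflict"])
```

In this example the missing unit number is filled in from the lowest-priority file, while the street address from the highest-priority file is retained even though the two other files agree on a different one. That is the limitation noted above: priority alone cannot show which address is correct, but the preserved alternatives at least record that a disagreement exists.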

The Census Bureau's evaluation of the merged administrative records database did not include a replication of the methodology used in the evaluation of the separate source files. Matching between the administrative database and the census file was done first at the household level and then at the person level. Match rates of administrative addresses to census addresses were calculated with the census counts as denominators. This strategy compensates for the incompleteness of the census file, but match rates that are calculated with census numbers as denominators express the coverage of the database, not the quality of the data. For Paterson, 49 percent of the self-reported (mail-back) households and 37 percent of the nonresponse follow-up enumerated households matched to addresses in the database. For Oakland, these figures were 90 percent and 83 percent, respectively. For Louisiana, the match rates were 80 percent and 73 percent, respectively. Since the highest match rate to census addresses by any single source file in Louisiana was only 19 percent (see above), these latter rates suggest that the total number of administrative addresses in Louisiana must have been several times the number of census records. The results in Paterson continue to reflect the lack of consistent unit numbers on addresses in multi-unit buildings.

Recommendation: Research aimed at developing more effective strategies for combining administrative data files should be a priority item in the Census Bureau's research agenda on administrative records. This work should include direct measurement of the quality as well as the coverage of addresses in the pooled files.

USES OF ADMINISTRATIVE RECORDS FOR 2000

With the above discussion of the development of an administrative records database as background, we review the Census Bureau's progress on seven possible uses of administrative records for the 2000 census.

Master Address File Updating

As Chapter 3 describes, the development of an up-to-date national address file through a listing operation that occurs in the year prior to the census has been replaced by the maintenance of a continuously updated Master Address File (MAF). The MAF will serve as the sampling frame for a number of Census Bureau surveys in addition to providing the accurate address list for the decennial census. To exploit an existing, high-quality source of information for updating the MAF, the Census Bureau has developed a partnership with the U.S. Postal Service (USPS) (see Chapter 3). Census Bureau plans do not include the use of any other source of administrative records to update the MAF.13 While the use of administrative records might help to reduce one component of the undercount (housing units not listed correctly in the MAF), we acknowledge that the state of research on the development of administrative records databases is not yet sufficiently advanced to make such use efficient. It is feasible to screen only a small number of new addresses. Earlier research suggested that the administrative databases created for the three test sites contained substantially more unique addresses than the total number in the census MAFs; it is unclear to what extent this is still true for more recent research conducted with improved administrative databases. We encourage the Census Bureau to do some exploratory research aimed at understanding what the excess addresses represent.

13   Historically, the Census Bureau has not defined USPS databases as administrative records.

Enumeration of Special Population Groups

Special population groups are so designated because their enumeration requires different procedures than those used with the rest of the population. The Bureau hopes that administrative records will be able to replace a number of advance listing operations used with special populations and will be able to provide data on people residing in group quarters or other special housing. To date, however, the panel has received very little information on the development of plans and research for such applications. The 1996 community census will include an evaluation of procedures to enumerate residents of group quarters, as well as American Indians living on reservations (Whitford, 1996). The panel has not yet been briefed on the Bureau's plans for such uses of administrative records, and therefore is not certain how the research that is being conducted as part of the 1996 community census will further the Bureau's goals.

Reduction of the Nonresponse Follow-Up Workload

The proposed use of administrative records as a replacement for a portion of nonresponse follow-up was the subject of extensive research in the 1995 census test.
A reevaluation of the 1995 findings was released in late December 1996. Further research is planned, following the 1996 community census. The Bureau's recently announced decision not to use administrative records for this purpose in the 2000 census has significant implications for the research agenda.

To evaluate the potential for administrative records to be used to impute short-form items to nonresponding households prior to nonresponse follow-up enumeration, the Census Bureau matched the records from the administrative records database to the census data capture file in each site. For self-reported households--that is, those that returned census questionnaires--49 percent of the census households in Paterson, 90 percent of those in Oakland, and 80 percent of those in Louisiana could be matched to a database record on the basis of complete address. For the sample of nonresponse follow-up enumerated households, these match rates were 37 percent, 83 percent, and 72 percent, respectively.14 The match rates for nonresponse follow-up households were consistently lower but not dramatically so.

To determine how often a database household would yield a correct imputation for a census household, the Census Bureau matched people within the matching households. If all of the people in both the census and administrative households matched, this was defined as a whole household match. For imputation purposes, it is important that these whole household match rates be higher than the rates for the address matches (given above). The Census Bureau intended to impute only 5 percent of the nonresponse follow-up workload but hoped to impute these few households correctly. To achieve this would require high match rates at the household level. For self-reported households, the whole household match rates among address-matched households in Paterson, Oakland, and Louisiana were 29 percent, 32 percent, and 29 percent, respectively. For nonresponse follow-up households, however, the match rates for the three sites were only 7 percent, 8 percent, and 13 percent, respectively. In other words, if nonresponding households were imputed unconditionally in the three sites, these low frequencies indicate how rarely the imputed result would match the census outcomes.

These low match rates support the Bureau's decision not to use administrative records to impute nonresponse follow-up households in 2000. On the basis of these findings, we support that decision. Furthermore, in our view, too little time remains to dramatically improve these match rates. Before the decision, the Census Bureau had initiated research aimed at trying to predict which nonresponse follow-up households could be correctly imputed. We are not persuaded that this strategy would ever be successful. The low match rates that the Census Bureau obtained at the three test sites imply the need for an exceedingly powerful predictive model.
Furthermore, the low match rates are in part an artifact of certain limitations of the 1995 study. Basing predictive models on the 1995 study, in any event, is not likely to produce findings that can be applied in other contexts.

The striking contrast between the self-reported and nonresponse follow-up enumerated households demands further study. Can this difference be due entirely to the correlation between a household's probability of nonresponse to the census and its probability of having incorrect or out-of-date administrative data, or is there another explanation? In particular, can one totally rule out any effect of the census data collection mode on the quality of the census data?

It can be argued that whole household matches to the data collected by the census represent too narrow a definition of what constitutes good household data from administrative records. Households in which all of the census people were matched by administrative records but the database contained other people accounted for numbers comparable to the whole household matches in Oakland and Louisiana, although not quite one-half of the whole household matches in Paterson. The combination of whole household matches and this particular type of partial household match, representing census households in which all people were matched by people in the corresponding administrative households, accounted for 11 percent of the address-matched nonresponse follow-up households in Paterson, 16 percent in Oakland, and 25 percent in Louisiana. For self-reported households, the corresponding percentages were 40 percent in Paterson and 50 percent in the other two sites.

14   It is difficult to reconcile this match rate for Louisiana to the very low match rates for the individual source files in that test site. Unless there is an error (perhaps in the rates for individual files), this result could be explained only if the administrative records in Louisiana had records for nearly all of the census units and had additional, bad addresses numbering about six times the number of good addresses.

The fact that the 1995 test data were lacking identifying information for dependents listed on tax returns may have contributed in a major way to the low match rates observed at the test sites. The 1996 census test will demonstrate to what extent the addition of dependent Social Security numbers to the IRS files can improve the whole household match rate.15

The highest whole household match rates, by far, were obtained for households of size one, as measured by the census. These rates varied from 17 to 27 percent of the households with address matches in the three test sites. Adding partial matches in which the administrative records database accounted for the census people but included other people raises these rates to between 28 and 50 percent. That they are not higher may be attributable to the above average mobility of single-person households. Nevertheless, these match rates provide an indication of the level of matches that might be achieved in larger households if dependents were added to the IRS records. The IRS records will not capture all dependents, but if they did, one might expect the match rates for households above size one to exceed those of size one because larger households tend to be more stable as to their place of residence. The single greatest improvement may be the addition of data on dependents, which the IRS collects but which were missing from the tax year 1993 data used in the 1995 evaluation.
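
The match categories discussed in this section can be illustrated with a short sketch. It is a hypothetical illustration, not the Census Bureau's matching software: it assumes that people in each address-matched pair of census and administrative households have already been linked to common person identifiers and that every census household lists at least one person, and the function and category names are invented here. A pair is a whole household match when both rosters contain exactly the same people; it falls into the broader category when every census person is found in the administrative household but the database lists additional people.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Set, Tuple

Household = Set[Hashable]  # set of linked person identifiers (assumed)


def classify_household(census: Household, admin: Household) -> str:
    """Classify one address-matched pair of household rosters."""
    if census == admin:
        return "whole"           # same people in both rosters
    if census < admin:
        return "census_covered"  # all census people found, plus extra people
    if census & admin:
        return "partial"         # some, but not all, census people found
    return "none"                # no person in common


def match_rates(pairs: Iterable[Tuple[Household, Household]]) -> Dict[str, float]:
    """Rates among address-matched households, the denominator used in the text."""
    counts = Counter(classify_household(c, a) for c, a in pairs)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no address-matched households supplied")
    whole = counts["whole"] / total
    covered = counts["census_covered"] / total
    return {"whole": whole, "whole_or_census_covered": whole + covered}
```

Because both rates share the same denominator, the second can never be smaller than the first; the gap between them is the share of households whose administrative roster contains everyone the census found plus at least one additional person.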

Recommendation: Efforts to improve the match between administrative data and census data should be given high priority. These efforts should include attempts to improve the quality of the addresses obtained from administrative records (and the identification and deletion of nonresidential addresses) and to improve the coverage of household members.

Recommendation: The Census Bureau should defer any research directed at predicting which administrative records are most likely to match census household records--at least until the match rates between administrative records and census data are substantially improved.

Recommendation: The Census Bureau should take several steps to evaluate the potential contribution to an administrative records database of dependent data captured on tax returns.

  •   Determine what dependent Social Security numbers the IRS now captures electronically on a 100 percent basis and obtain these data on all future IRS files delivered to the Census Bureau.

  •   Determine whether the IRS captures dependents' names in addition to their Social Security numbers and, if so, secure their inclusion on future IRS files delivered to the Census Bureau.

  •   Determine whether the IRS captures dependents' residential status (living at "home" or away), individually or collectively, and, if so, take steps to ensure its inclusion on future IRS files delivered to the Census Bureau.

  •   Design and carry out a research project using the 1994 tax year file with dependents' Social Security numbers to assess the potential improvement in whole household matches in the 1995 test sites. Adding these findings to those obtained in the 1996 community census will provide a better measure of the contribution of data on dependents. More importantly, the Census Bureau can proceed with the 1995 reevaluation while waiting for the 1996 community census data.

15   There are a number of differences between the census and the IRS data with respect to where dependents are recorded as residing. These differences must be resolved in some manner if the intent in using the IRS data is to replicate census residence.

Augmentation of Integrated Coverage Measurement Household Rosters

The prospective use of administrative records as a source of roster names in the integrated coverage measurement interviews raises concerns about privacy and confidentiality that the Bureau has not fully addressed. Does such usage violate the confidentiality restrictions attached to the acquisition of such data? The state food stamp offices that provided data for the 1995 census test interpreted their protection of clients' confidentiality as not allowing the use of names from food stamp records in those households for which food stamps were the only administrative record source. Ultimately, the public and the agencies that lend their administrative records for the census may view such use as innocuous.
It is important, however, that the Bureau lay the appropriate groundwork in terms of explaining the procedures and its rationale before committing to a course that could risk adverse effects on the census. Such groundwork should include not only the agencies that provide the records, but also organizations that are concerned about issues of privacy and confidentiality. The focus group research done to date has not addressed this particular use of administrative records, but focus group participants have expressed concerns about applications that appear to be much more innocuous (Aguirre International and Bates, 1996). The focus group results suggest that, at best, the Census Bureau may face a difficult task in convincing the public to accept the use of administrative records for the census, although the group participants generally have little familiarity with the issues and with Census Bureau and other agency uses of records. This situation suggests that public education campaigns have the potential to be very effective in developing public understanding and support.

The use of administrative records in the integrated coverage measurement roster also raises statistical concerns. In small-scale tests in Oakland and Paterson, 18 to 32 percent of the people identified in the administrative records database who did not match individuals listed in the census interview as residents of the same household were found to be census day residents who were also missed by Census-Plus. This result suggests a significant potential to add people to census households by using data from administrative records to bolster the integrated coverage measurement rosters, but the magnitude of the additions at these two sites raises concern about possible biases that the methodology might introduce. Is there any tendency for respondents to overstate census day residents if presented with names identified as coming from government records? In addition, the length of time between census day and the integrated coverage measurement interviews adds a recall effect to the measured reaction to names from administrative records. These questions are important because if there is indeed any response bias, its effect will be multiplied by its occurrence in the integrated coverage measurement sample.

Recommendation: The Census Bureau should address statistical measurement issues, along with the privacy and confidentiality concerns, before committing to the use of administrative records to augment integrated coverage measurement household records for the 2000 census. Addressing privacy and confidentiality issues should include making contact with appropriate organizations.

Imputation of Missing Long-Form Items

As with the enumeration of special population groups, the panel has yet to see either a plan or any Census Bureau research relating to the use of administrative records to impute missing items from the long form. Any plan to use administrative records to impute missing items on the long form seems rather unrealistic at this time.
Obviously, when people can be matched between administrative records and the census, using administrative records to impute missing items sounds reasonable. But what items would be imputed from what sources? More importantly, when will the Census Bureau evaluate such imputations? Such use would require an administrative records database that provides good coverage of households and their members and good coverage of long-form items. With much work remaining to be done in the development of a satisfactory database, and the decision not to use administrative records to provide short-form data for part of the nonresponse follow-up workload, we see little prospect for the successful use of administrative records to impute missing long-form items in the 2000 census. Unless the Census Bureau can develop a sound plan in fairly short order, and demonstrate some encouraging research results with the 1996 test or even the 1995 test, we suggest that the Bureau not continue with its plans.

Estimation for Nonrespondents Not Selected in the Nonresponse Follow-Up Sample

The Census Bureau has supported research on the use of administrative records as a component in the imputation of nonresponding households that were not included in the nonresponse follow-up sample (Zanutto, 1996). This is not one of the uses that the Census Bureau has announced for administrative records, but it may prove to be one of the most promising. The use of administrative records as a component in the estimation methodology should be less controversial than almost any other application: administrative records would not replace direct enumeration, but would instead provide some data for households whose characteristics would otherwise be estimated without the benefit of any reported information. (The Census Bureau has not discussed this use or the estimation methodology with the panel.) If it can be demonstrated that the introduction of administrative records can improve the imputation of nonsampled nonresponse follow-up households, then we believe that such use of administrative records should be given serious consideration. At this point the evidence is unclear. Improvements in the construction of an administrative records database are needed to address other applications, and the potential contribution of administrative records to the estimation of nonrespondents cannot be evaluated adequately until better administrative data become available (Zanutto, 1996; Zanutto and Zaslavsky, 1996a, in press).

Experimentation with an Administrative Records Census

The data requirements for a complete administrative records census are much more stringent than those of the other applications that have been discussed. Complete coverage of households is absolutely critical, for example, and all short-form items must be available for all households. To the extent that these requirements cannot be met, other ways to obtain the missing information must be developed. The Census Bureau's agenda and research are in the early stages of development, and it is premature to attempt any assessment of them.

CONCLUSION

The Census Bureau must be able to demonstrate in the 1998 census dress rehearsal that an administrative records methodology or set of methodologies can be implemented in real time and yield satisfactory results. After 1998 there will be no additional data for evaluation. Nor will the census schedule allow time to modify plans. The Census Bureau must also make certain that neither the content, structure, nor delivery dates of the administrative records needed for the 2000 census will change in ways that would invalidate the tested procedures or create excessive delays in file preparation.

We cannot yet assess the planned application of administrative records to the enumeration of special population groups, which is being tested with data collected in the 1996 community census. Nor can we yet assess the prospects for successful experimentation with an administrative records census, although we believe that it is important not to lose the opportunity presented in 2000. Another of the originally planned uses--to reduce the nonresponse follow-up workload by imputing 5 percent of the nonresponding households--has already been rejected by the Census Bureau, and there are no explicit plans to use records other than the U.S. Postal Service Delivery Sequence File for updating the Master Address File. We are skeptical for a number of reasons about the feasibility of using administrative records to augment the integrated coverage measurement household rosters and to impute missing long-form items. We are intrigued, however, by the prospective use of administrative records as a component of the estimation methodology for census nonrespondents who are not selected for nonresponse follow-up. To support this and possible other uses of administrative records, we recommend (above) continuing research on the development of strategies for pooling multiple administrative records files to enhance their coverage and the quality of the data they contain, along with a number of specific research tasks that will assist this effort.