AN IMPORTANT PART OF PLANNING for the decennial census is testing—trying out new procedures and techniques in order to finalize the census design before the count begins. A regular feature of the census process since the 1940 census, the Census Bureau’s program of intercensal tests has pursued several major directions (Bailar, 2000):
major changes in census methodology (most notably the conversion to mailout/mailback as the dominant mode of census collection and the use of sampling);
techniques to improve and to better measure census coverage;
improved questionnaire wording and format;
new technology; and
improved census processing.
From all indications, the Census Bureau is not eager to repeat the experience of the 2000 census, in which a general census design was settled so late that the effectiveness of operational testing was limited. Under the heading “Lessons Learned from Census 2000,” Waite (2002) emphasized the importance of effective testing: “if we want to achieve our Census 2010 Goals, operational testing of design infrastructure must start early in
the decade and continue through the Dress Rehearsal.” In particular, the census dress rehearsal—typically held two years prior to census deployment—should properly be a comprehensive run-through of the census machinery to fine-tune the final census design. In 1998, however, the dress rehearsal instead had to serve as a feasibility test for three quite different general designs involving different levels of sampling (see Section 2-B; National Research Council, 2001a).
As depicted in Table 2-1, milestones in the 2010 planning process include major census tests roughly every other year leading up to 2010. Of these, one is already complete—the 2003 National Census Test, described in Box 9.1—and the 2004 Census Test is currently being fielded (see Box 9.2). Only two major testing opportunities remain prior to 2010: the 2006 census test, which the Census Bureau has described as a systems test, and the 2008 dress rehearsal.
In this chapter, we discuss some of the basic constraints on census testing (Section 9-A). We then briefly describe our basic recommendation to the Census Bureau regarding the shape of the remaining census tests—namely, that the 2006 test should be cast as a vital proof-of-concept test (Section 9-B). In the last section, we outline several priorities for census testing in the remaining years prior to the 2010 census (Section 9-C).
9–A CONSTRAINTS ON CENSUS TESTING
The testing program for a decennial census faces a number of constraints and difficulties, some of which are unique to the census context but most of which are commonly faced by businesses or agencies in developing products or systems. The completion of a test plan for the 2010 census must try to strike a balance between these competing constraints.
Of these constraints, perhaps the most pressing—and the most common—is the need to match test activities to available resources. In the development cycle of a product or system, testing can sometimes be seen as an end-of-process activity, something to be done with whatever resources—monetary and person-hours alike—remain at the end of a project. In the census context, this has often been the case in the development of computer-assisted interviewing instruments by the Census Bureau and other organizations (National Research Council, 2003b). On this account, the usual pattern of testing between decennial censuses compares favorably with that of some other sectors: a regular set of test activities is scheduled throughout the development process, so that testing is a more constant presence during development. Still, the Census Bureau faces the unique problem of constructing a rigorous testing protocol under available appropriated funds, and the peak years for census testing—the middle, not the end, of the decade—are historically lean for funding of decennial census activities.
Limited census test budgets can affect not only the range of design options to be tested but also the means by which a test is conducted. For example, the resources available to conduct a census test may matter more than methodological considerations in determining the number of test sites (as with the 2004 test; see Box 9.2). Likewise, available resources, rather than a power analysis of the sample sizes needed to measure effects with desired precision, may determine test sample sizes. The decision to conduct the race and Hispanic origin question-wording module of the 2003 National Census Test was likely one of convenience, making use of an available test activity that involved a nationally representative sample from 2000 census mailout/mailback areas. A more meaningful test of sensitivity to wording and format of the race and ethnicity questions, however, would likely require more refined targeting of predominantly minority and Hispanic neighborhoods as well as field interviewing.
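The power analysis mentioned above is a routine calculation; the sketch below (in Python, using purely hypothetical response rates rather than the Bureau's actual planning values) shows how a per-group sample size follows from the size of the effect to be detected and the desired power.

```python
import math
from statistics import NormalDist

def two_proportion_sample_size(p1, p2, alpha=0.05, power=0.80):
    """Per-group sample size needed to detect a difference between two
    proportions with a two-sided z-test (normal approximation)."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for the test
    z_b = NormalDist().inv_cdf(power)           # quantile for desired power
    p_bar = (p1 + p2) / 2
    numerator = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)

# Hypothetical planning values: detect a two-point drop in mail response
# (64 percent vs. 62 percent) with 80 percent power.
n_per_group = two_proportion_sample_size(0.64, 0.62)
```

Detecting small differences in response rates of this kind requires thousands of households per treatment group, which is why budget-driven sample sizes can leave a test underpowered.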
A second, and related, major constraint on census testing is that—with the limited opportunities over the course of a decade—census tests are often formulated as test censuses. That is, the major census tests often have an omnibus character, typically involving most parts of the basic census process (see Section 2-A). Even though the tests may not provide test site locations with a new population count, it is often entirely possible that such counts could be derived, given the completeness of the process embodied in the test. There are several good reasons for the omnibus nature of census tests. It gives the major census tests the advantage of the verisimilitude of a decennial census, providing a realistic environment in which to test changes to specific techniques and allowing detection of unintended con-
Box 9.1 2003 National Census Test
Between February and April 2003, the Census Bureau conducted a National Census Test (NCT) involving approximately 250,000 households drawn from areas enumerated by mailout/mailback methods in the 2000 census. The NCT was strictly a mailout test, and so did not involve field enumerators to perform nonresponse follow-up.
The 2003 test focused primarily on two issues:
The test was rounded out by a control group of 20,000 households; this group’s questionnaire included the race and Hispanic origin questions worded as they were in the 2000 census (unlike the 2000 census context, the control group households were eligible for a replacement questionnaire in nonresponse follow-up in the 2003 test).
The samples for all groups were stratified by 2000 census response rate, with areas classified into “high” and “low” response groups based on a selected cut-off. Martin et al. (2003:11) comment that the low-response strata “included areas with high proportions of Blacks and Hispanics and renter-occupied housing units” and further note that addresses in low-response areas were oversampled. Still, it is unclear whether the sample design generated enough coverage in Hispanic communities to support conclusive comparisons—that is, whether it reached a broad enough cross-section of the populace and a sufficiently heterogeneous mix of Hispanic nationalities and origins to gauge sensitivity to very slight and subtle changes in question wording.
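The stratified design with oversampling described above can be sketched in a few lines; the strata names and sampling fractions below are illustrative only, not the actual 2003 design parameters.

```python
import random

def stratified_sample(frame, fractions, seed=2003):
    """Sample each stratum at its own rate; a higher fraction for the
    low-response stratum implements oversampling of those areas."""
    rng = random.Random(seed)
    sample = []
    for stratum, addresses in frame.items():
        k = round(len(addresses) * fractions[stratum])
        sample.extend(rng.sample(addresses, k))  # draw without replacement
    return sample

# Hypothetical address frame and fractions: low-response addresses
# are drawn at three times the rate of high-response addresses.
frame = {"high_response": [f"H{i:04d}" for i in range(5000)],
         "low_response":  [f"L{i:04d}" for i in range(5000)]}
sample = stratified_sample(frame, {"high_response": 0.02,
                                   "low_response": 0.06})
```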
With regard to the response mode and contact strategy portion of the test, results reported by Treat et al. (2003) suggest that multiple response mode options may change the distribution of responses by mode—shifting some would-be mail responses to Internet, for instance. However, the addition of choices does not generally increase cooperation overall. The experience of the 2003 test suggests serious difficulties with the interactive voice response option; 17–22 percent of IVR attempts had to be transferred to a human agent when the system detected that the respondent was having difficulty progressing through the IVR questionnaire. Moreover, rates of item nonresponse were greater for IVR returns than for the (paper response) control group. Internet returns, by comparison, experienced higher item response rates than the control. As has been indicated in past research, reminder postcards and replacement questionnaires had a positive effect on response.
Martin et al. (2003) report that the race and Hispanic origin question segment of the test showed mixed results. Predictably, eliminating “some other race” as a response category reduced “some other race” responses considerably, by 17.6 percent (that is, Hispanic respondents apparently declined to write in a generic response like “Hispanic” or “other” when “some other race” was not a formal choice). The Bureau concluded that the 17.6 percent decline in generic race reporting “more than offset” a 6.4 percent increase in the estimated number of Hispanics declining to answer the race question altogether (Martin et al., 2003:15). Adding examples of ancestry groups (e.g., Salvadoran, Mexican, Japanese, Korean) appeared to boost the reporting of detailed origins among Hispanics, Asians, and Native Hawaiian or Other Pacific Islanders. Treatment groups with revised instructions directing respondents to answer both the race and Hispanic origin questions produced the most puzzling results: levels of missing data on one or both questions increased, as did the percentage of respondents reporting themselves as Native Hawaiian or Other Pacific Islander (relative to the control group).
sequences that a change might introduce in other parts of the process (e.g., difficulties that a change in questionnaire format might cause downstream in data capture). Another benefit of test censuses as census tests is that they provide an opportunity to “keep the wheels greased”—that is, they are a check to see that the complete census machinery is still in working order. But the test census model also creates difficulties; being more elaborate and involved, these tests can take longer to process and evaluate, thus potentially slowing feedback to the overall census planning process and to the development of subsequent tests. Another difficulty is the basic one of confounding: the simultaneous testing
Box 9.2 2004 Census Field Test
At this writing, the 2004 Census Field Test is being administered at test sites in two states: Colquitt, Thomas, and Tift Counties, Georgia, and a portion of northwestern Queens County (Queens Borough), New York. [Lake County, Illinois, was originally designated a test site but was dropped after the Bush administration proposed its budget for fiscal 2004.] Although field work will be done at each of the test sites and the activity will, in some respects, resemble a census in miniature, the Census Bureau is not promising or even offering participating sites a population count at the end of the test. The test is intended to include approximately 200,000 housing units.
The operational plan for the 2004 test suggests four major topics (U.S. Census Bureau, 2003a:4–5):
Other topics that were originally intended for inclusion in the 2004 test have subsequently been dropped from the test plan; these include the mailing of a dual-language (English and Spanish) questionnaire to targeted households and—significantly, given our discussion in Chapter 3—targeted canvass methods to update the Master Address File.
Current plans call for 10 evaluation reports of the results from the 2004 test to be issued through 2005; tentative dates for initial draft reports range from February 8 through November 2, 2005, and dates for final reports range from March 31 through December 31, 2005. However, the operational plan states that “preliminary results from a sub-set of evaluations needed to inform plans for the 2006 Census Test will be available no later than December 31, 2004” (U.S. Census Bureau, 2003a:7).
of multiple design options may make it difficult or impossible to assess the impact of one particular design option.
A final basic constraint on census testing is the maturity of technologies, systems, and methodologies. As we discussed in Section 6-D, new technology is inherently difficult to manage precisely because it is new and evolving. In terms of testing, new technologies bring with them a fundamental dilemma: the continued maturing of the technology depends on the results and feedback from testing, but there is a natural reluctance to test until the technology is mature (or at least reasonably so). The same dilemma arises when testing systems or groupings of technologies, as in the census context: reluctance to test one design option may hold back development of other design options that depend on the first in order to proceed fully. In the specific context of the 2010 census, for example, the Census Bureau cannot wait to begin testing until portable computing devices are fully mature or until the MAF/TIGER database structure is complete and in operation, and it certainly cannot wait for the MAF/TIGER work to be completed before beginning PCD development, even though the former is a key information input to the latter. [These examples merely illustrate a basic interdependence; we do not imply that the Bureau is waiting until completion to test either of these elements.]
9–B THE 2006 CENSUS TEST AS A PROOF OF CONCEPT
The Census Bureau needs to make optimal use of the few major testing opportunities remaining before 2010 so that viable approaches are well understood before the 2010 census design
is finalized, and it must do so while facing the considerable challenges described in the previous section.
In the panel’s assessment, the combination of the schedule of major census tests and the desire for a pure dress rehearsal in 2008 puts an enormous burden on the 2006 census test. The panel firmly believes that the 2006 test should be viewed as a proof-of-concept test: it should follow the census from end to end, to the greatest extent possible, using all available systems. More importantly, it should be cast as the proving ground for any remaining experimental questions in order to make the 2008 test a truly preoperational rehearsal. Any major 2010 census innovations should be identified in some moderately complete form by 2005 so that a reasonable version can be included in the 2006 census test.
In emphasizing the importance of the 2006 test, we believe that it is also important to make two points clear. First, the Census Bureau’s hope—shared by the panel—is that the 2008 activity will be, to a great extent, a pure dress rehearsal. That said, it is important to remember that it is also a test; some things will inevitably go right or wrong in 2008, and adjustments will be made accordingly. We note this to make clear that the 2006 test is not a completely hard-and-fast deadline for the inclusion of new technologies and techniques in the 2010 census. Some innovations cannot be tested in 2006 and will have to be tested in 2008; for example, it is unclear whether all functions of a redesigned MAF/TIGER database structure will be ready for 2006. What we hope to forestall, by recommending that 2006 be viewed as the proof of concept, is a repeat of the 2000 census cycle, in which major design considerations were left for the highly experimental 1998 dress rehearsal to resolve.
The second point that we wish to make clear in calling for a proof-of-concept test in 2006 is that this test should not be viewed as the only remaining opportunity for new and experimental techniques. When possible and as resources permit, the Census Bureau should make use of other opportunities to evaluate alternative components of the 2010 design. Such opportunities might include small-scale experiments and feasibility tests, use of focus groups or small-scale laboratory-based studies for issues such as questionnaire format and other matters involving
interviewer-respondent interactions, further analysis of the data collected in conjunction with the conduct of the 2000 census, and simulation studies (also often informed by data collected from the 2000 census). In addition, as discussed in Chapter 6, the development and comparison of alternative logical infrastructure models—each reflecting different assumptions and major design features—can be an informative way to test census systems in the abstract.
9–C DESIGN OPTIONS THAT SHOULD BE CONSIDERED FOR TESTING IN 2006
In this section, we briefly review design options that should be considered for evaluation, either as part of the 2006 census test or through other test opportunities as discussed above. We believe that the Bureau should know enough about each design option to fully inform a decision about the makeup of the census design used in the 2008 dress rehearsal and then, ideally with only minor modifications, in the 2010 census.
9–C.1 Human Factors for Portable Computing Devices
A critical area of concern in much of the new technology proposed for use in the 2010 census is that of human factors. Primary examples of areas where human factors require attention are: (1) the ability of field staff to learn quickly and reliably to use the portable computing devices (PCDs), including for navigation from one assignment to the next, the conduct and reporting of interviews, transmission of completed assignments, and receipt of new assignments; (2) the ability of field staff to use the ALMI (a laptop computer) or, possibly, GPS-equipped PCDs to capture updated coordinates for TIGER; and (3) the respondent interface provided not only by paper questionnaires but also by the electronic questionnaires used in the Internet and (possibly) interactive voice response modes, especially for foreign-language submissions. Of these, the highest priority is human factors in the use of portable computing devices; as we noted in Section 5-A, the ultimate success of the PCDs will rely crucially on their usability by a corps of enumerators with relatively little training and, likely, a wide range of familiarity with such devices. Therefore, human factors and the capacity of enumerators to use the PCDs successfully in nonresponse follow-up (as well as in other field activities, such as address canvassing and coverage improvement follow-up, that may be identified for PCD use) should be tested in 2006, most likely through small-scale feasibility tests.
9–C.2 Various Cost-Benefit Trade-offs
In addition to demonstrating the basic feasibility and effectiveness of particular design options, the 2006 test should be constructed to permit assessment of various important cost-benefit trade-offs, many of which we have described in this report. Some of these may also be amenable to small-scale tests and other research activities. Regardless of how the tests are performed, it is important to learn more about these trade-offs because initial judgments about them have been used to support proposed components in the plan for the 2010 census. Some of these trade-offs are as follows:
Use of PCDs for Follow-up Interviewing
In Section 5-A.2, we describe the primary argument that the Census Bureau has used to support the plans for use of PCDs in nonresponse follow-up work: namely, that the devices will substantially reduce the amount of paper involved in the census. With the reduction in paper, the Bureau has argued that the number and size of local census offices may be reduced and that significant increases in data capture efficiency will further reduce costs. As we also noted, the panel knows of no empirical evidence for these potential cost savings.
While the 2004 census test is intended to provide some information on this trade-off, it is not clear that it will be definitive in this regard. To a large degree, the 2004 census test appears to be an extended, large-scale feasibility test. Such a test will be important for gauging the reactions of enumerators and respondents alike to the use of the small devices, and worthwhile as a preliminary check on the feasibility of the direct connection between census headquarters and individual PCDs for transmitting completed questionnaires and new enumerator assignments. However, the PCD workflow for the 2004 test is still paper-driven in some respects, with printed progress reports circulating between headquarters and the regional and local census offices. Coupled with the limitation of the test to two sites, this makes it unclear how much information will be gained about the potential for reducing local census offices and how far the results can be generalized to justify the overall PCD cost-benefit trade-off.
To the extent that the 2004 census test is unable to definitively inform this trade-off, the 2006 test must be able to bridge the gap. The 2006 test should certainly reflect adjustments to the headquarters-PCD workflow identified as part of the 2004 test. The 2006 test should also draw from the 2004 experience in gauging how much time and resources may be required for training temporary enumerators on PCD usage. Further examination of the security concerns associated with transmitting data to and from the PCD, either by phone line (as in the 2004 test) or possibly by wireless communication, should also be done, with particular attention to cost-benefit considerations.
Use of the Internet for Census Delivery and Return
The Internet is being proposed both as a method to facilitate response to the 2010 census by those who use the computer for much of their correspondence and as a mechanism by which foreign-language questionnaires may be requested and administered to those who require them. Getting a better sense of what share of the population might be amenable to Internet response is an important cost-benefit consideration: Internet response reduces the amount of paper to be processed at the data capture stage (and otherwise handled and stored) and brings the benefits of automated data capture and the capacity for in-process edits and consistency checks. The cost of providing an Internet response option will be relatively minor, though the costs of security measures to protect against breaches by hackers must also be considered. Another major cost that must be weighed regarding Internet response is the potential for increased duplicate enumerations.
The 2003 census test took a first step toward addressing some of these concerns, but the 2006 test or other testing opportunities should be constructed to provide a more definitive assessment. The effect of pushing for Internet response (that is, providing directions for Internet response but no paper questionnaire to be returned by mail) should be measured; the 2003 test included an Internet choice as well as an option that pushed both the Internet and telephone-based interactive voice response, the latter of which encountered difficulties. Alternative ways of making the Internet response choice more prominent in the mailed materials should also be explored.
Use of Imputation to Replace Last-Resort Enumeration
As discussed elsewhere in this report, the process used in the 2000 census to treat initial item nonresponse differed from that used in the 1990 census: in 2000, the choice between imputation methods and more intensive field follow-up work was decided in favor of imputation. In particular, imputation was used for returns that were said to be data-defined—i.e., that met a minimum standard for completeness of data reporting, which in 2000 meant that at least one person on the form had reported at least two data items. It is important that the impact of reliance on imputation on resulting data quality be more fully assessed. While some information may be gained from further analysis of 2000 census operational data—for example, by studying the effects of imputation on the distribution of census variables—a fuller test of the trade-off between imputation and field work requires a full-census environment like the 2006 test. A test of this trade-off in 2006 should involve direct comparison of the two procedures, assessment of the additional costs of the field work, and evaluation of the quality of the information collected or imputed, using a reinterview survey.
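To make the data-defined criterion and the imputation alternative concrete, the sketch below encodes the minimum-completeness rule described above and a deliberately minimal sequential hot deck; the field names are hypothetical, and the hot deck is far simpler than the Bureau's actual imputation procedures.

```python
def is_data_defined(household):
    """A return is data-defined if at least one person on the form
    reported at least two items (the 2000 standard described above).
    Each person is a dict of items; None marks a missing item."""
    return any(sum(v is not None for v in person.values()) >= 2
               for person in household)

def hot_deck_impute(records, field):
    """Fill missing values of `field` from the most recently seen
    reported value in file order (a minimal sequential hot deck)."""
    donor = None
    for rec in records:
        if rec[field] is not None:
            donor = rec[field]
        elif donor is not None:
            rec[field] = donor
    return records

# Person 1 reported two items (age, sex), so this return is
# data-defined and would be accepted with remaining items imputed.
form = [{"age": 34, "sex": "F", "race": None},
        {"age": None, "sex": None, "race": None}]
```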
9–C.3 Other Testing Considerations
Throughout this report, we have identified additional key topics that should be considered for testing in 2006. We briefly list
them here, and refer to the appropriate sections in the text for additional discussion.
Coverage Evaluation Methodology (Section 7-A): Because assessment of the completeness of the census count is essential, the Census Bureau should develop candidate methods for coverage evaluation of the 2010 census, test them in the 2006 census test, and test the final methodology in the 2008 dress rehearsal.
Housing Unit Listing and Block Canvass Methodology (Sections 5-C.1 and 3-E.6): The 2006 test should be a forum to test revised procedures for the listing and coding of housing units on the MAF, with particular attention to the problem of effectively identifying units in small multiunit structures. In addition, the 2006 test should provide an opportunity to target MAF updating to specific (e.g., high-growth) areas, rather than the complete block canvass that currently seems to be the Bureau’s choice.
Residence Rules (Section 5-B.3): The 2006 census test should feature revised rules for residence; redesigned questionnaires with clearer definitions of residence rules for respondents should be developed and tested by cognitive researchers prior to 2006.
Targeted Replacement Questionnaire Processing and Timing (Section 5-D.3): Although the 2006 census test will not approach the volume of a full census, the test should be used to assess the speed with which replacement questionnaires for nonresponding households can be printed and mailed. It should also provide an opportunity to gauge the appropriate interval between initial questionnaire mailout and the replacement questionnaire mailing.
Routines for Unduplication (Section 5-E): Having gone through some initial testing in the 2004 test, techniques for unduplication based on name and date-of-birth matching throughout the census response file should be refined and implemented in 2006. Testing should assess not only the accuracy of matching and the statistical models used to determine matches, but also the actual time needed for
searching and matching. Such information is crucial to determining whether “real-time unduplication” is possible in the 2010 census.
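As an illustration of the kind of search involved, the sketch below blocks responses on date of birth and compares normalized names. This is a deliberately simple stand-in: actual unduplication relies on probabilistic matching models and far more careful name standardization, and timing such searches at full-census scale is precisely what the testing described above must establish.

```python
from collections import defaultdict

def normalize(name):
    """Crude name standardization: lowercase and collapse whitespace."""
    return " ".join(name.lower().split())

def candidate_duplicates(responses):
    """Group response IDs that agree exactly on date of birth and
    normalized name; groups of two or more are candidate duplicates."""
    blocks = defaultdict(list)
    for rid, name, dob in responses:
        blocks[(dob, normalize(name))].append(rid)
    return [ids for ids in blocks.values() if len(ids) > 1]

# Hypothetical responses: 101 and 102 are the same person retyped.
responses = [(101, "Ana Cruz",  "1970-03-02"),
             (102, "ana  CRUZ", "1970-03-02"),
             (103, "Ana Cruz",  "1981-11-30")]
dupes = candidate_duplicates(responses)
```

Blocking on an exact field such as date of birth is what makes matching tractable at scale, since only records within a block are compared; its cost is that duplicates with a mistyped date of birth are never examined.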
Use of Administrative Records in MAF Updating or Nonresponse Follow-up (Section 7-C.2): If use of administrative records is considered for inclusion in the 2010 census process (as opposed to a major experiment), that use should be factored into the 2006 census test and included in the 2008 dress rehearsal.
In addition to the areas listed above, we commented in Section 5-C that enumeration methods for special hard-to-count populations—including gated communities, colonias, linguistically isolated households, and the homeless—should be the focus of research well before the end-of-decade crunch immediately prior to the census. To the extent possible, revised methods for these populations should be tested in 2006 rather than waiting for the 2008 dress rehearsal (or even later). Likewise, we recommended a comprehensive reappraisal and redefinition of the methods used for the group quarters population (Section 5-B.2). In particular, improved methods for developing the roster of group quarters should be developed in time for the 2006 census test, as should techniques for integrating or cross-checking the group quarters list with the MAF. The forms used to collect information for the group quarters population should also be reexamined to determine whether they are appropriate to part or all of the group quarters population.
9–C.4 Site Selection
As a final remark, site selection for the 2006 census test is extremely important. The Census Bureau typically selects a small number of counties for its test censuses to provide an effective test of its procedures; the counties are chosen to represent urban and rural regions and to include various minority and nonminority groups. We urge the Census Bureau to select test sites that will provide a stringent and rigorous test of the various elements of the census design, so that the proof-of-concept test can best inform the reengineering of the 2010 census as a whole.