different mixes of people directly affected by the question changes (10 of 18 were renters rather than owners and 6 of 18 were in households with infants).

As we have observed the development of the mid-decade census tests, the panel has grown concerned that the Census Bureau appears to conduct very little experimentation and testing between these two extremes.

Finding 8.2: The Census Bureau often relies on small numbers (20 or fewer) of cognitive interviews or very large field tests (tens or hundreds of thousands of households, in omnibus census operational tests) to reach conclusions about the effectiveness of changes in census enumeration procedures. As a consequence, many important questions about the effectiveness of residence rules do not get addressed effectively.

To be clear, we do not suggest by this finding that there is anything necessarily wrong with tests that operate at these extremes. In particular, we do not mean in any sense to malign small-sample cognitive testing as a research tool by the Census Bureau; cognitive tests are definitely worth doing, since they are an excellent diagnostic process (and generator of research hypotheses) that can identify major problems with specific questionnaire items and formats and can highlight problems in logic and syntax. What we do argue is that it is possible to put too much weight on cognitive tests, whose sample sizes are too small and unrepresentative to support broad conclusions; filtering possibilities and eliminating potential approaches to practical census problems on the strength of comments from a very small number of interviews is too restrictive.

Likewise, there is benefit to the massive-scale census tests (or, more precisely, operational trials) that the Census Bureau regularly conducts. Particularly important is that they allow the Census Bureau to keep its field “machinery” well trained and in good working order; the sheer sample size that is possible in some of these trials also affords a variety and depth of response that is difficult to obtain through other means. However, the omnibus census tests also have conceptual weaknesses, as discussed in this report. By trying to coerce problems into a catch-all test, it is easy to “design” a test for which the great advantage of sample size is offset by the fact that the test reaches relatively few of the people who are most directly affected. As a previous study (National Research Council, 2004b) concluded, the 2003 census test, a major goal of which was to test the effectiveness of altered wording of the Hispanic origin question, was severely impaired because the test failed to adequately target responses from Hispanic communities. Also, even a relatively simple large-scale test (the 2000 AQE) can suffer from being forced into a large-test framework. Box 6-2 describes how the 2000 AQE questionnaire

