states’ individual needs. They identified several elements as essential to successful collaboration: a clear, shared mission that meets the needs of each participating state; realistic expectations of what is to be accomplished; a governing board with decision-making authority; and expert advisors to assist in maintaining quality.
The fourth design team was asked to consider the design of a science assessment to meet the requirements of NCLB in the context of psychometric and practical considerations that states are likely to face. This team’s job was to think about the choices states would have—and the constraints they would face—in trying to adapt a fairly typical assessment program to the NCLB requirements, while maintaining validity and reliability. The team assumed that the basic elements of the program that could be reconfigured would include content standards, test blueprints, test items, scoring methods, measurement models, scaling and equating procedures, standard-setting methods, and reporting procedures.
After reviewing each of these elements and their implications for the outcome, the team developed a hybrid test design that incorporates a variety of elements in common use. Their aim was to develop a model that would bring simplicity and clarity to a complex domain. While the design calls for innovative items that target significant aspects of science learning, it focuses on the collection of summative information of the kind typically used for accountability, with the proviso that formative information from classroom assessments and other tools would be collected separately.
The design uses a matrix-sampling model similar to that used in the National Assessment of Educational Progress, in which different students take different test forms so that, collectively, a broad content domain can be covered. This design also allows for the inclusion of sets of common items that can be used to compare performance among schools and districts over time. A version of vertical scaling, in which assessments are linked through items measuring common content across grades, allows for monitoring of growth over time. Moreover, a subset of the test forms could focus on different aspects of the entire domain; as a result, no one test administration would cover the whole domain, but that fact would not be license for teachers or schools to neglect the content not included in any one year.
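The matrix-sampling logic can be sketched in a few lines of code. The sketch below is purely illustrative and not part of the committee's design: the item counts, the size of the linking block, and all names are hypothetical, chosen only to show how forms that each cover part of an item pool can share a common block of linking items while together covering the whole domain.

```python
# Illustrative sketch of matrix sampling (hypothetical numbers throughout):
# each student takes one of several forms; each form carries a shared
# "linking" block plus a unique slice of the item pool, so the full domain
# is covered across students even though no student sees every item.
import random

ITEM_POOL = [f"item_{i:02d}" for i in range(30)]   # full content domain
LINKING_BLOCK = ITEM_POOL[:6]                      # common items on every form
UNIQUE_ITEMS = ITEM_POOL[6:]                       # divided among the forms
N_FORMS = 4

def build_forms(unique_items, n_forms):
    """Split the non-linking items evenly across forms; every form also
    carries the linking block used to place forms on a common scale."""
    forms = []
    for f in range(n_forms):
        slice_ = unique_items[f::n_forms]          # round-robin split
        forms.append(LINKING_BLOCK + slice_)
    return forms

def assign(students, forms, seed=0):
    """Spiral the forms through a shuffled student list so each form is
    administered to a comparable random subsample."""
    rng = random.Random(seed)
    order = students[:]
    rng.shuffle(order)
    return {s: forms[i % len(forms)] for i, s in enumerate(order)}

forms = build_forms(UNIQUE_ITEMS, N_FORMS)
assignment = assign([f"student_{i}" for i in range(20)], forms)

# Together the forms cover the whole item pool, although each form
# (and hence each student) sees only 12 of the 30 items.
covered = set().union(*forms)
assert covered == set(ITEM_POOL)
```

The shared linking block is what makes cross-form comparisons possible: because every form contains it, performance on the unique slices can be placed on a common scale, which is the same role linking items play in cross-grade vertical scaling.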
The team acknowledges that the design is complex and that it entails demanding statistical analysis procedures, but believes that it successfully balances the need for broad content coverage with the demands for strict comparability that arise when a significant purpose of the testing is accountability.
In addition to the models described above, the committee sought insight from approaches to assessment that have been developed in other countries. In a