standards such as writing skill and the ability to communicate mathematically—require more costly hand scoring.
Similarly, the technical requirements for tests—particularly when consequences are attached to the results—have led some states to limit the use of performance items. Researchers have found that the technical quality of some performance items may have been too weak for use in high-stakes situations (Western Michigan University Evaluation Center, 1995; Koretz et al., 1993; Cronbach et al., 1994).
Researchers at the National Center for Research on Evaluation, Standards, and Student Testing (CRESST) have developed an approach to building performance assessment that is designed to link the expectations for learning embedded in content standards directly to the tasks on the assessment. Known as model-based performance assessment, the approach enables states and districts to assess student learning against standards while, through clear specifications, providing instructional guidance to classroom teachers (Baker et al., 1991, 1999; Baker, 1997; Glaser, 1991; Mislevy, 1993; Niemi, 1997; Webb and Romberg, 1992).
Validity of Inferences. The ability of tests to accomplish the reformers' aims of reporting student progress on standards and informing instruction depends on the validity of the inferences one can draw from the assessment information. For individual students, a one- or two-hour test can provide only a small amount of information about knowledge and skill in a domain; a test composed of performance items can provide even less information (although the quality of the information is quite different). For classrooms, schools, and school districts, the amount of information such tests can provide is greater, since the limited information from individual students can be aggregated and random errors of measurement will cancel each other out.
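The cancellation of random measurement error under aggregation can be illustrated with a simulation. The sketch below uses entirely hypothetical numbers (a true score of 70 and an error standard deviation of 10 points, both assumptions made only for illustration) to show that individual scores fluctuate widely while a 400-student school average is far more stable:

```python
import random
import statistics

random.seed(0)

# Hypothetical model: observed score = true score + random measurement error.
TRUE_SCORE = 70.0
ERROR_SD = 10.0  # assumed, for illustration only

def observed_score():
    return TRUE_SCORE + random.gauss(0, ERROR_SD)

# Spread of single-student scores across many simulated testings.
single_scores = [observed_score() for _ in range(1000)]

# Spread of school-level averages, each aggregating 400 students.
school_means = [statistics.mean(observed_score() for _ in range(400))
                for _ in range(1000)]

print(round(statistics.stdev(single_scores), 1))  # close to the 10-point error SD
print(round(statistics.stdev(school_means), 1))   # much smaller: errors cancel
```

The second figure is roughly the first divided by the square root of the group size (here, 400), which is why aggregate results support stronger inferences than individual ones.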
Yet the information available from large-scale tests about the performance of schools and school districts is still limited. One reason is that overall averages mask the performance of groups within the total, and the averages may be misleading. This is particularly problematic because variations within schools tend to be greater than variations among schools (Willms, 1998). For example, consider two schools. School A has a high proportion of high-achieving students, yet its low-achieving students perform very poorly. School B has fewer high performers than School A, yet its lower-achieving students perform considerably better than those at School A. Using only average scores, a district policy maker might conclude that School A is more effective than School B, even though a number of students in School A perform at low levels. School B's success in raising the performance of its lower-achieving students, meanwhile, would get lost.
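The School A/School B comparison can be made concrete with invented numbers. In this sketch (all scores and proportions are hypothetical, chosen only to reproduce the pattern described above), School A posts the higher average even though its low achievers score far below School B's:

```python
import statistics

# Hypothetical scores for two 100-student schools.
# School A: 60 high achievers scoring 95, 40 low achievers scoring 40.
school_a = [95] * 60 + [40] * 40
# School B: 45 high achievers scoring 85, 55 low achievers scoring 60.
school_b = [85] * 45 + [60] * 55

print(statistics.mean(school_a))  # 73.0  -- School A "wins" on the average
print(statistics.mean(school_b))  # 71.25

# But the subgroup comparison tells the opposite story for low achievers:
print(min(school_a))  # 40
print(min(school_b))  # 60
```

Reporting only the two means would rank School A ahead of School B; disaggregating by achievement group reverses the conclusion for the students reformers are often most concerned about.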
Determining whether a school's superior performance is the result of superior instructional practices is difficult in any case, because academic perfor-