How easily might innovative assessments used for summative, accountability purposes meet these goals? First, Zwick observed, many aspects of the current vision of innovative assessment (e.g., open-ended questions, essays, hands-on science problems, computer simulations of real-world problems, and portfolios of student work) were first proposed in the early 1990s, and some date back as far as the work of E.F. Lindquist in the early 1950s (Lindquist, 1951; also see Linn et al., 1991). She cited as an example a science test that was part of the 1989 California Assessment Program. The assessment consisted of 15 hands-on tasks for 6th graders, set up in stations, which included developing a classification system for leaves and testing lake water to see why fish were dying (see Shavelson et al., 1993). The students conducted experiments and prepared written responses to questions. The responses were scored using a rubric developed by teachers.
This sort of assessment is intrinsically appealing, Zwick observed, but it is important to consider a few technical questions. Do the tasks really measure the intended higher-order skills? Procedural complexity does not guarantee cognitive complexity, she noted, and, as with multiple-choice items, teaching to the test can undermine the value of the results. If students are drilled on the topics of the performance assessments, such as geometry proofs or 20-minute essays, they may not need to use higher-order thinking skills when tested, because they have memorized how to do the tasks.
Another question is whether the results can be generalized across tasks. Can a set of hands-on science tasks be devised that could be administered efficiently and from which one could generalize broad conclusions about students’ science skills and knowledge? Zwick noted that a significant amount of research has shown that for real-world tasks, the level of difficulty a task represents tends to vary across test takers and to depend on the specific content of the task. In other words, there tend to be large task-by-person interactions. In a study that examined the California science test discussed above, for example, Shavelson and his colleagues (1993) found that nearly 50 percent of the variability in scores was attributable to such interactions (see also Baker et al., 1993; Stecher and Hamilton, 2009).
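To make the idea of a task-by-person interaction concrete, the following is a minimal sketch of how score variability might be partitioned in a simple design in which every student completes every task once. It is an illustration only, not the analysis used by Shavelson and his colleagues, whose study may have involved additional sources of variation. The function name `variance_shares` and the scores in the usage example are hypothetical. With a single score per student per task, the interaction cannot be separated from random error, so the two are reported together as one residual component.

```python
def variance_shares(scores):
    """Estimate variance shares for a crossed person-by-task design.

    `scores` is a list of rows, one row per person, one column per task,
    with exactly one score per cell. Returns the proportion of total
    estimated variance attributable to persons, tasks, and the
    person-by-task interaction (confounded with error).
    """
    n_p = len(scores)
    n_t = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_t)
    person_means = [sum(row) / n_t for row in scores]
    task_means = [sum(row[t] for row in scores) / n_p for t in range(n_t)]

    # Mean squares from a two-way layout without replication.
    ms_p = n_t * sum((m - grand) ** 2 for m in person_means) / (n_p - 1)
    ms_t = n_p * sum((m - grand) ** 2 for m in task_means) / (n_t - 1)
    ss_res = sum(
        (scores[p][t] - person_means[p] - task_means[t] + grand) ** 2
        for p in range(n_p)
        for t in range(n_t)
    )
    ms_res = ss_res / ((n_p - 1) * (n_t - 1))

    # Expected-mean-square estimates; negative estimates are set to zero.
    var_p = max((ms_p - ms_res) / n_t, 0.0)
    var_t = max((ms_t - ms_res) / n_p, 0.0)
    var_res = ms_res

    total = var_p + var_t + var_res
    if total == 0:
        return {"person": 0.0, "task": 0.0, "person_x_task_residual": 0.0}
    return {
        "person": var_p / total,
        "task": var_t / total,
        "person_x_task_residual": var_res / total,
    }


if __name__ == "__main__":
    # Hypothetical rubric scores: 4 students (rows) by 3 tasks (columns).
    example = [
        [4, 2, 5],
        [3, 5, 2],
        [5, 3, 3],
        [2, 4, 4],
    ]
    for source, share in variance_shares(example).items():
        print(f"{source}: {share:.2f}")
```

A large residual share in such an analysis means that which tasks a student finds hard depends heavily on the student, so a score based on only a few tasks says little about how that student would perform on other tasks from the same domain.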
Yet another question is whether such tests can be equitable. As is the case with multiple-choice tests, Zwick noted, performance tests may inadvertently measure skills that are irrelevant to the construct—if some students are familiar with a topic and others are not, for example, or if a task requires students to write and writing skills are not the object of measurement. Limitations in mobility and coordination may impede some students’ access to hands-on experiments at stations or their capacity to manipulate the materials. Some students may have anxiety about responding to test items in a setting that is more public than individual paper-and-pencil testing. Almost any content and format could pose this sort of issue for some students, Zwick said, and research has shown