A central tenet of the current "second wave" of education reform is that reliance on diverse performance assessments as the basis for school accountability will circumvent problems encountered with test-based accountability in the 1980s. This view appears to be overly optimistic.
There is some evidence, albeit limited, that shifting to performance assessments can address one of the two problems associated with test-based accountability: it can improve rather than degrade instruction. For example, Koretz et al. (1993, 1994a) found that Vermont educators reported a substantially greater emphasis on problem-solving and mathematical communication following implementation of the state's portfolio assessment program. Nevertheless, evidence about the instructional effects of performance assessment programs remains scarce. It is not clear under what circumstances these programs are conducive to improved teaching, or what their effects are on student achievement.
On the other hand, there is as yet no reason to believe that performance assessment will ameliorate the second problem, that of inflated test scores. Indeed, there are reasons to expect that some types of performance assessments may be more susceptible to corruption than conventional tests. Because of the complexity of performance tasks, scores on them are likely to include sizable task-specific but construct-irrelevant variance—that is, variance that reflects idiosyncratic aspects of the tasks rather than attributes of the latent trait supposedly being measured. For this reason, performance typically correlates only weakly across related tasks, even when the tasks are as similar as essays (e.g., Dunbar et al., 1991; Shavelson et al., 1993). Moreover, in most instances, performance assessments will comprise far fewer tasks than a corresponding multiple-choice test because of the student time each performance requires and the cost of developing the tasks. Thus, a performance assessment is likely to comprise a relatively small number of poorly correlated tasks with substantial idiosyncratic attributes. Coaching students to do well on a small number of such tasks could easily produce large gains in scores without commensurate real gains in mastery of the underlying domains of knowledge.
Performance assessment programs present other obstacles as well. They have been time consuming and costly to develop. Because of the amount of student time required to complete each performance, they increase the pressure to use a matrix-sampled design and to forego scores for individual students. In some cases they have proved difficult and costly to score (see Koretz et al., 1994a, b).
Finally, many innovative performance assessments have not yet been adequately validated. Although the tasks they comprise often seem appropriate, such "face validity," in the jargon of educational measurement, is not a sufficient basis for judging them to be valid measures. In most cases, little of the evidence needed for validation has been gathered, and in some cases the limited evidence available is discouraging. For example, it was noted earlier that the NAEP has