Even well-designed assessments will need to be augmented by other assessments. Most criterion-referenced tests are multiple-choice or short-answer tests. Although they may align closely with a standards-based system, other assessment components are also necessary to measure standards-based outcomes, such as performance measures, in which students demonstrate their understanding by doing something educationally desirable. A long-term inquiry that constitutes a genuine scientific investigation, for example, cannot be captured in a single test or even in a performance assessment allotted a single class period.
Several states and districts are making strides in expanding external testing beyond traditional notions of testing to include more teacher involvement and to better align classroom and external summative assessments, so as to better support teaching and learning. The state of Vermont (VT) was one pioneer. The state sought to develop an assessment system that both served accountability purposes and generated data that would inform instruction and improve individual achievement (Mills, 1996). The system had three components: students and teachers gathered work for portfolios, teachers submitted a “best piece” sample for each student, and students took a standardized test. Groups of teachers around the state used scoring rubrics and exemplars to score the portfolios and student work samples. Despite the different pieces in place (which also included professional development), the VT experiment produced mixed results and is still evolving. The scoring of the portfolios and student work samples lacked adequate reliability (in the technical sense) to be used for accountability purposes (Koretz, Stecher, Klein, & McCaffrey, 1994). Many teachers saw a positive impact on student learning, due in part to the focus and feedback on specific pieces of student work that teachers provided to students during the collection and preparation process (Asp, 1998), but they also acknowledged the additional time needed for portfolio preparation (Koretz, Stecher, Klein, McCaffrey, & Deibert, 1993).
Kentucky (KY) is another state that made changes to its system and faced similar challenges. The portfolio and performance-based assessment system in that state also did not achieve consistently reliable scores (Hambleton et al., 1995). The experience of both states demonstrates that achieving consistency across scores for samples of work requires training and time. Research on performance assessments in large-scale systems shows that variability in student performance across tasks can also be significant (Baron, 1991).