Innovative Assessment—Lessons from the Past and Present
Chapter 2 described ideas for moving forward and questions about the challenges for innovative assessments. But since the early 1990s, many programs have pushed past theoretical discussions and small-scale experiments to implement assessments that were in some sense innovative. Some have not been sustainable—for a range of reasons—but others are ongoing. What made these programs innovative? What can be learned from them? Brian Stecher and Laura Hamilton provided an overview of those programs, and a panel of veterans of some of these programs offered their conclusions about them.
The current typical test—multiple choice, paper and pencil—was innovative when it was introduced on a large scale in the early 20th century, Stecher pointed out, but is now precisely the sort that innovators want to replace. So, in that sense, an innovative assessment could be defined simply as one that is not a multiple-choice, paper-and-pencil test. That is, a test might be innovative because it:
incorporates prompts that are more complex than is typical in a printed test, such as hands-on materials, video, or multiple types of materials;
offers different kinds of response options, such as written responses, collections of materials (portfolios), or interactions with a computer—and therefore requires more sophisticated scoring procedures; or
is delivered in an innovative way, usually by computer.
These structural features open up a variety of possibilities that are important to evaluate carefully. Several other themes are worth exploring across programs, such as the challenges related to technical quality (e.g., reliability, fairness, and validity), as discussed in Chapter 2. Tests with innovative characteristics (like any tests) send signals to educators, students, and parents about the learning that is most valued in the system—and in many cases innovative testing has led to changes in practice. Testing also has costs, including a burden in both time and resources, which are likely to differ across innovative assessments. Testing also provokes reactions from stakeholders, particularly politicians.
In 1990 performance and other kinds of alternative assessments were popular in the states, with 24 of them using, developing, or exploring possibilities for applying one of these approaches (Stecher and Hamilton, 2009). Today they are much less prevalent. States have moved away from these approaches, primarily for political and budget reasons, but a look at several of the most prominent examples highlights some lessons, as Brian Stecher explained in a synopsis. Individuals who had experience with several of the programs added their perspectives.
Vermont was a pioneer in innovative assessment, having implemented a portfolio-based program in writing and mathematics in 1991 (Stecher and Hamilton, 2009). The program was designed both to provide achievement data that would permit comparison of schools and districts and to encourage instructional improvements. Teachers and students in grades 4 and 8 collected work to represent specific accomplishments, and these portfolios were complemented by a paper-and-pencil test.
Early evaluations raised concerns about scoring reliability and the validity of the portfolio as an indicator of school quality (Koretz et al., 1996). After efforts to standardize scoring rubrics and criteria for selecting student work, reliability improved, but evaluators concluded that the scores were not accurate enough to support judgments about school quality.
The researchers confirmed that teachers altered their practice: for example, they focused more on problem solving in mathematics. Many schools began using portfolios in several subjects because they found them useful. However, some critics observed that not all teachers clearly understood the intended criteria for selecting student work, and others commented that teachers began overemphasizing the specific strategies included in the standardized rubrics. Costs were high—$13 per student just for scoring. The program was discontinued in the late 1990s, primarily because of concerns about the technical quality of the scores.
The Kentucky Instructional Results Information System (KIRIS) was closely watched because it was part of a broad-based response to a state Supreme Court ruling that the education system was unconstitutional (Stecher and Hamilton, 2009). The assessment as a whole covered reading, writing, social science, science, mathematics, arts and humanities, and practical living/vocational studies. The state made significant changes to its schools and accountability system, and it implemented performance assessment in 1992. The program was designed to support school-level accountability; other indicators, such as dropout, attendance, and teacher retention rates, were also part of the accountability system.
Brian Gong described the assessment program, which tested students in grades 4, 8, and 12 using some traditional multiple-choice and short-answer tests, but relied heavily on constructed-response items (none shorter than half a page). KIRIS used matrix sampling to provide school accountability information. Many performance assessments asked students to work both in groups and individually to solve problems and to use manipulatives in hands-on tasks. KIRIS included locally scored portfolios in writing and mathematics.
Evaluations of KIRIS indicated that teachers changed their practice in desirable ways, such as focusing greater attention on problem solving, and they generally attributed the changes they made to the influence of open-ended items and the portfolios (Koretz et al., 1996). Despite the increased burden in time and resources, teachers and principals supported the program.
As with the Vermont program, however, evaluators found problems with both reliability and validity. The portfolios were assigned a single score (in the Vermont program there were scores for individual elements), and teachers tended to assign higher scores than the independent raters did. Teachers reported that they believed score gains were more attributable to familiarity with the program and test preparation than to general improvement in knowledge and skills. This concern was supported by a research finding that teachers tended to emphasize the subjects tested in the grades they taught, at the expense of other subjects. While KIRIS scores trended upward, scores on the National Assessment of Educational Progress (NAEP) and the American College Testing Program for Kentucky students did not show comparable growth (Koretz and Barron, 1998). KIRIS was replaced with a more traditional assessment in 1998 (which also included constructed-response items), although the state continued to use portfolios to assess writing until 2009.
The Maryland School Performance Assessment Program (MSPAP), also implemented in 1991, assessed reading, writing, language usage, mathematics, science, and social science at grades 3, 5, and 8 (Stecher and Hamilton, 2009).
The program was designed to measure school performance and to influence instruction; it used matrix sampling to cover a broad domain and so could not provide individual scores. The entire assessment was performance based, scored by teams of Maryland teachers.
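Both MSPAP and KIRIS relied on matrix sampling, in which each student answers only a subset of the item pool. The following is a minimal sketch of the idea, assuming hypothetical item and student names; operational programs use carefully balanced block designs rather than this simple round-robin assignment:

```python
import random
from collections import defaultdict

def matrix_sample(items, students, n_blocks):
    """Split an item pool into blocks and assign each student one block.

    Every student answers only a fraction of the items, but across a
    school the full pool is covered, so school-level estimates are
    possible even though no individual answers enough items for a
    reliable personal score.
    """
    random.shuffle(items)
    blocks = [items[i::n_blocks] for i in range(n_blocks)]
    return {s: blocks[i % n_blocks] for i, s in enumerate(students)}

def item_coverage(assignments):
    """Count how many students saw each item (a basic coverage check)."""
    counts = defaultdict(int)
    for block in assignments.values():
        for item in block:
            counts[item] += 1
    return counts

# Hypothetical pool of 12 tasks spread across 30 students in 4 blocks.
items = [f"task-{k}" for k in range(12)]
students = [f"student-{k}" for k in range(30)]
assignments = matrix_sample(items, students, n_blocks=4)
coverage = item_coverage(assignments)
print(len(assignments["student-0"]))  # each student sees 12/4 = 3 items
print(sum(coverage.values()))         # 30 students x 3 items = 90 responses
```

The tradeoff the sketch makes visible is the one described above: the whole domain is covered at the school level, but no defensible individual score exists, which is why MSPAP could not report one.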
There were no discrete items, Steve Ferrara noted. All the items were contained within tasks organized around themes in the standards; many integrated more than one school subject, and many required group collaboration. The tasks included both short-answer items and complex, multipart response formats. MSPAP included hands-on activities, such as science experiments, and asked students to use calculators, which was controversial at the time. Technical reviews indicated that the program met reasonable standards for both reliability and validity, although the group projects and a few other elements posed challenges. Evaluations and teacher reports also indicated that MSPAP had a positive influence on instruction. However, some critics questioned the value of the scores for evaluating schools, noting wide score fluctuations. Others objected to the “Maryland learning outcomes” assessed by the MSPAP. The MSPAP was replaced in 2002 by a more traditional assessment that provides individual student scores (which was a requirement of the No Child Left Behind [NCLB] Act).
The Washington Assessment of Student Learning (WASL), which was implemented beginning in 1996, assessed learning goals defined by the state legislature: reading; writing; communication; mathematics; social, physical, and life sciences; civics and history; geography; arts; and health and fitness (Stecher and Hamilton, 2009). The assessment used multiple-choice, short-answer, essay, and problem-solving tasks and was supplemented by classroom-based assessments in other subjects. WASL produced individual scores and was used to evaluate schools and districts; it was also expected to have a positive influence on instruction.
Evaluations of WASL found that it met accepted standards for technical quality. They also found some indications that teachers adapted their practice in positive ways, but controversy over its effects complicated its implementation. For example, the decision to use WASL as a high school exit exam was questioned because of low pass rates, and fluctuating scores raised questions about its quality. The WASL was replaced during the 2009-2010 school year with an assessment that uses multiple-choice and short-answer items. However, the state has retained some of the classroom-based assessments.
A participant with experience in Washington pointed out several factors that affected the program’s history. First, the program imposed a large testing burden on teachers and schools. After NCLB was passed, the state was administering eight tests in both elementary and middle schools, with many
performance assessment features that were complex and time consuming. Many people were caught off guard by the amount of time the testing consumed. This initial reaction to the program was compounded when early score gains were followed by much slower progress. The result was frustrating for both teachers and administrators.
This frustration, in turn, fueled a growing concern in the business community that state personnel were not managing the program well. The participant said that the initial test development contract was very inexpensive, considering the nature of the task, and when the contract was rebid costs escalated dramatically. And then, as public opinion turned increasingly negative about the program, the policy makers who had initially sponsored it and worked to build consensus in its favor left office, because of the state’s term limit law, so there were few powerful supporters to defend the program when it was challenged. This program was also replaced with a more traditional one.
The California Learning Assessment System (CLAS), which was implemented beginning in 1993, assessed reading, writing, and mathematics, using performance techniques such as group activities, essays, and portfolios (Stecher and Hamilton, 2009). Some items asked students to reflect on the thinking that led to their answers. Public opposition to the test arose almost immediately, as parents complained that the test was too subjective and even that it invaded students’ privacy by asking about their feelings. Differences of opinion about CLAS led to public debate of larger differences regarding the role assessment should play in the state. Questions also arose about sampling procedures and the objectivity of the scoring. The program was discontinued after only 1 year (Kirst and Mazzeo, 1996).
NAEP Higher-Order Thinking Skills Assessment Pilot
An early pioneering effort to explore possibilities for testing different sorts of skills than were generally being targeted in standardized tests was conducted by NAEP in 1985 and 1986 (Stecher and Hamilton, 2009). NAEP staff developed approximately 30 tasks that used a variety of formats (paper and pencil, hands-on, computer administered, etc.) to assess such higher-order mathematics and science skills as classifying, observing and making inferences, formulating hypotheses, interpreting data, designing an experiment, and conducting a complete experiment. NAEP researchers were pleased with the results in many respects, finding that many of the tasks were successful and that conducting hands-on assessments was both feasible and worthwhile. But the pilots were expensive and took a lot of time, and school administrators found them
demanding. These item types were not adopted for the NAEP science assessment after the pilot test.
Lessons from the Past
For Stecher, these examples make clear that the boldest innovations did not survive implementation on a large scale, and he suggested that hindsight reveals several clear explanations. First, he suggested that many of the programs were implemented too quickly. Had developers and policy makers moved more slowly and spent longer on pilot testing and refining, it might have been possible to iron out many of the problems with scoring, reporting, reliability, and so forth.
Similarly, many of the states pushed forward with bold changes without necessarily having a firm scientific foundation for what they wanted to do. At the same time, the costs and the burdens on students and schools were high, which made it difficult to sustain support and resources when questions arose about technical quality. People questioned whether the innovations were worth the cost and effort.
Another factor, Stecher said, is that many states did not adequately take into account the political and other concerns that would affect public approval of the innovative approaches. In retrospect, it seemed that many had not done enough to educate policy makers and the public about the goals for the program and how it would work. One reason for this lack, however, is that states were not always able to reconcile differences among policy makers and assessment developers regarding the role the assessment was to play. When there was a lack of clarity or agreement about goals, it was difficult to sustain support for the programs when problems arose. And a final consideration for many states was the need to comply with NCLB requirements.
Even though many of the early programs did not survive intact, innovative assessment approaches remain in wide use. Laura Hamilton reviewed current examples of three of the most popular innovations: performance assessment, portfolios, and technology-supported assessment.
Essays are widely used in K-12 assessments today, particularly in tests of writing and to supplement multiple-choice items in other subjects (Stecher and Hamilton, 2009). Essays have been incorporated into the SAT and other admissions tests and are common in NAEP. They are also common in licensure and certification tests, such as bar examinations.
The K-12 sector is not currently making much use of other kinds of performance assessment, but other sectors in the United States are, as are a number of programs in other countries. One U.S. example is the Collegiate Learning Assessment (CLA), which measures student learning in colleges and universities, is administered on-line, and uses both writing tasks and performance tasks in response to a wide variety of stimuli.
The assessment system in Queensland, Australia, is designed to provide both diagnostic information about individual students and results that can be compared across states and territories. It includes both multiple-choice items and centrally developed performance tasks that can be used at the discretion of local educators and are linked to the curriculum. At the secondary level, the assessment incorporates not only essays, but also oral recitations and other performances. Performance tasks are scored locally, which raises concerns about comparability, but school comparisons are not part of the system, so the pressure is not as heavy on that issue as in the United States. Indeed, Hamilton noted, many aspects of Queensland’s system seem to have been developed specifically to avoid the problems associated with the U.S. system, such as score inflation and narrowed curricula.
Other programs use hands-on approaches to assess complex domains. For example, the U.S. Medical Licensing Examination (USMLE) has a clinical skills component in which prospective physicians interact with patients who are trained participants. The trained patient presents a standardized set of symptoms so that candidates’ capacity to collect information, perform physical examinations, and communicate their findings to patients and colleagues can be assessed. Hamilton noted that this examination may be the closest of any to offering an assessment that approximates the real-life context for the behavior the assessment is designed to predict—a key goal for performance assessment. Nevertheless, the program has encountered technical challenges, such as limited variability among tasks (the standardized patients constitute the tasks), interrater reliability, and the length of time required (8 hours) to complete the assessment.
These examples, Hamilton noted, illustrate the potential that performance assessment offers, but also demonstrate the challenges in terms of cost, feasibility, and technical quality. For example, sampling remains a difficult problem in performance assessment. Multiple tasks are generally needed to support inferences about a particular construct, but including many tasks can pose a significant burden.
Another difficulty is the tension between the goal of producing scores that support comparisons across schools or jurisdictions and the goal of using the assessment process itself to benefit teachers and students. The Queensland program and the essay portion of the bar exams administered by states both involve local educators or other local officials in task selection and scoring, and this may limit the comparability of scores. When the stakes attached to the
results are high, centralized task selection and scoring may be preferred, but at a cost in terms of involving teachers and affecting instruction. Hamilton also noted that none of the examples operated with a constraint such as the NCLB requirement that multiple consecutive grades be tested every year. Indeed, she judged that, “it would be difficult to adopt any of these approaches in today’s K-12 testing system without making significant changes to state policy surrounding accountability.”
Portfolio-based assessments have much less presence in K-12 testing than they once had, but they are used in other sectors in the United States and in a number of other nations. In the United States, the National Board for Professional Teaching Standards (NBPTS), which identifies accomplished teachers (from among candidates who have been teaching for a minimum of several years), asks candidates to assemble a portfolio of information that represents their teaching skills in particular areas, including videotapes of their lessons. This portfolio supplements other information, collected through computer-based assessments, and allows evaluators to assess a variety of teaching skills, including so-called soft skills, practices, and attitudes, such as the capacity to reflect on a lesson and learn from experience. The assessment is time consuming, requiring up to 400 hours of candidates’ time over 12-18 months. Because of the low number of tasks, the program reports relatively low reliability estimates, and it has also raised concerns about rater variability. However, it has received high marks for validity because it is seen as capturing important elements of teaching.
Computers have long been widely used in assessment, although, Hamilton explained, for only a fairly limited range of purposes. For the most part computers have been used to make the administration and scoring of traditional multiple-choice and essay testing easier and less expensive. However, recent technological developments have made more innovative applications more feasible, and they have the potential to alter the nature of assessment.
First, the increasing availability of computers in schools will make it easier to administer computerized-adaptive tests in which items are presented to a candidate on the basis of his or her responses to previous items. Many states had turned their attention away from this technology because NCLB requirements seemed to preclude its use in annual grade-level testing. However, revisions appear likely to permit, and perhaps even encourage, the use of adaptive tests, which is already common in licensure and certification testing.
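The adaptive idea can be illustrated with a toy up/down procedure. Operational adaptive tests use item response theory to select items and estimate ability, so the staircase rule, item bank, and examinee below are simplified assumptions for illustration only:

```python
def adaptive_test(item_bank, answers_correctly, n_items=5, start=0.0):
    """Simplified staircase sketch of a computerized-adaptive test.

    Toy rule: choose the unused item whose difficulty is closest to the
    current ability estimate, then move the estimate up after a correct
    answer and down after an incorrect one, with a shrinking step size.
    """
    estimate, step = start, 1.0
    remaining = dict(item_bank)  # item id -> difficulty
    administered = []
    for _ in range(min(n_items, len(remaining))):
        # Select the item closest in difficulty to the current estimate.
        item = min(remaining, key=lambda i: abs(remaining[i] - estimate))
        difficulty = remaining.pop(item)
        correct = answers_correctly(item, difficulty)
        estimate += step if correct else -step
        step *= 0.5  # shrink steps as evidence accumulates
        administered.append((item, difficulty, correct))
    return estimate, administered

# A hypothetical examinee who answers correctly on items at or below
# difficulty 0.8 and misses harder ones; the estimate homes in near 0.8.
bank = {"q1": -2.0, "q2": -1.0, "q3": 0.0, "q4": 1.0, "q5": 2.0}
est, log = adaptive_test(bank, lambda item, d: d <= 0.8)
print(est)  # → 0.6875
```

Each response halves the remaining uncertainty, which is why adaptive tests can reach a stable estimate with far fewer items than a fixed-form test.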
The use of computerized simulations to allow candidates to interact with
people or objects that mirror the real world is another promising innovation. This technology allows students to engage in a much wider range of activities than is traditionally possible in an assessment situation, such as performing an experiment that requires the lapse of time (e.g., plant growth). It can also allow administrators to avoid many of the logistical problems of providing materials or equipment by simulating whatever is needed. Such assessments can provide rapid feedback and make it possible to track students’ problem-solving steps and errors. Medical educators have been pioneers in this area, using it as part of the USMLE: the examinee is given a patient scenario and asked to respond by ordering tests or treatments and then asked to react to the patient’s (immediate) response. Minnesota has also used simulations in its K-12 assessments (discussed in Chapter 4).
Automated essay scoring is also beginning to gain acceptance, despite skepticism from the public. Researchers have found high correlations between human scores and automated scores, and both NAEP and the USMLE are considering using this technology. Moreover, the most common current practice is for computer-based scoring to be combined with human-based scoring. This approach takes advantage of the savings in time and resources while providing a check on the computer-generated scores. However, some problems remain. Automated scoring systems have been developed with various methodologies, but the software generally evaluates human-scored essays and identifies a set of criteria and weights that can predict the human-assigned scores. The resulting criteria are not the same as those a human would use (e.g., essay length, which correlates with other desirable essay traits, would not itself be valued by a human scorer), and may not have the same results when applied across different groups of students. That is, test developers need to ensure that differences between human rater scores and scores assigned by computers do not systematically favor some groups of students over others. Some observers also worry that the constraints of automated scoring might limit the kinds of prompts or tasks that can be used.
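The training approach described above—finding feature weights that predict human-assigned scores—can be sketched with a one-feature least-squares fit. The word-count feature and the scores below are invented for illustration and deliberately echo the essay-length caveat; real scoring engines use many features and proprietary models:

```python
def simple_regression(xs, ys):
    """Least-squares fit y ≈ a + b*x, computed in closed form."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

# Illustrative training set: (word count, human score) pairs.  Essay
# length correlates with human scores even though no human rater values
# length for its own sake -- exactly the criterion mismatch noted above.
essays = [(120, 2), (250, 3), (400, 4), (520, 5), (180, 2), (340, 4)]
xs = [words for words, _ in essays]
ys = [score for _, score in essays]
a, b = simple_regression(xs, ys)

def machine_score(word_count):
    """Predict a human-like score from a single surface feature."""
    return round(a + b * word_count)

print(machine_score(400))  # → 4 (matches the human score on this essay)
```

The sketch also makes the fairness concern concrete: if one group of students systematically writes shorter essays, a length-correlated model would penalize them in a way a human rater would not, which is why developers must check machine-human score gaps across groups.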
In Hamilton’s view, technology clearly offers significant potential to improve large-scale assessment. It opens up possibilities for assessing new kinds of constructs and for providing detailed data. It also offers possibilities for more easily assessing students with disabilities and English-language learners, and it can provide an effective means of integrating classroom-based and large-scale assessment.
A few issues are relevant across these technologies, Hamilton noted. If students bring different levels of skill with computers to a testing situation, as is likely, the differences may affect their results—an outcome supported by some research. Schools are increasingly likely to have the necessary infrastructure to administer such tests, but this capacity is still unequally distributed. Teachers trained to prepare students for these sorts of assessments and to interpret the results accurately are also not equally distributed among schools.
Another issue is that the implications of computer-based approaches for validity and reliability have not been thoroughly evaluated.
Looking to the Future
The challenge of developing innovative assessments that are high in quality and cost-effective has not yet been fully resolved, in Hamilton’s view. “Recent history suggests that the less you constrain prompts and responses, the more technically and logistically difficult it can be to obtain high-quality results,” she observed. Yet past and current work has explored ways to measure important constructs that were not previously accessible, and the sheer number and diversity of innovations has been “impressive and likely to shape future test development in significant ways,” she said. Work in science has been at the forefront, in part because problem solving and inquiry are important components of the domain but have been difficult to assess. She suggested that many pioneering efforts in science assessments are likely to find applications in other subjects, over time. This is an important development because the current policy debate has emphasized the role that data should play in decision making at all levels of the education system, from determining teacher and principal pay to informing day-to-day instructional decisions. Assessments that provide deeper and richer information will better meet the needs of students, educators, and policy makers. The possibility of broadening the scope of what can be assessed for accountability purposes is also likely to reinforce other kinds of reform efforts.
Hamilton also pointed out the tradeoffs inherent in any proposed use of large-scale assessment data. Test-score data are playing an increasingly prominent role in policy discussions because of their integral role in statistical analyses that can be used to support inferences about teachers’ performance and other accountability questions. Growth, or value-added, models (designed to isolate the effects of teachers from other possible influences on achievement) rely on annual testing of consecutive grades—a need that may mean significant constraints on the sorts of innovative assessments that can be used. Using a combination of traditional and innovative assessments may provide a suitable tradeoff, Hamilton said. She also called for improved integration between classroom and large-scale assessments. “No single assessment is likely to serve the needs of a large-scale program and classroom-level decision making equally well,” she argued. A coordinated system that includes a variety of assessment types to address the needs of different user groups might be the wisest solution.