Recent Innovative Assessments
Since the early 1990s, there have been many small-scale experiments to implement assessments that were in some sense innovative. Some have not been sustainable—for a range of reasons—but others are ongoing. What made these programs innovative? What can be learned from them? Brian Stecher and Laura Hamilton provided an overview of both current innovations and those that have not continued, and a panel of people connected with the programs offered their comments.
Stecher pointed out that the sort of test that is currently typical—multiple choice, paper and pencil—was innovative when it was introduced on a large scale in the early 20th century, but is now precisely the sort that innovators want to replace. So, in that sense, an innovative assessment could be defined simply as one that is not a multiple-choice, paper-and-pencil test. That is, a test might be innovative because it:
incorporates prompts that are more complex than is typical in a printed test, such as hands-on materials, video, or multiple types of materials;
offers different kinds of response options—such as written responses, collections of materials (portfolios), or interactions with a computer—and therefore requires more sophisticated scoring procedures; or
is delivered in an innovative way, usually by computer.
These aspects of the structure of assessments, Stecher suggested, represent a variety of possibilities that are important to evaluate carefully. He noted that several other themes worth exploring across programs, such as the challenges related to technical quality (e.g., reliability, fairness, and validity), were discussed in a previous workshop session (see Chapter 2). Tests with innovative characteristics (like any tests) send signals to educators, students, and parents about the learning that is most valued in the system, and in many cases innovative testing has led to changes in practice. Testing also has costs, including a burden in both time and resources, which are likely to differ across innovative assessments. And testing provokes reactions from stakeholders, particularly politicians.
Performance and other kinds of alternative assessments were popular in the 1990s, when 24 states were using, developing, or exploring one of these approaches (Stecher and Hamilton, 2009). Today, such assessments are much less prevalent. States have moved away from these approaches, primarily for political and budget reasons, but a look at several of the most prominent examples highlights some lessons, Stecher explained. Individuals who had experience with several of the programs added their perspectives.
Vermont was a pioneer in innovative assessment, having implemented a portfolio-based program in writing and mathematics in 1991 (Stecher and Hamilton, 2009). The program was designed both to provide achievement data that would permit comparison of schools and districts and to encourage instructional improvements. Teachers and students in grades 4 and 8 collected work to represent specific accomplishments, and these portfolios were complemented by a paper-and-pencil test.
Early evaluations raised concerns about scoring reliability and the validity of the portfolio as an indicator of school quality (Koretz et al., 1996). After efforts to standardize scoring rubrics and selection criteria, the reliability improved, but evaluators concluded that the scores were not accurate enough to support judgments about school quality.
The research (Koretz et al., 1996) showed that teachers did alter their practice in response to the assessment: for example, they focused more on problem solving in mathematics. Many schools began using portfolios in other subjects as well, because they found them useful. However, some critics observed that not all teachers clearly understood the intended criteria for selecting student work, and others commented that teachers began overemphasizing the specific strategies that were included in the standardized rubrics. Costs were also high: $13 per student just for scoring. The program was
discontinued in the late 1990s, primarily because of concerns about the quality of the scores.
The Kentucky Instructional Results Information System (KIRIS) was closely watched because it was part of a broad-based response to a state supreme court ruling that the education system was unconstitutional (Stecher and Hamilton, 2009). The assessment as a whole covered reading, writing, social science, science, mathematics, arts and humanities, and practical living/vocational studies. The state made significant changes to its schools and accountability system, and it implemented performance assessment in 1992. The program was designed to support school-level accountability; other indicators, such as dropout, attendance, and teacher retention rates, were also part of the accountability system.
Brian Gong described the assessment program, which tested students in grades 4, 8, and 12 using some traditional multiple-choice and short-answer tests, but relying heavily on constructed-response items (none shorter than half a page). KIRIS used matrix sampling to provide school accountability information. Many performance assessments asked students to work both in groups and individually to solve problems and to use manipulatives in hands-on tasks. KIRIS included locally scored portfolios in writing and mathematics.
Evaluations of KIRIS showed that teachers changed their practice in desirable ways, such as focusing greater attention on problem solving, and they generally attributed the changes they made to the influence of open-ended items and the portfolios (Koretz et al., 1996). Despite the increased burden in time and resources, teachers and principals supported the program.
As with the Vermont program, however, evaluators found problems with both reliability and validity. The portfolios were assigned a single score (in the Vermont program there were scores for individual elements), and teachers tended to assign higher scores than the independent raters. In addition, teachers reported that they believed score gains reflected familiarity with the program and test preparation more than general improvement in knowledge and skills. Research supported this belief, finding that teachers tended to emphasize the subjects tested in the grades they taught, at the expense of other subjects. Further support came from the scores of Kentucky students on the National Assessment of Educational Progress (NAEP) and the American College Testing Program (now called the ACT), which did not show growth comparable to that shown on KIRIS (Koretz and Barron, 1998). KIRIS was replaced with a more traditional assessment after only 6 years, in 1998 (though that assessment also included constructed-response items), and the state continued to use portfolios to assess writing until 2009.
The Maryland School Performance Assessment Program (MSPAP), implemented in 1991, assessed reading, writing, language usage, mathematics, science, and social science at grades 3, 5, and 8 (Stecher and Hamilton, 2009). The program was designed to measure school performance and to influence instruction; it used matrix sampling to cover a broad domain and so could not provide individual scores. The entire assessment was performance based, scored by teams of Maryland teachers.
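The matrix-sampling logic behind designs like MSPAP's (and KIRIS's) can be sketched in a few lines of code. The following Python fragment is a toy simulation under assumed numbers; the pool size, block count, and response probability are invented for illustration, not drawn from either program:

```python
import random
import statistics

# Toy matrix-sampling simulation (illustrative numbers only): a 30-item
# pool is split into 5 blocks, each student takes just one block, and
# the school-level estimate still reflects the full item pool.
random.seed(1)
n_items, n_blocks, n_students = 30, 5, 200
blocks = [list(range(b, n_items, n_blocks)) for b in range(n_blocks)]

p_correct = 0.62  # assumed probability a student answers an item correctly

def block_score(block):
    """Fraction of the assigned block answered correctly (simulated)."""
    return statistics.mean(1 if random.random() < p_correct else 0 for _ in block)

# Rotate the blocks across students so every block is used equally often.
scores = [block_score(blocks[s % n_blocks]) for s in range(n_students)]

# Aggregating over all students covers all 30 items, yielding a
# school-level estimate even though no student saw the whole pool.
print(round(statistics.mean(scores), 3))
```

Because each student answers only a fraction of the pool, the burden per student stays small while the aggregate covers the whole domain; the trade-off, as noted above, is that no individual student receives a meaningful score.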
MSPAP did not have any discrete items, Steve Ferrara noted. All the items were contained within tasks organized around themes in the standards; many integrated more than one school subject, and many required group collaboration. The tasks included both short-answer items and complex, multipart response formats. MSPAP included hands-on activities, such as science experiments, and asked students to use calculators, which was controversial at the time.
Technical reviews indicated that the program met reasonable standards for both reliability and validity, although the group projects and a few other elements posed challenges. Evaluations and teacher reports also indicated that MSPAP had a positive influence on instruction. However, some critics questioned the value of the scores for evaluating schools, noting wide score fluctuations. Others objected to the “Maryland learning outcomes” assessed by the MSPAP. The MSPAP was replaced in 2002 by a more traditional assessment that provides individual student scores, a requirement of the No Child Left Behind (NCLB) Act.
The Washington Assessment of Student Learning (WASL), implemented in 1996, assessed learning goals defined by the state legislature: reading; writing; communication; mathematics; social, physical, and life sciences; civics and history; geography; arts; and health and fitness (Stecher and Hamilton, 2009). The assessment used multiple-choice, short-answer, essay, and problem-solving tasks and was supplemented by classroom-based assessments in other subjects. WASL produced individual scores and was used to evaluate schools and districts; it was also expected to have a positive influence on instruction.
Evaluations of WASL found that it met accepted standards for technical quality. The evaluations also found some indications that teachers adapted their practice in positive ways, but controversy over the program's effects complicated its implementation. For example, the decision to use WASL as a high school exit exam was questioned because of low pass rates, and fluctuating scores raised questions about its quality. The WASL was replaced during the 2009-2010 school year with an assessment that uses multiple-choice and short-answer items. However, the state has retained some of the classroom-based assessments.
A workshop participant with experience in Washington, Joe Willhoft, pointed out several factors that affected the program's history. First, the program imposed a large testing burden on teachers and schools: after NCLB was passed, the state was administering eight tests in both elementary and middle schools, with many performance assessment features that were complex and time consuming. Many people had not expected the testing to consume so much time. This initial reaction to the program was compounded when early score gains were followed by much slower progress. The result was frustration for both teachers and administrators.
This frustration, in turn, fueled a growing concern in the business community that state personnel were not managing the program well. Willhoft said that the initial test development contract was very inexpensive, considering the nature of the task, but when the contract was rebid, costs escalated dramatically. Then, as public opinion turned increasingly negative, the policy makers who had initially sponsored the program and worked to build consensus in its favor were leaving office because of the state's term limit law, so there were few political supporters to defend it when it was challenged. This program, too, was replaced with a more traditional one.
The California Learning Assessment System (CLAS), which was implemented in 1993, assessed reading, writing, and mathematics, using performance techniques such as group activities, essays, and portfolios (Stecher and Hamilton, 2009). Some items asked students to reflect on the thinking that led to their answers. Public opposition to the test arose almost immediately, as parents complained that the test was too subjective and even that it invaded students’ privacy by asking about their feelings. Differences of opinion about CLAS led to public debate about larger differences regarding the role assessment should play in the state. Questions also arose about sampling procedures and the objectivity of the scoring. The program was discontinued after only 1 year (Kirst and Mazzeo, 1996).
NAEP Higher-Order Thinking Skills Assessment Pilot
An early, pioneering effort to explore possibilities for testing different sorts of skills than were generally being targeted in standardized tests was conducted by NAEP in 1985 and 1986 (Stecher and Hamilton, 2009). NAEP staff developed approximately 30 tasks that used a variety of formats (paper and pencil, hands-on, computer administered, etc.) to assess such
higher-order mathematics and science skills as classifying, observing and making inferences, formulating hypotheses, interpreting data, designing an experiment, and conducting a complete experiment. NAEP researchers were pleased with the results in many respects, finding that many of the tasks were successful and that conducting hands-on assessments was both feasible and worthwhile. But the pilots were expensive and time consuming, and school administrators found them demanding. These item types were not adopted for the NAEP science assessment after the pilot test.
Lessons from the Past
For Stecher, these examples make clear that the boldest innovations did not survive implementation on a large scale, and he suggested that hindsight reveals several clear explanations. First, many of the programs were implemented too quickly. Had developers and policy makers moved more slowly and spent more time on pilot testing and refinement, it might have been possible to iron out many of the problems with scoring, reporting, reliability, and other complex elements of the assessments. Moreover, he noted that many of the states pushed forward with bold changes without necessarily having a firm scientific foundation for what they wanted to do. At the same time, the costs and the burdens on students and schools were high, which made it difficult to sustain support and resources when questions arose about technical quality. People questioned whether the innovations were worth the cost and effort.
Another factor, Stecher said, is that many states did not adequately take into account the political and other concerns that would affect public approval of the innovative approaches. In retrospect, it seemed that many of the supporters of innovative testing programs had not adequately educated policy makers and the public about the goals for the programs and how they would work. One reason for this lack, however, is that states were not always able to reconcile differences among policy makers and assessment developers regarding the role the assessment was to play. When there was a lack of clarity or agreement about goals, it was difficult to sustain support for the programs when problems arose. A final consideration for many states was the need to comply with NCLB requirements.
Even though many of the early programs did not survive intact, innovative assessment approaches remain in wide use. Laura Hamilton reviewed current examples of three of the most popular innovations: performance assessment, portfolios, and technology-supported assessment.
Essays are widely used in K-12 assessments today, particularly in tests of writing and to supplement multiple-choice items in other subjects (Stecher and Hamilton, 2009). Essays have been incorporated in the SAT (formerly known as the Scholastic Aptitude Test) and other admissions tests and are common in NAEP. They are also common in licensure and certification tests, such as bar examinations.
The K-12 sector is not currently making much use of other kinds of performance assessment, but other sectors in the United States are, as are a number of programs in other countries. One U.S. example is the Collegiate Learning Assessment (CLA), which measures student learning in colleges and universities, is administered on-line, and uses both writing tasks and performance tasks in response to a wide variety of stimuli.
The assessment system in Queensland, Australia, is designed to provide both diagnostic information about individual students and results that can be compared across states and territories. It includes both multiple-choice items and centrally developed performance tasks that can be used at the discretion of local educators and are linked to the curriculum. At the secondary level, the assessment incorporates not only essays but also oral recitations and other performances. Performance tasks are scored locally, which raises concerns about comparability, but school comparisons are not part of the system, so the pressure on that issue is lighter than in the United States. Indeed, Hamilton noted, many aspects of Queensland's system seem to have been developed specifically to avoid problems seen in the U.S. system, such as score inflation and narrowed curricula.
Other programs use hands-on approaches to assess complex domains. For example, the U.S. Medical Licensing Examination (USMLE) has a clinical skills component in which prospective physicians interact with patients who are trained participants. The trained patient presents a standardized set of symptoms so that candidates’ capacity to collect information, perform physical examinations, and communicate their findings to patients and colleagues can be assessed. Hamilton noted that this examination may be the closest of any to offering an assessment that approximates the real-life context for the behavior the assessment is designed to predict—a key goal for performance assessment. Nevertheless, the program has encountered technical challenges, such as limited variability among tasks (the standardized patients constitute the tasks), interrater reliability, and the length of time required (8 hours to complete the assessment).
These examples, Hamilton suggested, indicate the potential for performance assessment, but also the challenges in terms of cost, feasibility, and technical quality. For example, sampling remains a difficult problem in performance assessment: multiple tasks are generally needed to support inferences about a particular construct, but administering enough of them imposes a significant burden on the program.
Another difficulty is the tension between the goal of producing scores that support comparisons across schools or jurisdictions and the goal of using the assessment process itself to benefit teachers and students. The Queensland program and the essay portion of the bar exams administered by states both involve local educators or other local officials in task selection and scoring, which may limit the comparability of scores. When the stakes attached to the results are high, centralized task selection and scoring may be preferred, but at the cost of not involving teachers and of losing some influence on instruction. Hamilton also noted that none of these example programs operates under a constraint like that of NCLB, which requires that multiple consecutive grades be tested every year. Indeed, she suggested, “it would be difficult to adopt any of these approaches in today’s K-12 testing system without making significant changes to state policy surrounding accountability.”
Portfolio-based assessments have much less presence in K-12 testing than they once had, but they are used in other sectors in the United States and in a number of other nations. In the United States, the National Board for Professional Teaching Standards (NBPTS), which identifies accomplished teachers (from among candidates who have been teaching for a minimum of several years), asks candidates to assemble a portfolio of videotaped lessons that represent their teaching skills in particular areas. This portfolio supplements other information, collected through computer-based assessments, and allows evaluators to assess a variety of teaching skills, including so-called soft skills, practices, and attitudes, such as the capacity to reflect on a lesson and learn from experience. The assessment is extremely time-consuming, requiring up to 400 hours of a candidate's time over 12-18 months. Because of the relatively small number of tasks, the program has relatively low reliability estimates, and it has also raised concerns about rater variability. However, it has received high marks for validity because it is seen as capturing important elements of teaching.
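The link between the number of tasks and reliability follows a standard psychometric relationship. As a generic illustration (the numbers below are hypothetical, not NBPTS figures), the Spearman-Brown formula gives the reliability of a composite of $k$ comparable tasks whose average single-task reliability is $\rho$:

$$\rho_k = \frac{k\,\rho}{1 + (k - 1)\,\rho}.$$

With, say, four tasks of single-task reliability 0.4, the composite reaches only about 0.73; doubling to eight tasks raises it to roughly 0.84. This is why assessments built on a small number of rich tasks tend to show lower reliability.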
Computers have long been widely used in assessment, Hamilton explained, although for only a fairly limited range of purposes. For the most part they have been used to make the administration and scoring of traditional multiple-choice and essay testing easier and less expensive. However, recent technological developments have made more innovative applications more feasible, and they have the potential to alter the nature of assessment.
The increasing availability of computers in schools will make it easier to administer computerized-adaptive tests in which items are presented to a candidate on the basis of his or her responses to previous items. Many states had turned their attention away from this technology because NCLB requirements seemed to preclude its use in annual grade-level testing. However, revisions to NCLB appear likely to permit, and perhaps even encourage, the use of adaptive tests, which is already common in licensure and certification contexts.
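The basic logic of adaptive testing can be shown in a short sketch. The following Python fragment is a toy illustration only; the item bank, starting estimate, and update rule are assumptions for demonstration, not any state's or vendor's algorithm. It selects each item to match the examinee's current ability estimate and updates that estimate after each response:

```python
import math

def p_correct(theta, difficulty):
    """Probability of a correct response under a one-parameter (Rasch) model."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

def next_item(theta, bank, administered):
    """Choose the unused item whose difficulty is closest to the current
    ability estimate; under the Rasch model this is the most informative item."""
    unused = [i for i in range(len(bank)) if i not in administered]
    return min(unused, key=lambda i: abs(bank[i] - theta))

def run_cat(bank, respond, n_items=5):
    """Administer n_items adaptively; respond(i) returns True if the
    examinee answers item i correctly."""
    theta, administered = 0.0, set()   # start at an average ability estimate
    for _ in range(n_items):
        i = next_item(theta, bank, administered)
        administered.add(i)
        observed = 1.0 if respond(i) else 0.0
        # Nudge the estimate toward the observed response (a simple
        # stochastic-approximation update; operational programs typically
        # use maximum-likelihood or Bayesian scoring instead).
        theta += 0.7 * (observed - p_correct(theta, bank[i]))
    return theta

# Example: a bank of seven items (difficulties in logits) and a simulated
# examinee who answers items easier than 0.3 logits correctly.
bank = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
print(round(run_cat(bank, respond=lambda i: bank[i] < 0.3), 2))
```

Because each item is chosen near the current estimate, a short adaptive test can locate an examinee's level with fewer items than a fixed form of the same length.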
The use of computerized simulations to allow candidates to interact with people or objects that mirror the world is another promising innovation. This technology allows students to engage in a much wider range of activities than is traditionally possible in an assessment situation, such as performing an experiment that requires the lapse of time (e.g., plant growth). It can also allow administrators to avoid many of the logistical problems of providing materials or equipment by simulating whatever is needed. Such assessments can provide rapid feedback and make it possible to track students’ problem-solving steps and errors. Medical educators have been pioneers in this area, using it as part of the USMLE: the examinee is given a patient scenario and asked to respond by ordering tests or treatments and then asked to react to the patient’s (immediate) response. Minnesota has also used simulations in its K-12 assessments (see Chapter 4).
Automated essay scoring is also beginning to gain acceptance, despite skepticism from the public. Researchers have found high correlations between human scores and automated scores, and both NAEP and the USMLE are considering using this technology. Currently, the most common practice is to combine computer-based scoring with human scoring, an approach that captures some of the savings in time and resources while providing a check on the computer-generated scores. However, some problems remain. Automated scoring systems have been developed with various methodologies, but the software generally evaluates human-scored essays and identifies a set of criteria and weights that can predict the human-assigned scores. The resulting criteria are not the same as those a human would use: for example, essay length correlates with other desirable essay traits, but length itself would not be valued by a human scorer. In addition, the criteria may not have the same results when applied across different groups of students: test developers need to ensure that differences between human-assigned and computer-assigned scores do not systematically favor some subgroups of students over others. Some observers also worry that the constraints of automated scoring might limit the kinds of prompts or tasks that can be used.
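To make the training logic described above concrete, here is a toy sketch in Python. The features, data, and model are illustrative assumptions; operational systems use far richer feature sets and more sophisticated models:

```python
import numpy as np

def featurize(essay):
    """Two crude surface features: essay length and vocabulary variety."""
    words = essay.split()
    return [
        float(len(words)),                     # essay length
        len(set(words)) / max(len(words), 1),  # type-token ratio
    ]

# Hypothetical training essays with human-assigned scores.
training = [
    ("The cat sat on the mat.", 1.0),
    ("Dogs are loyal companions, and they enjoy long walks outdoors.", 2.0),
    ("Although the experiment failed at first, careful revision of the "
     "procedure eventually produced consistent, reproducible results.", 3.0),
]

X = np.array([featurize(text) + [1.0] for text, _ in training])  # 1.0 = intercept
y = np.array([score for _, score in training])

# Least-squares weights that best reproduce the human-assigned scores.
weights, *_ = np.linalg.lstsq(X, y, rcond=None)

def machine_score(essay):
    """Score a new essay with the fitted weights."""
    return float(np.array(featurize(essay) + [1.0]) @ weights)

print(round(machine_score("A short essay that a human has not scored."), 2))
```

A fit like this can lean heavily on surface proxies such as length, which illustrates the concern raised in the text: the criteria that best predict human scores are not the criteria humans themselves apply.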
In Hamilton’s view, technology clearly offers significant potential to improve large-scale assessment. It opens up possibilities for assessing new kinds of constructs and for providing detailed data. It also offers possibilities for more easily assessing students with disabilities and English language learners, and it
can provide an effective means of integrating classroom-based and large-scale assessment.
A few issues are relevant across these technologies, Hamilton noted. If students bring different levels of computer skill to a testing situation, as is likely, the differences may affect their results, a concern supported by some research. Schools are increasingly likely to have the infrastructure needed to administer such tests, but this capacity is still unequally distributed, as are teachers trained to prepare students for these sorts of assessments and to interpret the results accurately. Another issue is that the implications of computer-based approaches for validity and reliability have not been thoroughly evaluated.