DEVELOPING ASSESSMENTS ALIGNED WITH STANDARDS

THE ISSUE IN BRIEF

Assessments aligned with standards are a keystone of the new reform agenda. It might be said that much of the success of standards-based reform hinges on assessments that are not yet perfected or, in some cases, even invented.

There is widespread belief that these assessments should include some type of performance measurement, given the knowledge and skills being addressed in content standards. (For example, it is difficult to test a student's knowledge of—and ability to conduct or participate in—scientific inquiry solely on the basis of multiple-choice items.) Test developers, researchers, and practitioners are already piloting various performance-based formats—portfolios of student work, written essays, observations of student performance, for instance—but many of these assessments are still in the early stages, and their effects, good or bad, are not fully known.

◦ ◦ ◦ ◦

“A lot of the trouble that we've gotten into on assessments is that they've been used for purposes for which they weren't designed.”

Gordon Ambach

◦ ◦ ◦ ◦

The tendency in American education has been to apply relatively sophisticated tests to a variety of functions, including some for which they were never designed, and then to worry later about whether the uses were appropriate and how they affected instruction and students.

The current situation presents an opportunity for the nation to do things differently this time, by analyzing important reliability and validity questions up front, by designing standards and assessments with specific uses in mind, and by applying them cautiously to high-stakes decisions. Although some reform advocates warn that an over-cautious requirement of scientific rigor will delay implementation and progress, workshop participants generally agreed that consensus is growing on the need for careful attention to the scientific and technological bases for assessments in their various applications.

How should states approach the task of developing new assessments? What lessons can be learned from current state programs of performance-based assessment? What are the major technical considerations? How can states ensure that the new assessments are used appropriately and have a positive impact on instruction? Workshop participants weighed these and related questions.



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.
Anticipating Goals 2000

VIEWS FROM THE WORKSHOP

Approaching Assessment Development

The enactment of Goals 2000 and the near-completion of legislation to reauthorize the Elementary and Secondary Education Act (ESEA)4 speak to the need for an immediate and extensive research and development effort. The workshop yielded several suggestions for how a development effort could be approached. Discussants noted that some potential pitfalls could be avoided if standard-setting groups considered assessment issues at the same time they developed content and performance standards: standards would be less likely to be built around unrealistic assumptions about what assessment technology can deliver, and federal and state governments would be less likely to attach high stakes to assessments before they were technically ready—or at least would be more aware of the consequences if they did.

◦ ◦ ◦ ◦

“The evidence is building that innovative assessments can be a powerful tool for reform, but it is unambiguously the case that many of the proponents have egregiously overpromised.”

Daniel Koretz

◦ ◦ ◦ ◦

In developing assessments, states would be well advised to initiate an open dialogue about the broader social and policy implications of assessment, including appropriate test use; appropriate reporting and interpretation of results; impacts on various groups of students by race, ethnicity, gender, and socioeconomic background; effects on instruction; costs and benefits of new assessments; and teacher professional development needs emanating from new standards and related assessment formats. These questions are too important to be decided by default, it was argued, and should not be dropped into the laps of test designers and measurement specialists without a public airing. Participants strongly urged that research on assessment be a continuous process that does not end when new assessments are implemented.
The process should include initial empirical research during the standards and assessment development phase, pilots and demonstrations during the preimplementation phase, and ongoing studies to monitor the implementation of the standards and assessments themselves and provide feedback for continuous revision. These studies—which might be in the form of an annual state report card on standards and assessments—could also identify areas in which additional research is needed.

4   ESEA passed in October 1994 as the “Improving America's Schools Act of 1994.” Workshop participants discussed versions of the bill as they existed in March.

Lessons from Vermont

Research should begin by studying the lessons emerging from existing innovative assessment programs. One such program is Vermont's new assessment system, which emphasizes student portfolios.

The Vermont portfolio program appears to be having powerful and positive effects on instruction, according to Daniel Koretz, such as encouraging mathematics teachers to devote more time to problem solving and motivating teachers who had seemed impervious to change. But these positive effects have come at a steep price of time, stress, and money: teachers reported spending an average of 30 hours per month on portfolios, excluding training (although most say they consider the time a worthwhile burden). And from early accounts, the costs of scoring, training, and other administrative functions are likely to be much higher than the $33 per student estimated by the U.S. General Accounting Office.5

Preliminary evidence from Vermont raises serious questions of reliability, validity, feasibility, and bias that need more attention before portfolio data are applied on a larger scale or for high-stakes decisions, Koretz said. Scores to date have been too unreliable to be used for making comparisons across schools, for example. Efforts to appraise validity have been hindered by a lack of comparable achievement data, and the comparisons made thus far raise doubts about whether validity problems can be overcome. Teachers vary widely in their implementation of the portfolio program, which could threaten the validity of any comparison data. Other problems in Vermont with national implications include difficulty in training large numbers of raters to a level of sufficient accuracy, a lack of standardization of performance tasks, and the limited ability to generalize about student knowledge from a small number of tasks.
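The rater-accuracy problem can be made concrete with a small sketch. The scores below are invented for illustration, and Cohen's kappa is a standard chance-corrected agreement statistic rather than one the workshop prescribed:

```python
# Invented scores for illustration: two raters score ten portfolios
# on a 1-4 rubric. Agreement this weak is the kind of problem that
# made early portfolio scores unusable for school-to-school comparisons.
from collections import Counter

rater_a = [1, 2, 2, 3, 4, 2, 3, 1, 4, 3]
rater_b = [2, 2, 3, 3, 3, 1, 3, 2, 4, 2]
n = len(rater_a)

# Observed proportion of exact agreement.
observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

# Chance agreement: probability both raters give the same score if
# each assigned scores independently at their own marginal rates.
count_a, count_b = Counter(rater_a), Counter(rater_b)
expected = sum(count_a[s] * count_b[s] for s in range(1, 5)) / n ** 2

# Cohen's kappa corrects observed agreement for chance.
kappa = (observed - expected) / (1 - expected)
print(f"exact agreement: {observed:.2f}, kappa: {kappa:.2f}")
```

Here the raters agree on 40 percent of the portfolios, but kappa is only about 0.17 once chance agreement is removed; rater training aims to push both figures much higher before scores are compared across schools.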
The Vermont experience suggests that the twin goals of new assessments—to improve instruction and to yield high-quality comparative data—may not be totally reconcilable. A brief illustration: from an instructional perspective, it makes sense for teachers to vary performance tasks for students of different achievement levels so that lower-achieving students are not discouraged by constant failure; from a measurement perspective, however, it is problematic. Policy makers may have to accept lower levels of reliability as a price for using teacher-developed and scored performance assessments for accountability purposes. Expressed differently, this lesson from Vermont can be summarized in terms of the following tension that needs to be understood by policy makers: comparison across students or schools requires standardization, whereas improved learning for all students may require less standardization and the capacity to accommodate specific learning needs that vary within and across classrooms.6 The Vermont experience affirms the wisdom of having modest expectations, evaluating the planned assessments, and allowing for a long experimentation period, luxuries that may not always be available.

5   Student Testing: Current Extent and Expenditures, with Cost Estimates for a National Examination (GAO/PEMD-93-8, January 13, 1993). Available from the U.S. Government Printing Office, Washington, D.C.

◦ ◦ ◦ ◦

“A major dilemma we face is that the technical tools at our disposal for assessment were created at a time when the field had a different sense of what constitutes knowledge and understanding. Thus, we have at our disposal a wonderful set of technical tools that deal with precisely the wrong questions. We need to develop technical tools that will help us make progress on issues related to the construction of meaningful and reliable standards.”

Alan Schoenfeld

◦ ◦ ◦ ◦

Technical Questions

As indicated by the Vermont experience, a variety of technical issues—not the least of which are reliability and validity—should be the subject of extensive research. One issue needing further study is how to identify the tasks to be included in performance assessment. For example, although it may be easy to conceive of a real-world problem that engages thinking skills, content knowledge, and writing skills, it is more difficult to create an assessment item with these features that also meets reasonable measurement criteria: generalizability, reliability, and comparability. Limited generalizability of performance assessment tasks poses a particularly formidable barrier. Can a small number of items cover a content domain? Does successful performance on one task generalize to success on other tasks?
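One way to see why limited generalizability is so formidable is the Spearman-Brown projection from classical test theory. This is an illustration, not a workshop recommendation, and the task-to-task reliability of 0.25 below is an assumed figure:

```python
# A sketch from classical test theory: the Spearman-Brown formula
# projects the reliability of the mean of k parallel tasks from the
# reliability of a single task. The 0.25 figure is an assumed,
# illustrative value, not an empirical result.

def spearman_brown(r_single: float, k: int) -> float:
    """Reliability of the average of k parallel tasks."""
    return k * r_single / (1 + (k - 1) * r_single)

r_single = 0.25  # assumed task-to-task consistency
for k in (1, 3, 6, 12, 24):
    print(f"{k:2d} tasks -> composite reliability {spearman_brown(r_single, k):.2f}")
```

Under this assumption a single task reveals little (reliability 0.25), and roughly a dozen tasks are needed before the composite reaches 0.80, which is why a handful of performance tasks rarely supports broad claims about a content domain.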
6   Vermont is, of course, not the only state in which tensions have mounted over the twin demands for standardized reporting of individual-level test data and instructionally valuable methods of assessment. The California Learning Assessment System (CLAS), for example, was an innovative program based on performance measures of achievement closely aligned to curriculum frameworks that had been developed over many years. CLAS ran into significant problems that were attributable, at least in part, to the conflicting demands for standardized data that provide a reliable basis for comparisons of individual achievement and assessments that are considered instructionally valuable. This tension was exacerbated by the need to hold down the costs of the performance assessment program by implementing a sampling methodology, which conflicted with demands that all children be included in what had been promoted as instructionally valuable exercises. The workshop discussion did not focus on the California experience; a board bulletin planned for the near future will address some of the salient issues in greater detail.

Still another critical issue is how to mix multiple measures into an integrated assessment system. How can information from performance assessments and more conventional tests be merged into a picture of progress at the student, school, and district levels? How can qualitative judgments be blended with quantitative data? What happens when the information is contradictory? When is matrix sampling appropriate, and when should universal testing be used?

◦ ◦ ◦ ◦

“There is a real danger of jumping to reliance on a set of measures and a technology that is not really there yet—and then we may find that it doesn't work very well, and go back to the things that had been familiar. There is this sense that the new measures are not corruptible; it was the old measures that were corruptible. . . . We have to be careful that the extravagant promises being made around the country right now [for performance assessment] don't sow the seeds for the whole thing falling apart.”

Robert Linn

◦ ◦ ◦ ◦

Reporting of information raises another set of technical questions. Conventional reporting uses a “cut score” approach. Board members questioned, however, whether this approach is compatible with the intent of performance assessment. What is needed is a reporting approach that captures the richness of the performance but is also clear and understandable to students, parents, and the public. One suggestion was to use a “Consumer Reports” approach, with symbols and rankings for different skills and attributes and written comments that provide more detail on performance. Whatever the approach, it is likely to require a substantial public information effort to help parents, the media, and others understand new test scoring and reporting methods.
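The choice between matrix sampling and universal testing can be illustrated with a minimal simulation. All parameters here are invented for the sketch:

```python
# Hypothetical sketch of matrix sampling: a 30-item pool is split into
# three 10-item blocks, and each student answers only one block. Block
# results can be pooled into a school-level estimate, but no student
# has a score on the full pool -- one reason matrix sampling suits
# aggregate reporting rather than individual use.
import random

random.seed(1)  # reproducible simulated scores
ITEM_POOL = list(range(30))
BLOCKS = [ITEM_POOL[i:i + 10] for i in range(0, 30, 10)]

def administer(students: int) -> dict[int, list[int]]:
    """Rotate students through blocks; return block index -> scores."""
    results: dict[int, list[int]] = {b: [] for b in range(len(BLOCKS))}
    for s in range(students):
        block = s % len(BLOCKS)
        # Simulated count of correct answers on the student's 10 items.
        results[block].append(random.randint(4, 10))
    return results

scores = administer(90)
school_estimate = sum(sum(v) for v in scores.values()) / 90
print(f"estimated school mean on a 10-item block: {school_estimate:.2f}")
```

Each student sits a third of the pool, so testing time is cut while the whole domain is still covered at the school level; the trade-off is that no individually reportable score exists, which matters when every child is expected to receive results.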
Other topics for additional technical research include approaches for assessing linguistic minorities; procedures for aggregating results across schools, districts, states, the nation, and even the globe; and interim policies for moving from current testing modes to new methods. The latter issue is particularly important with respect to proposed revisions to testing and evaluation requirements under Title I of the Elementary and Secondary Education Act (see also the discussion in the section on federalism).

Appropriate Use of Assessments

An issue that merits early and full debate is the appropriate and fair use of various types of standards-based assessments. Board members recommended that new assessments be clearly differentiated, perhaps even labeled, as to whether they are

appropriate for diagnosing student progress and needs, monitoring or comparing the progress of teachers, schools, and school systems, governing the application of sanctions or rewards, or determining individual credentialing. It is also important to delineate whether tests are appropriate for individual use, aggregate use, or both. Cautions were raised about the possibility of the “corruptibility” of measures applied to high-stakes decisions.7

Effects on Teaching and Learning

Another critical issue is the effect of standards-based assessments on student learning. Some board members suggested that when tests have meaningful consequences, they influence student efforts to learn, teacher efforts to instruct, and parent efforts to support learning. Others contended that, although students may perform well on an assessment, it is difficult to know whether they have truly learned the underlying construct. Still others felt that when tests are aligned closely with local curriculum and classroom instructional methods and when the performance assessed involves higher-order skills, it does not matter whether one is able to disentangle the performance from the underlying construct or whether a student has been coached to higher levels of performance.

◦ ◦ ◦ ◦

“Instead of thinking about a single national evaluation, we would probably learn a lot more from a series of smaller research studies that would look at specific sectors of the population and try to answer the most important question: What works best for whom [and] under what circumstances?”

Luis Laosa

◦ ◦ ◦ ◦

Related questions for research include whether certain types of assessments are better motivators than others and how new assessments affect learning disparities among various groups of students. Another critical area for research is the effect of new assessments on instruction.
Some board members questioned whether meaningful experiments could be designed to answer these kinds of questions when so many variables impinge on the learning environment. An alternative is an auditing or inspectorate approach that examines whether opportunities to learn

7   Corruptible in this context means that the reliability or validity of the inferences drawn from an assessment is threatened by the behavior of test-takers or administrators of the tests. For example, “teaching to the test” means that teachers focus their lessons so as to raise the chances that their students will answer anticipated test items correctly, which can result in inflated test scores but not necessarily in increased learning of the underlying content or domain from which the test is meant to sample.