5

Setting Reasonable and Useful Performance Standards

Summary Conclusion 5. Standards-based reporting is intended to be useful in communicating student results, but the current process for setting NAEP achievement levels is fundamentally flawed.

Summary Recommendation 5. The current process for setting achievement levels should be replaced. New models for setting achievement levels should be developed in which the judgmental process and data are made clearer to NAEP's users.

INTRODUCTION

The current NAEP authorizing legislation, the Improving America's Schools Act (P.L. 103-382), states that "The National Assessment Governing Board … shall develop appropriate student performance levels for each age and grade in each subject area to be tested under the National Assessment. …" The National Assessment Governing Board (NAGB) first began its groundbreaking work on the development of performance standards for student achievement in 1990. Since that time, results from most of the main NAEP assessments have been reported not only in descriptive terms—summary scores that reflect what students know and can do in NAEP's subject areas—but in evaluative terms—percentages of students that reach specific levels of performance defined by what students should know and be able to do. In keeping with its historic commitment to reporting results on metrics understandable to policy makers and the public, NAGB has used these performance standards, more commonly known as NAEP



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.





achievement levels, to chart the progress of the nation's students toward high academic achievement. Reporting results in relation to performance standards is a mechanism by which NAEP currently fulfills the evaluative needs of its users—their need to understand whether student achievement, as presented in descriptive results, is "good enough."

In this chapter, we begin by providing an overview of NAEP's performance standards and the achievement-level-setting process as it was conducted through the 1996 main NAEP assessments. We then summarize the major findings of previous evaluations and research efforts that have examined this process and present a detailed accounting and evaluation of the achievement-level-setting process as it was applied to the 1996 NAEP science assessment. We follow with a discussion of the committee's major conclusions regarding performance standards and the achievement-level-setting process, and then present recommendations that lay out constructive steps by which NAEP can improve how it fulfills this critical evaluative function.

NAEP PERFORMANCE STANDARDS AND THE ACHIEVEMENT-LEVEL-SETTING PROCESS

Goals of Standards-Based Reporting

As described earlier, in the 1970s and early 1980s, NAEP reports were built around the assessment materials themselves; by displaying individual assessment items and associated student performance data, initial reports allowed NAEP users to review the types of tasks students could and could not do. Since the implementation of the first redesign of NAEP in 1984, data on item responses have been summarized across items to provide a picture of overall performance for the nation and for key demographic subgroups.
Group (or subgroup) performance has been reported on a 300-, 400-, or 500-point scale and, until recently, has been accompanied by descriptions of the knowledge and skills typical of performance at given scaled score levels. Current NAEP Report Cards for both main NAEP and trend NAEP continue the convention of reporting overall performance as a summary scale score (Pellegrino et al., 1998).

NAGB's recent, congressionally mandated work on performance standards for NAEP has added fundamentally new data to the reporting of main NAEP results. NAGB has established policy definitions for three levels of student achievement—basic, proficient, and advanced (Reese et al., 1997:8):

Basic: partial mastery of prerequisite knowledge and skills that are fundamental for proficient work at each grade.

Proficient: solid academic performance for each grade assessed. Students reaching this level have demonstrated competence over challenging subject

matter, including subject-matter knowledge, application of such knowledge to real-world situations, and analytical skills appropriate to the subject matter.

Advanced: superior performance.

This innovative system for standards-based reporting allows information on what students know and can do to be compared with consensus judgments about what students at each grade level should know and be able to do. Thus, in addition to providing scale scores that portray the overall performance of groups of students, standards-based reporting provides percentages of the groups of students that are at or above the basic, proficient, and advanced performance levels.

The NAEP approach to standards-based reporting offers many potential benefits:

Aiding communication. Many, including NAGB and Congress, contend that standards-based reporting metrics hold more meaning for policy makers and other NAEP users than do reports on the current, arbitrary 300-, 400-, or 500-point reporting scales. Proponents believe that standards-based reporting facilitates communication and understanding of achievement results, stimulates public discourse, and serves to generate support for education. The current achievement levels are rigorous and allow policy makers to talk about goals for increasing the numbers of students performing at high levels. Performance standards serve an important evaluative function for NAEP's users. During the last decade, many state and commercial testing programs have adopted standards-based reporting metrics, and many educators, policy makers, and parents have come to expect reports that state whether observed performance levels are "good enough." It would be very difficult for NAEP to recuse itself from the current movement toward standards-based reporting.

Providing detailed descriptions of prerequisite skills and knowledge.
Proponents also contend that educators, curriculum developers, and other subject-area specialists will benefit from having descriptions of what it means to be basic or proficient or advanced in a discipline to help focus curriculum and instruction in key areas. However, even more detailed descriptions may be required if educators and curriculum experts are to get the most from NAEP results.

Promoting improved performance. Another reason for developing and reporting challenging standards for achievement is to encourage teachers to teach and students to learn to high levels. NAGB contends that achievement levels will prompt America's progress toward high academic attainment. To date, however, there is a paucity of evidence to indicate whether this is the case.

Although standards-based reporting offers much of potential value, there are also possible negative consequences. The public may be misled if they infer a different meaning from the achievement-level descriptions than is intended.

(For example, for performance at the advanced level, the public and policy makers could infer a meaning based on other uses of the label "advanced," such as advanced placement, that implies a different standard. That is, reporting that 10 percent of grade 12 students are performing at an "advanced" level on NAEP does not bear any relation to the percentage of students performing successfully in advanced placement courses, although we have noted instances in which this inference has been drawn.) In addition, the public may misread the degree of consensus that actually exists about the performance standards and thus have undue confidence in the meaning of the results. Similarly, audiences for NAEP reports may not understand the judgmental basis underlying the standards. All of these false impressions could lead the public and policy makers to erroneous conclusions about the status and progress of education in this country.

The Achievement-Level-Setting Process

During the development of frameworks for each of the main NAEP subject areas, NAGB's policy definitions of achievement levels are applied, resulting in more detailed subject-specific descriptions of performance expectations for each of the three achievement levels; these are known as the "preliminary achievement-level descriptions." As an integral part of the framework, these descriptions are intended to provide guidance for the development of items and tasks for the assessment. After the administration of the assessment, these performance standards are applied to the assessment results in a process known as achievement-level setting, the outcome of which is the reporting of NAEP results in terms of percentages of students performing at basic, proficient, and advanced achievement levels.
Once achievement levels are set for a NAEP subject, they stay fixed for multiple administrations of the assessment, providing a mechanism to observe changes in these percentages over time and presumably a more policy-relevant reporting metric. When a new NAEP assessment based on a new or highly revised framework is developed and administered, new achievement levels are set.

In the NAEP achievement-level-setting process through 1996, NAGB employed the most prevalent approach to standard setting currently in use. Using the Angoff approach and its variants, panels of raters are convened, trained on the knowledge and skills of examinees at different levels, and, in the case of NAEP, asked to refine the preliminary achievement-level descriptions and then estimate the probability that a hypothetical student at the boundary of a given achievement level will get an item correct. Thus, for each multiple-choice item on the assessment, panelists estimate three probabilities: (1) the probability that a student whose performance is on the border between basic and below basic will correctly answer the item, (2) the probability that a student whose performance is on the border between proficient and basic will correctly answer the item, and (3) the probability that a student whose performance is on the border

between advanced and proficient will correctly answer the item. For constructed-response items with multiple possible score levels, panelists are asked to estimate mean item scores for students at each of these same three boundaries. Item judgments are averaged across items and panelists to arrive at cutscores that distinguish the levels of performance. (For a detailed accounting of this methodology, see National Academy of Education, 1993a; Cohen et al., 1997; and Reckase, 1998.)

Through 1997, NAGB has set and reported achievement levels in mathematics, reading, history, geography, and, most recently, science. (See Linn, 1998, for a brief review of NAEP's achievement-level-setting efforts from 1990 through 1996.) Although the impact of reporting by achievement levels is not yet clear, many evaluators and researchers have been critical of the process and the results. Key findings from past evaluations of NAEP's achievement levels and the achievement-level-setting process are described next.

SELECTED FINDINGS FROM PAST NAEP EVALUATIONS AND RESEARCH

NAGB's achievement-level-setting procedures and results have been the focus of considerable review: Linn et al. (1991), for the NAEP Technical Review Panel; Stufflebeam et al. (1991), under contract to NAGB; the U.S. General Accounting Office (1993); and the National Academy of Education (1992, 1993a, 1993b, 1996). Collectively, these reviewers agreed that:

The judgment task posed to raters is too difficult and confusing. In the application of the Angoff procedure to the NAEP achievement-level-setting context, raters are asked to estimate the probability that a hypothetical student at the boundary of a given achievement level will get an item correct.
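The aggregation of such item-level probability judgments into a cutscore, described earlier in this section, can be sketched in a few lines. This is an illustrative simplification only: the panelist count, item count, and rating values below are hypothetical, and the sketch stops at a proportion-correct cutscore, omitting the replicate rounds, weighting, and mapping onto the NAEP scale used in the actual procedure.

```python
# Illustrative sketch of Angoff-style cutscore aggregation, NOT the actual
# ACT/NAGB implementation. Each panelist estimates, for every multiple-choice
# item, the probability that a student at a given achievement-level boundary
# (e.g., basic/proficient) answers it correctly; averaging those judgments
# across items and panelists yields a cutscore on the proportion-correct scale.

def angoff_cutscore(ratings):
    """ratings: one list of item probabilities (0.0-1.0) per panelist.
    Returns the mean expected score, i.e., the cutscore as a proportion."""
    panelist_means = [sum(items) / len(items) for items in ratings]
    return sum(panelist_means) / len(panelist_means)

# Three hypothetical panelists rating a four-item assessment at the
# basic/proficient boundary:
ratings = [
    [0.80, 0.65, 0.50, 0.70],  # panelist 1
    [0.75, 0.60, 0.55, 0.65],  # panelist 2
    [0.85, 0.70, 0.45, 0.75],  # panelist 3
]
cut = angoff_cutscore(ratings)
print(round(cut, 4))  # prints 0.6625
```

For constructed-response items, the same averaging is applied to estimated mean item scores rather than probabilities; the critiques summarized below concern the difficulty of producing the judgments themselves, not this arithmetic.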
This requires raters to delineate the ways the student could answer the item, relate these to cognitive processes that students may or may not possess, and operationally link these processes with the categorization of performance at three different levels (Pellegrino et al., 1998). This judgment process represents a nearly impossible cognitive task (National Academy of Education, 1993a, 1996).

There are internal inconsistencies in raters' judgments for different item types. On past NAEP assessments, notable differences were observed in the cutscores set for each achievement level depending on item difficulty, number of score levels specified in the item scoring rubrics, and response format—e.g., multiple choice versus constructed response. Method variance of this kind is problematic because it renders cutscore locations dependent on the mix of item types in the assessment, in addition to rendering questionable the verbal description of the meaning of achievement at each level (National Academy of Education, 1993a, 1996; Linn, 1998).

The NAEP item pools are not adequate to reliably estimate performance

at the advanced levels. Evaluators have contended that, particularly at the advanced level, the item pools do not represent well the types of knowledge and skills that the NAEP achievement-level descriptions (and national subject-area content standards) portray as being required of students demonstrating advanced performance (Stufflebeam et al., 1991; National Academy of Education, 1996).

Appropriate validity evidence for the cutscores is lacking. There has been a lack of correspondence between NAEP achievement-level results and external evidence of student achievement, such as course-taking patterns and data from other assessments (for example, the advanced placement examinations) on which larger proportions of students perform at high levels. Numerous external comparison studies conducted by the National Academy of Education supported the conclusion that NAEP cutscores between the proficient and advanced levels and between the basic and proficient levels are consistently set too high, with the outcome of achievement-level results that do not appear to be reasonable relative to numerous other external comparisons (National Academy of Education, 1993a, 1996; Linn, 1998).

Neither the descriptions of expected student competencies nor the exemplar items are appropriate for describing actual student performance at the achievement levels defined by the cutscores. Discrepancies between the achievement-level descriptions and the locations of the cutscores create a mismatch between what students in the score range defined by the cutscores are said to be able to do and what it is they actually did on the assessment (Linn, 1998).
Also, evaluators have repeatedly concluded that the knowledge and skills assessed by exemplar items do not match up well with the knowledge and skill expectations put forth in the achievement-level descriptions, nor do the exemplars provide a reasonable view of the range of types of performance expected at a given achievement level (National Academy of Education, 1993a, 1993b, 1996). Counterarguments are presented by Hambleton and Bourque (1991), Kane (1995), Mehrens (1995), and Brennan (1998).

Collectively, past evaluators (Linn et al., 1991; Stufflebeam et al., 1991; U.S. General Accounting Office, 1993; National Academy of Education, 1992, 1993a, 1993b, 1996) have concluded that the achievement levels are flawed; they have recommended that the current achievement-level-setting results not be used for NAEP reporting, unless accompanied by clear and strong warnings that the results should be interpreted as suggestive rather than definitive because they are based on a methodology that has been repeatedly questioned in terms of its accuracy and validity (National Academy of Education, 1996:106).

In its final report, the National Academy of Education panel reiterated its position on achievement-level setting. The panel stated (1997:115):

Given the growing importance and popularity of performance standards in reporting assessment results, it is important that the NAEP standards be set in defensible ways. Because we have concerns that the current NAEP performance

standards are flawed, we recommend that the Governing Board and NCES undertake a thorough examination of these standards, taking into consideration the relationship between the purposes for which standards are being set, and the conceptualization and implementation of the assessment itself. In addition, any new standards need to be shown to be reliable and valid for the purposes for which they are set.

1996 SCIENCE ACHIEVEMENT-LEVEL SETTING

The NAEP achievement-level-setting process has evolved over time, in part in response to the evaluations summarized above, although variants of the modified Angoff procedure have remained in place. In accordance with the committee's charge from Congress, we reviewed the processes used to develop achievement-level descriptions and set achievement levels for the 1996 main NAEP science assessment, the most recent achievement-level setting to be completed. Our review of the science achievement-level setting informed our evaluation of the current achievement-level-setting model and, in conjunction with the previous evaluations and research cited above, led us to conclusions about the current standards and process, as well as recommendations for future achievement-level-setting efforts.

Although it would be hard to argue that earlier achievement-level-setting efforts were satisfactory, achievement-level setting for the 1996 science assessment was particularly troubling. The process suffered from many of the same difficulties identified by the National Academy of Education in previous efforts, and the solutions for dealing with these challenges ultimately led to the issuing of a report on science achievement levels that blurred the distinction between "what students can do" and "what students should be able to do" that standards-based reporting is intended to make clear.
To set achievement levels for the 1996 NAEP science assessment, NAGB and its subcontractor for achievement-level setting, American College Testing, Inc. (ACT), used the same general modified Angoff method that was used in other disciplines. The result of this process (as in previous efforts) was that, at all three grade levels, very low percentages of students scored at or above the proficient level and almost no students reached the advanced level; at grade 4, an unusually high percentage (in comparison to other subjects) performed at or above the basic cutscore.

For the first time in NAEP achievement-level setting, a rater group said they were dissatisfied with the process and their results; the grade 8 raters said they were not confident that their group held a common conception of student performance at the different levels. To examine and rectify the grade 8 problem, ACT reconvened the grade 8 raters to discuss their discontent and to reconsider the ratings, cutscores, and achievement-level definitions that they had generated at the initial rating session. In that session, the reconvened raters lowered the cutscores (increasing slightly the percentages of students scoring at or above proficient). It is important to note

that, originally and at the reconvention, the ratings by representatives of the public were more stringent than those of the teacher and nonteacher educator raters. For past subjects, there were only minimal differences between the judgments of teachers, nonteacher educators, and general public representatives. In addition, there were also minimal differences when raters' judgments were aggregated and analyzed based on gender, racial/ethnic status, region, and school district size.

At the same time, ACT explored a number of methods for recalculating the cutscores; however, none of these adjustments corrected the fundamental problem—that some cutscores appeared to be unreasonably high and others too low.

In February 1997, NAGB's achievement-level committee met to consider the original results and the reconvention results and used their own expert judgments—in this case, representing a policy perspective rather than a disciplinary one—to set "reasonable" science cutscores. The committee examined the 1996 results in combination with external comparative information, including grade 8 results from the Third International Mathematics and Science Study (TIMSS), advanced placement examination results from the same cohort of students as the NAEP grade 12 sample, and NAEP achievement-level results in other disciplines. The achievement-level committee recommended resetting seven of the nine cutscores. In an undocumented process, the committee moved four cutscores up, five cutscores down, and left two cutscores as set by the science raters. Based on the resetting of the cutscores, as many as 40 items (from grade-level pools of approximately 190) moved from one achievement level to another (i.e., items that previously had been mapped as "proficient" now mapped as "advanced").
These post hoc adjustments to the cutscores recommended by the raters indicated that NAGB now questioned the method it had relied on previously to justify the setting of high standards in other disciplines (e.g., history and geography), and they confirmed the findings of previous evaluation panels that cutscores derived through the current process lead to unreasonable results. Thus, in the case of science, NAGB's own examination of the external comparative data led it to the same conclusion that multiple evaluation panels had reached: that the results of the achievement-level-setting process were not believable. NAGB then continued to pursue consideration of its own reset cutscores.

In March 1997, the full National Assessment Governing Board adopted the adjusted cutscores as interim results and asked ACT and NAGB staff to develop new achievement-level definitions that would correspond to these new cutscores and to continue examining the results, using then-forthcoming TIMSS results for other grades and other external data.

In April 1997, NAGB staff charged a group of science educators who had been involved with the development of the 1996 NAEP science framework and assessment with developing new achievement-level descriptions to correspond to the adjusted levels. This group examined the new cutscores and the items positioned at those cuts and determined that new achievement-level descriptions

based on those items would be at variance with NAGB's policy definitions; that is, they judged that the knowledge and skills tested by items bounded by the new cutscores were inconsistent with generic descriptions of basic, proficient, and advanced performance and with NAGB's press for high standards. They elected not to author definitions. Given this science panel's conclusions, NAGB's executive committee decided later in April to defer issuing the interim science achievement levels, which had been scheduled to be issued with the NAEP 1996 Science Report Card in early May (O'Sullivan et al., 1997).

In June 1997, NAGB impaneled another group of science educators to examine the full range of items that mapped to each new achievement level, determine the knowledge and skills assessed by these items, and author descriptions based on their observations of the items and student response data. The panel used behavioral anchoring techniques to describe the levels and select exemplar items. The three grade groups successfully completed the tasks and generated new descriptions. They offered little note of any discord between the behavioral anchoring-based descriptions and the policy definitions, although such discord would have been predicted based on the April science panel's conclusion that the items bounded by the new cutscores were no longer consistent with the policy definitions.

It is important to note that the behavioral anchoring-based descriptions differ from both the preliminary achievement-level descriptions and the achievement-level descriptions developed by the original group of raters, in that they do not describe what students should know and be able to do in science at basic, proficient, and advanced levels; instead, they portray what students currently know and can do in science at levels prescribed by the adjusted cutscores approved by NAGB.
Thus, instead of laying out performance standards and then determining what percentages of students met those standards, NAGB determined cutscores that, based on its policy judgment, provided reasonable percentages of students at each of the three levels, and then asked the science educators to use behavioral anchoring techniques to analyze items and describe what students at those set levels could do.

NAGB met in August 1997 to consider these results and make recommendations about science achievement-level reporting. At that meeting, NAGB's achievement-level committee reviewed the data, new definitions, and exemplar items; it considered whether the new levels describe what U.S. students currently know and can do or depict what they should know and be able to do. The committee argued that Congress had charged NAGB to develop "useful" performance levels, and that the adjusted cutscores and resulting definitions better met that end than the descriptions and levels that had been produced through the original achievement-level-setting process. The achievement-level committee recommended that NAGB release the new achievement-level descriptions and the adjusted achievement-level results. NAGB expressed satisfaction with the application of behavioral anchoring methods to replace the original achievement-level descriptions; it approved the

adjusted cutscores, definitions, and exemplar items and recommended that results be reported. The science achievement-level report was issued in October 1997. Recognizing that the achievement-level descriptions developed through the behavioral anchoring no longer represented conceptions of what students should know and be able to do, in the summary report, titled What Do Students Know?, NAGB presented the definitions of basic, proficient, and advanced as "what students know and are able to do" (National Assessment Governing Board, 1997). It is not clear that the significance of this distinction was recognized by the press, public, or other users of NAEP results.

In summary, in the 1996 NAEP science achievement-level-setting effort, instead of reporting achievement results relative to an established standard of performance as in NAEP's previous achievement-level reports, the science report presented results that were based on NAGB's judgment as to what constituted reasonable percentages of students at the three achievement levels. NAGB had rejected the achievement-level-setting process that it had previously used to set achievement levels in other subjects, replacing it with an ad hoc process in which NAGB reset many of the cutscores. However, neither the process nor the judgments used were described in any detail in the report on science achievement levels.

THE COMMITTEE'S EVALUATION

The Value of Standards-Based Reporting in NAEP

Despite the very serious continuing difficulties with the achievement-level-setting process and the blurring of the "can do"/"should be able to do" distinction that occurred in reporting NAEP's achievement-level results for science, the concept of standards-based reporting still appears to have the potential to be a significant improvement in communicating about student achievement to the public and to policy makers.
However, there is not an extensive body of research on the ways in which standards-based information is interpreted and used by the various audiences for NAEP reports. In a study of press reports from the 1994 main NAEP reading assessment and the 1996 main NAEP mathematics assessment, Barron and Koretz (in press) found that achievement levels were the most popular reporting metric, with the most commonly reported statistic being the percentage of students reaching the proficient level. In a similar, earlier analysis of press reports from the 1990 main NAEP mathematics assessment, Koretz and Deibert (1995/1996) found that the achievement-level metrics were used extensively in reporting national and state results, but less often in reporting differences between major subgroups. Much of the additional research that does exist has focused on alternative forms of standards-based reporting (Hambleton, 1997; Koretz and Deibert, 1995/1996; Burstein et al., 1996). One likely reason for the

dearth of detailed research about how NAEP's users interpret and use achievement-level results is that the idea of reporting performance against standards is such an obvious improvement over an abstract and artificial proficiency scale that the perceived need for or value of such research is low.

The NAEP performance standards developed by NAGB represent an extension of the national educational goals first proposed by President George Bush and the state governors in 1989 (Alexander, 1991). In its report on the science achievement levels, NAGB (1997:5) makes it clear that the goal is that "all students should be proficient." Having such a goal provides a clearer basis for assessing progress, since the significance of a 10-point gain on a 0 to 500 proficiency scale is difficult to assess. Viewing improvement as the percentage of students at or above the proficient level of achievement—with a target of 100 percent—provides added meaning in a clear and easy-to-understand metric.

Evidence of the perceived value of NAEP's standards-based reporting is given by the fact that Education Week's report Quality Counts (1998) reported NAEP state-level mathematics and science results entirely in terms of the percentage of students at or above the proficient level of achievement, even though the initial NAEP science Report Card had presented results only on the numeric proficiency scale (O'Sullivan et al., 1997) and even though the report of science achievement results provided achievement-level descriptions of student performance based on what students "can do" rather than what students "should be able to do." State assessment programs also increasingly are taking NAEP's lead in reporting by performance standards.
Another cited benefit of standards-based reporting is the potential impact on curriculum development and instruction; however, there has also been a lack of good research on the impact of the achievement-level descriptions on these areas. Rich, multifaceted descriptions of student knowledge and skills at each achievement level could help teachers, teacher educators, and curriculum developers focus their instruction on areas judged to be most critical to proficient performance. However, because NAEP does not provide student-level or school-level data, it is difficult to conceive of it as a source of information whereby teachers would know where their own classes stand relative to the achievement levels. Furthermore, the lack of systematic diagnostic information related to particular elements in the achievement-level descriptions limits the capacity to identify specific deficiencies at either the state or the national level. Improvements in this area are possible, however; in Chapter 4, we urged NAEP to produce more in-depth interpretations of student performance that can be derived from analyses of student responses across and within items; such interpretations are likely to enhance understanding of the meaning of performance at each of NAEP's achievement levels and may have the potential to provide some basic guidance to educators about areas of strength and weakness at state and national levels. We also recommend an addition to the current reporting of NAEP achievement-level results—the provision of descriptions of what students who are "below basic" can and cannot do.

… "its criterion of reasonableness" (National Assessment Governing Board, 1997:5). This reasonableness criterion is not further discussed, nor is any reference to a further discussion provided. We are concerned that the process by which the science achievement levels were set is not readily replicable, primarily because the criterion used to judge reasonableness and the rules or process used to make adjustments when initial results failed the reasonableness criterion are not well documented. The report mentions TIMSS, advanced placement information, and NAEP results from other disciplines as points of comparison in judging the reasonableness of the proposed science achievement levels. In order to ensure some level of consistency in future efforts, it would be helpful to understand how this other information was used, the criterion for determining how large a discrepancy between proposed achievement-level results and external data would lead to a change in achievement-level cutscores, and how the magnitude and direction of a change in the location of cutscores was decided.

The report of science achievement levels states clearly that the levels are based on the judgment of the National Assessment Governing Board. The judgmental nature of the achievement levels was less clear in earlier reports. The committee recommends that NAGB more explicitly communicate that the achievement levels result from an inherently judgmental process, to avoid any false impression that the achievement levels reflect some deeper scientific truth. The reports also should describe more fully the means by which judgments are made and the criteria applied in determining the reasonableness of the achievement levels that result from these judgments.

The Achievement-Level-Setting Process

Prior reviews, beginning with the Stufflebeam et al.
review (1991), which was commissioned and then rejected by NAGB, and continuing through reviews by Linn et al. (1991), the U.S. General Accounting Office (1993), and the National Academy of Education's Trial State NAEP evaluation panel (1992, 1993a, 1993b, 1996), have all expressed concern with the process and the results of NAEP's achievement-level-setting procedures. After reviewing the process and the results of the achievement-level setting for science, we concur with these past evaluators that NAEP's procedures and results are fundamentally flawed.

Our conclusion that the current procedures are fundamentally flawed is based on three factors. First, the results are not believable. A primary concern is that too few students are judged to be advanced relative to many other common conceptions of advanced performance in a discipline (e.g., advanced placement course work). NAGB itself did not accept that the numbers of students judged to be advanced in science at all three grades were reasonable on the basis of results of the current process. A second reason is that achievement-level-setting results vary significantly depending on the type and difficulty of the items used in the judgment process. Constructed-response items typically result in higher cutscores than those set using multiple-choice items. A similar pattern holds for easier versus more difficult items. A third reason for our conclusion is research suggesting that panelists have difficulty estimating accurately the probability that test items will be answered correctly by students with specified characteristics (Shepard, 1995). Panelists have particular difficulty estimating the probability that an item will be answered correctly by a hypothetical student whose performance is at the borderline of two achievement levels (Impara and Plake, 1998). Even if panelists could judge the relative difficulty of different items, any constant error in estimating p-values will accumulate to a potentially significant bias in the overall sum (Linn and Shepard, 1997). The same concerns apply to the mean estimation procedures used with constructed-response items with multiple possible score levels. These concerns are particularly critical with respect to the advanced level. Students at this level will get most of the items right most of the time. Systematic biases toward overestimating high probabilities and underestimating low probabilities (e.g., a tendency to "round up" from 0.95 to 1.00) will create the bias toward higher achievement levels that has been of great concern in the NAEP achievement-level-setting results.

NAGB's own rejection of the results of the science achievement-level setting and the imposition of its own judgment to set final levels demonstrate the critical need for an alternative paradigm and methods. We recommend that the current model for setting achievement levels be abandoned. A new approach is needed for establishing achievement levels in conjunction with the development of new NAEP frameworks for assessments to be administered in 2003 and later.
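The cumulative effect of the rounding bias just described can be illustrated with a small numerical sketch (all probabilities below are invented, not NAEP data). In an Angoff-style procedure the cutscore is the sum of panelists' estimated probabilities that a borderline student answers each item correctly, so a small, constant upward error on high-probability items accumulates across a long item pool:

```python
# Hypothetical sketch: an Angoff-style cutscore is the sum of a panelist's
# estimated probabilities that a borderline-advanced student answers each
# item correctly. A systematic "round up" on high-probability items (and
# "round down" on low ones) inflates the resulting cutscore.

true_p = [0.95, 0.90, 0.85, 0.80, 0.70, 0.60] * 10  # 60 items (made-up values)

def with_rounding_bias(p):
    """Overestimate high probabilities and underestimate low ones."""
    if p >= 0.85:
        return min(1.0, p + 0.05)  # e.g., 0.95 is "rounded up" toward 1.00
    if p <= 0.15:
        return max(0.0, p - 0.05)
    return p

true_cutscore = sum(true_p)                                   # about 48.0
biased_cutscore = sum(with_rounding_bias(p) for p in true_p)  # about 49.5
print(f"inflation: {biased_cutscore - true_cutscore:.1f} raw-score points")
```

Here a per-item error of at most 0.05 raises the advanced cutscore by about 1.5 raw-score points on a 60-item pool; because the errors are all in the same direction they do not cancel, which is the accumulation concern raised by Linn and Shepard (1997).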
Alternatives should be explored—including those that avoid complex item-level judgments and rest instead on judgments about larger aggregations of student performance data. Although we (and many critics) have pointed to deficiencies in current procedures, no clearly proven alternatives exist. We are not optimistic that substantial improvement will be realized by the modest alternatives currently being considered by NAGB's technical advisers and contractors, most of which represent minor variations on the way item-by-item judgments are collected and processed. However, the contrasting groups method (National Academy of Education, 1993a; McLaughlin et al., 1993) and some newer alternatives, such as the "bookmark" procedure (Lewis et al., 1996), which does not involve averaging item-by-item estimates, may be worthy of investigation for NAEP's future achievement-level-setting efforts.

In Appendix D, we present the initial conceptual framing for a model that (1) relies on the solicitation of judgments about aggregates of student performances, (2) uses comparative data to help ensure the reasonableness of the results, and (3) brings policy makers and educators together to set standards in a setting in which each group can benefit from hearing and understanding the perspectives of the other. We hope that it can stimulate discussion about future achievement-level-setting alternatives.

In the authorizing legislation for NAEP (P.L. 103-382), Congress stated that NAEP's student performance levels shall be used on a developmental basis until the commissioner of NCES determines, as a result of a congressionally authorized evaluation, that such levels are reasonable, valid, and informative to the public. Given the flawed current achievement-level-setting process, attendant concerns about the validity of the current achievement levels, and the lack of proven alternatives, NAEP's current achievement levels should continue to be used on a developmental basis only. If achievement-level results continue to be reported for re-administrations of assessments in which achievement levels have already been set (e.g., the 1998 reading report, the 2000 mathematics report), then the reports should adhere to the following guidelines:

- Strongly and clearly identify the developmental basis of the achievement levels, emphasizing that they should be interpreted and used with caution, given the continuing serious questions about their validity; and
- Focus the content of the reports on the change, from one administration of the assessment to the next, in the percentages of students in each of the categories determined by the existing achievement-level cutscores (below basic, basic, proficient, and advanced), rather than on the percentages in each category in a single year.

Even when the process used to determine cutscores and ascribe meaning to the achievement-level categories is flawed, tracking changes in the percentages of students performing at or above those cutscores (or, in fact, any selected cutscore) can be of use in describing changes in student performance over time (see also Linn, 1998).
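The second guideline, focusing reports on change rather than on single-year levels, amounts to a simple computation. A sketch with invented scores and hypothetical cutscores (none of these numbers are actual NAEP values):

```python
# Hypothetical sketch of change-focused reporting: the percentage of
# students at or above each existing cutscore, compared across two
# administrations. All cutscores and scores are invented for illustration.

cutscores = {"basic": 262, "proficient": 299, "advanced": 333}
scores_y1 = [250, 270, 280, 301, 310, 240, 265, 335, 290, 255]
scores_y2 = [255, 275, 290, 305, 315, 250, 270, 340, 300, 260]

def pct_at_or_above(scores, cut):
    """Percentage of scores at or above a cutscore."""
    return 100.0 * sum(s >= cut for s in scores) / len(scores)

for level, cut in cutscores.items():
    change = pct_at_or_above(scores_y2, cut) - pct_at_or_above(scores_y1, cut)
    print(f"{level}: {change:+.1f} percentage points")
```

Because the same cutscore is applied in both years, the change statistic remains interpretable even when the cutscore's absolute placement is disputed, which is the point made above (see also Linn, 1998).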
Regardless of the specific alternative that is used for future achievement-level setting, three general aspects of the process should be addressed: (1) the role of preliminary achievement-level descriptions in assessment development, (2) the roles of various participants in the achievement-level-setting process, and (3) the use of normative and external comparative data to evaluate the reasonableness of the achievement levels during and after the level-setting process. We next discuss each of these and present related recommendations for future achievement-level-setting efforts.

Role of the Preliminary Achievement-Level Descriptions

The function of preliminary achievement-level descriptions in assessment development for the main NAEP assessments has not been well specified or well documented. The current science assessment was the first NAEP subject-area assessment for which preliminary achievement-level descriptions were developed along with the frameworks and, even so, they were somewhat of an afterthought. (A subset of the science framework steering and planning committees was reconvened after the framework had been completed and given limited time to develop the preliminary achievement-level descriptions that were included in the framework document.) Furthermore, as discussed in Chapter 4, it is not clear to what degree NAEP's final item pools have reflected the knowledge and skills put forth in the preliminary achievement-level descriptions.

Preliminary achievement-level descriptions should guide the development of assessment items and exercises (see also National Academy of Education, 1993a). Because reporting results in terms of achievement levels is a primary goal, the frameworks and assessments must be developed with this in mind. Preliminary achievement-level descriptions should be an integral part of NAEP's frameworks and should play a key role in guiding the development of assessment materials, including scoring rubrics. Furthermore, items and tasks should be written and rubrics defined to address the intended achievement levels. Items and tasks should be developed to maximize information about student achievement at the three critical cutscores and, to the extent that individual items are used as exemplars, they should be closely aligned with the knowledge and skills identified in the achievement-level descriptions. Thus, greater attention should be devoted to the development of preliminary achievement-level descriptions during the framework development process. This effort must involve educators in the subject area who are familiar with levels of student work in the target subjects and grades. After the framework and the preliminary achievement-level descriptions are developed, it is critical to have continued communication between the committee that developed the framework and the descriptions and the groups that have responsibility for developing the assessment (the assessment development committee and NAEP's assessment development subcontractors).
At the very least, members of the framework committee should review assessment materials and provide feedback at an early stage of the development process regarding the degree to which the assessment materials reflect the framework and the preliminary achievement-level descriptions. (The existing NAEP subject-area standing committees should play an important role in ensuring that this review and feedback occurs.) This strategy reiterates one of our major recommendations from Chapter 4—the need for greater coherence across all phases of the framework and assessment development process.

A tighter alignment between the assessment materials and the preliminary achievement-level descriptions is important, and accomplishing this is likely to require that the preliminary achievement-level descriptions be more informative than they are currently. Table 5-1 shows an analysis (rearrangement) of the preliminary and final achievement-level descriptions for eighth-grade science at the proficient level.[1] The preliminary description on the left is quite general and does not appear to be very useful in developing item content or adjusting factors that may affect item difficulty. The final description on the right provides a good deal of information to inform the content of items that may differentiate proficient from basic performance, as well as some information to inform skills (e.g., experimental design, interpretation of graphical information) that may be assessed. It is important to note, however, that although these more prescriptive descriptions would be helpful in future assessment development, there is also a danger that they could be overly limiting. They provide examples of many but not all of the areas in the framework, and it would be a mistake to limit assessment development to just the areas touched on in these descriptions.

[1] As noted previously in this chapter, the final science achievement-level descriptions were unusual in that they were developed inductively from item-level data using behavioral anchoring methods after NAGB had reset the achievement levels. NAGB warns against comparing these descriptions to the preliminary descriptions or descriptions for achievement levels in other subject areas because of this difference in how they were developed. We use them here simply to provide an example of a level of detail that one should include in preliminary achievement-level descriptions if they are to be helpful in guiding assessment development.

TABLE 5-1 Analysis of Preliminary and Final Achievement-Level Descriptions for the Grade 8 Proficient Level

Experiments and Data
  Preliminary:
  1. Collect basic information and apply it to the physical, living, and social environments
  4. Design experiments to answer simple questions involving two variables
  5. Isolate variables
  6. Collect and display data and draw conclusions from them
  Final:
  6. Design plans to solve problems
  2. Design an experiment and have an emerging understanding of variables and controls
  1. Create, interpret, and make predictions from charts, diagrams, and graphs based on information provided to them or from their own investigations
  3. Read and interpret geographic and topographic maps
  17. Are able to develop their own classification system based on physical characteristics

Relationships
  Preliminary:
  2. Link simple ideas in order to understand payoffs and trade-offs
  3. Understand cause-and-effect relationships such as predator/prey and growth/rainfall
  7. Draw relationships between two simple concepts
  8. Begin to understand relationships (such as force and motion and matter and energy)
  Final:
  4. Use and understand models
  5. Partially formulate explanations of their understanding of scientific phenomena

Other Subject-Area Knowledge
  Preliminary:
  9. Begin to understand the laws that apply to living and nonliving matter
  Final (Physical):
  11. Have an emerging understanding of the particulate nature of matter, especially the effect of temperature on states of matter
  12. Know that light and sound travel at different speeds
  13. Can apply their knowledge of force, speed, and motion
  7. Begin to identify forms of energy and describe the role of energy transformations in living and nonliving systems
  8. Have knowledge of organization, gravity, and motions within the solar system
  10. Have some understanding of properties of materials
  Final (Biological):
  15. Understand that organisms reproduce and that characteristics are inherited from previous generations
  16. Understand that organisms are made up of cells and that cells have subcomponents with different functions
  14. Demonstrate a developmental understanding of the flow of energy from the sun through living systems, especially plants
  Final (Earth Science):
  9. Can identify some factors that shape the surface of the Earth
  18. Can list some effects of air and water pollution
  19. Demonstrate knowledge of the advantages and disadvantages of different energy sources in terms of how they affect the environment and the economy

NOTE: Numbers indicate the sequence in which the listed phrases occurred in the actual text of the grade 8 proficient achievement-level descriptions.

Role of Various Participants in the Achievement-Level-Setting Process

NAEP's current achievement-level-setting process is designed to include individuals with a range of perspectives and areas of expertise.
Panelists include teachers and curriculum specialists in the subject area for which achievement levels are being set, as well as members of the public, many of whom apply knowledge of the subject area in their work. The composition of the achievement-level-setting panels has been specified in detail, and the process for identifying participants has been carefully planned, carried out, and documented. In the end, however, NAGB rejected the 1996 science panel's recommendations on the basis of reasonableness, largely because of normative and comparative data that were not available to the panelists.

NAGB has both the authority and the responsibility to make final decisions with respect to NAEP achievement levels. The carefully balanced, bipartisan composition of NAGB should make it well suited to balancing policy, practical, and technical considerations in setting goals for student achievement. It is not clear, however, that NAGB is making the best possible use of the different forms of expertise available to inform its judgments. The many types of individuals selected for the panels bring important knowledge and perspectives to bear on the achievement-level-setting process. Curriculum specialists understand how different areas of achievement relate to curricular objectives; teachers have a deep understanding of students in a given grade and what can reasonably be expected of them; members of the business community and the public provide important input on the relevance of different skills for success later in life.

We recommend that the roles of educators, policy makers, the public, NAGB, and other groups in developing achievement-level descriptions and setting achievement levels be specified more precisely. In particular, the roles of disciplinary specialists and policy makers should be better integrated throughout the achievement-level-setting process. Curriculum specialists and teachers should play a larger role in providing information about how and why achievement-level cutscores are set when final, policy-informed decisions about cutscores are made. In addition, members of NAGB should be involved in the achievement-level-setting discussions among curriculum specialists, teachers, and the public so that they better understand the rationale underlying the panelists' recommended cutscores.
All of these groups, through NAGB, should have a role in establishing and reviewing the process and the resulting achievement levels.

Use of Normative and External Comparative Data

Many experts argue that the data-based and policy consequences of standard-setting results should be known to the achievement-level-setting raters early in their deliberations; thus, normative student performance data and external comparative data should be considered by raters in setting NAEP achievement levels, primarily for use in evaluating the reasonableness of level-setting decisions. To date, NAEP raters have learned about the consequences of their cutscore decisions—that is, the numbers of students scoring at or above the levels that they had just set—at the close of achievement-level setting, not during the determination of the levels. In adjusting the problematic standards for the 1996 NAEP science assessment, however, benchmark data from other assessments (advanced placement examinations, the SAT, TIMSS) played an important role in NAGB's designation of cutscores. Existing internal and external consequence or comparative data should also be available to inform achievement-level-setting panelists' judgments.

In the future, we hope that a broader range of data on consequences will be available to inform achievement-level-setting efforts. During a December 1996 workshop on standard setting that the committee sponsored as part of our evaluation of NAEP, different types of data on consequences were discussed. (Papers presented at the workshop were published in the January 1998 issue of Applied Measurement in Education.) In other arenas in which standards are set, explicit consideration is given to what would happen if standards are or are not met. Military enlistment standards, for example, are set based on the likelihood that individuals at different score levels will be able to successfully complete training (Hanser, 1998). Training standards are set based on the likelihood that an individual will perform adequately on the job. Environmental and nutritional standards are based on estimates of the probability of illness or fatalities at different levels (Goldberg, 1998; Jasanoff, 1998). However, we are far from having a clear consensus on the kinds of educational consequences (e.g., college entrance and success, career success, good citizenship) to which student achievement should be linked. Nonetheless, longitudinal data do exist, and more could be collected over time, allowing achievement scores at different age and grade levels to be related to a variety of consequences. For example, if twelfth-grade performance standards were based, in part, on the likelihood of success in college, then eighth-grade standards could be set on the basis of the likelihood of meeting the twelfth-grade standards, and fourth-grade standards on the basis of the likelihood of meeting the eighth-grade standards.
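The grade-to-grade chaining just described can be sketched with invented longitudinal data. The helper below is hypothetical, not an established NAEP procedure; it sets a lower-grade cutscore at the lowest score whose locally estimated probability of later meeting the higher-grade standard reaches a target:

```python
# Hypothetical sketch: chain an eighth-grade cutscore to a twelfth-grade
# standard using invented longitudinal (g8_score, met_g12_standard) pairs.

data = [(240, 0), (250, 0), (260, 0), (265, 1), (270, 0),
        (275, 1), (280, 1), (285, 1), (290, 1), (300, 1)]

def chained_cutscore(pairs, target=0.5, band=10):
    """Lowest score whose local estimate of P(meet the later standard),
    taken over students scoring within +/- band points, reaches target."""
    for cut in sorted(score for score, _ in pairs):
        nearby = [met for score, met in pairs if abs(score - cut) <= band]
        if nearby and sum(nearby) / len(nearby) >= target:
            return cut
    return None

print(chained_cutscore(data))
```

In practice one would fit a proper model (e.g., logistic regression) on a large sample; the point of the sketch is only that the target probability, the data, and the resulting cutscore are all explicit and auditable.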
At the very least, such information could be corroborative, providing additional checks on the reasonableness of proposed or existing achievement-level standards.

ACHIEVEMENT-LEVEL SETTING IN FUTURE NAEP ASSESSMENTS

Achievement-level setting in NAEP is still very much in a developmental stage. As new models are explored and efforts undertaken in conjunction with assessments based on new or revised subject-area frameworks, the focus should be on standard setting for the large-scale survey assessments in the core subjects of reading, writing, mathematics, and science. In previous chapters of this report, we have recommended the use of multiple assessment methods in NAEP, both for assessing those aspects of the core subject-area frameworks that are not well assessed in a large-scale survey format and for assessing subject areas that are not assessed frequently enough to establish ongoing trend lines. Data obtained through these multiple methods would undoubtedly provide a rich source of information to aid in setting achievement levels but would also add to the complexity of the process; however, in the short term, it is judicious to focus on achievement-level setting using data from the large-scale survey portions of the NAEP reading, mathematics, writing, and science assessments. Thus, in our proposed structure of "new paradigm" NAEP, achievement-level setting and reporting of results would initially be focused on the core NAEP component only. It seems more important to "get it right" in these subject areas, using data from one assessment methodology, than to devote resources to setting achievement levels based on multiple methods or in all of NAEP's subject areas. Eventually, however, we envision an achievement-level-setting process in which all available information that describes student achievement in a subject area, gleaned from across multiple assessment methods, would be used to inform the setting of NAEP's achievement levels.

MAJOR CONCLUSIONS AND RECOMMENDATIONS

Conclusions

Conclusion 5A. Standards-based reporting is intended to be useful in communicating student results to the public and policy makers. However, sufficient research is not yet available to determine how various audiences interpret and use NAEP's achievement-level results.

Conclusion 5B. Standard setting rests on informed judgment, but the complexity of NAEP's current achievement-level-setting procedures can create the misleading impression that level setting is a highly objective process, rather than a judgmental one.

Conclusion 5C. The role of the preliminary achievement-level descriptions in item development is not well specified.

Conclusion 5D. The roles of various participants in the achievement-level-setting process are not well integrated across the stages of the process.

Conclusion 5E. NAEP's current achievement-level-setting procedures remain fundamentally flawed.
The judgment tasks are difficult and confusing; raters' judgments of different item types are internally inconsistent; appropriate validity evidence for the cutscores is lacking; and the process has produced unreasonable results. Furthermore, NAGB rejected as unreasonable the outcomes of the 1996 achievement-level setting for science.

Recommendations

Recommendation 5A. The current process for setting achievement levels should be replaced. New models are needed for establishing achievement levels in conjunction with the development of assessments based on new NAEP frameworks.

Recommendation 5B. NAEP's current achievement levels should continue to be used on a developmental basis only. If achievement-level results continue to be reported for future administrations of assessments in which achievement levels have already been set, the reports should strongly and clearly emphasize that the achievement levels are still under development and should be interpreted and used with caution. Reports should focus on the change, from one administration of the assessment to the next, in the percentages of students in each of the categories determined by the existing achievement-level cutscores (below basic, basic, proficient, and advanced), rather than on the percentages in each category in a single year.

Recommendation 5C. NAGB should explicitly communicate that achievement levels result from an inherently judgmental process. It should describe more fully the means by which judgments are made. NAGB should also clearly explain the criteria for determining the reasonableness of the achievement levels that result from these judgments.

Recommendation 5D. Preliminary achievement-level descriptions should be an integral part of NAEP's frameworks and should play a key role in guiding the development of assessment materials, including scoring rubrics. Items and tasks should be written and rubrics defined to address the intended achievement levels. Preliminary achievement levels for advanced performance in the content domains need to be clarified.

Recommendation 5E.
The roles of educators, policy makers, the public, NAGB, and other groups in developing achievement-level descriptions and setting achievement levels should be specified more precisely. In particular, the roles of disciplinary specialists and policy makers should be better integrated throughout the achievement-level-setting process. All stakeholder groups, perhaps through NAGB, should have a role in establishing and reviewing the process and the resulting achievement levels.

Recommendation 5F. Normative student performance data and external comparative data should be considered by raters in setting NAEP achievement levels, primarily for use in evaluating the reasonableness of level-setting decisions.

Recommendation 5G. Achievement-level reports should provide information about what students below the basic level can and cannot do.

Recommendation 5H. In order to accomplish the committee's recommendations, NAEP's research and development agenda should emphasize the following:

- documentation and analysis of the impacts of standards-based reporting in NAEP on understanding and use of the results;
- development and implementation of alternate achievement-level-setting models;
- investigation and implementation of the use of normative and comparative data in determining achievement levels and evaluating their reasonableness; and
- analysis of similarities and differences between results of NAEP achievement-level-setting efforts and those associated with state and other testing programs.