APPENDIX D
Exploring New Models for Achievement-Level Setting
In Chapter 5, we suggested exploring models for achievement-level setting in which judgments focus on aggregates of student performance data, rather than on the accumulation of many item-level judgments. We also recommended the use of normative and external comparative data to assist in ensuring the reasonableness of the achievement-level cutscores. In this appendix, we provide the initial conceptual framing for a model of achievement-level setting for NAEP that relies on the solicitation of judgments about aggregates of student performance data and on the use of comparative data to help ensure the reasonableness of the results. We emphasize that this model has not been pilot-tested, even on a small scale; therefore, we have no empirical basis for evaluating its merits. However, we hope that this collection of ideas can stimulate discussion of alternatives for future achievement-level-setting efforts.
CONCEPTUAL DESCRIPTION OF ONE POSSIBLE MODEL
Step 1: Framework Development and Item Authoring
The first step in this model calls for simultaneous development of frameworks and preliminary achievement-level descriptions in NAEP disciplines. The subject-matter experts who develop NAEP frameworks would include individuals who are well positioned to describe the knowledge and skills that students performing at the basic, proficient, and advanced levels should exhibit at each of the grades assessed.
During assessment development, assessment materials (including draft scoring rubrics) would be developed to reflect the knowledge and skills addressed by the preliminary achievement-level descriptions. Items and tasks would be constructed to specifically assess the knowledge and skills described in the preliminary achievement-level descriptions. Rubrics would be constructed to permit assessment of students' levels of understanding relative to the specified knowledge and skills.
Step 2: Item Mapping and Generation of Anchor Descriptions
After the assessment is administered, all items would be mapped onto the NAEP proficiency scale. The process of item mapping (described by O'Sullivan et al., 1997:6-9) results in the hierarchical ranking of items (or, in the case of constructed-response items with multiple scoring levels, the ranking of levels of responses to items) along the NAEP proficiency scale, with easiest items near the bottom of the scale and more difficult items at the top of the scale.
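As a rough illustration of the ranking that item mapping produces, the following Python sketch orders a handful of made-up entries by mapped scale location. The item labels and locations are hypothetical, not NAEP data; following the description above, a constructed-response item contributes one entry per scoring level.

```python
from typing import NamedTuple, Optional


class MappedEntry(NamedTuple):
    item_id: str                 # hypothetical item label
    score_level: Optional[int]   # None for multiple-choice; level for constructed-response
    location: float              # mapped point on the proficiency scale (invented value)


# Invented entries for illustration only.
entries = [
    MappedEntry("MC-01", None, 112.0),
    MappedEntry("CR-07", 1, 131.0),
    MappedEntry("CR-07", 2, 178.0),
    MappedEntry("MC-04", None, 156.0),
]


def rank_by_location(entries):
    """Return entries ordered from easiest (lowest location) to hardest."""
    return sorted(entries, key=lambda e: e.location)


ranked = rank_by_location(entries)
for e in ranked:
    label = e.item_id if e.score_level is None else f"{e.item_id} (level {e.score_level})"
    print(f"{e.location:6.1f}  {label}")
```

Note that the two scoring levels of the same constructed-response item can fall at quite different points on the scale, which is why they are ranked separately.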
Following item mapping and based on the evaluation of the items and item-level data, a group of educators and other experts in the discipline (and including framework developers) would develop descriptions of the knowledge and skills that correspond to performance at selected points along the NAEP proficiency scale. For example, by analyzing the collection of items that map at or near selected points on the proficiency scale, behavioral anchor descriptions of aggregated student performance at increments along the scale could be developed.
Figure D-1 illustrates this second step, providing an illustrative set of behavioral anchor descriptions along the NAEP proficiency scale. These were developed using 1996 NAEP science assessment data from grade 8. The center column of the figure shows the NAEP proficiency scale at 20-point intervals from 80 to 260. Also shown are points along the proficiency scale for various percentiles for the grade 8 NAEP student population (e.g., 5 percent of the student population had a proficiency of 89 or below; 50 percent of the student population had a proficiency of 153 or below). The mean proficiency of the national grade 8 student population (148) also is shown.
The left-hand column shows behavioral anchor descriptions that we developed based on items that mapped within a ± 5-point interval around each of the anchor points on the diagram. For example, the behavioral anchor description at 160 represents a description of the aggregate of knowledge and skills achieved by students correctly answering the items that mapped between 155 and 165 on the proficiency scale (or, for constructed-response items, generating responses that correspond to scoring levels that mapped between 155 and 165). This set of behavioral anchor descriptions provides a view of student achievement arrayed along the NAEP proficiency scale.
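The ± 5-point windowing described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used to build Figure D-1: the item labels and mapped locations are invented, and treating the window endpoints as inclusive is an assumption.

```python
def items_near_anchor(locations, anchor, half_width=5.0):
    """Return ids of items whose mapped location lies within anchor +/- half_width.

    Endpoints are treated as inclusive (an assumption for this sketch).
    """
    return sorted(
        item for item, loc in locations.items() if abs(loc - anchor) <= half_width
    )


# Hypothetical mapped locations (not NAEP data).
locations = {"A": 156.0, "B": 161.5, "C": 165.0, "D": 171.0, "E": 139.0, "F": 143.5}

# Anchor points at 20-point intervals from 80 to 260, as in the center column.
anchors = range(80, 261, 20)
windows = {a: items_near_anchor(locations, a) for a in anchors}

print(windows[160])  # items that would inform the description at 160
print(windows[140])  # items that would inform the description at 140
```

In this sketch an item such as "D", which maps at 171, falls in no window; whether such items should inform the anchor descriptions is a design choice the model leaves open.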
If the developers who wrote the frameworks and preliminary achievement-level descriptions were able to lay out reasonable expectations for student performance in those descriptions, and if assessment materials were developed and student responses scored with the differences in levels of student performance on the preliminary achievement-level descriptions in mind, then the behavioral anchor descriptions should bear at least some general similarities to the preliminary achievement-level descriptions. For example, it would be reasonable to expect that the level of knowledge and skills described in the preliminary achievement-level description for advanced performance would be reflected more frequently in the behavioral anchor descriptions at the upper end of the proficiency scale than in the middle or lower portions of the scale. We would not necessarily expect, however, that there would be a tight and complete correspondence of the behavioral anchor descriptions with the preliminary achievement-level descriptions. Since in this model the preliminary descriptions serve primarily as guides for assessment and scoring rubric development, a lack of correspondence between behavioral anchor descriptions and preliminary achievement-level descriptions can be accommodated, as described in later steps of the model.
Step 3: Mapping of Comparative Data
After the administration is completed, internal and external comparative data also can be mapped onto the NAEP proficiency scale. As illustrated in the right-hand column in the example in Figure D-1, these types of data could include mean proficiencies of various states participating in the assessment. Achievement-level data from other NAEP grade 8 assessments, achievement data from countries participating in TIMSS, and data from behavioral anchoring in the NAEP long-term trend assessments could also be mapped to corresponding percentile locations on the scale. While such direct comparisons of the latter three data collections to main NAEP science have serious limitations (Johnson, 1997; National Research Council, 1999), these data do provide some basis for comparison, since the samples of students assessed in main NAEP science, other main NAEP subjects, long-term trend NAEP assessments, and the TIMSS assessments all were national probability samples. (We recognize that one problem with mapping these comparative data directly on a diagram such as Figure D-1 is that this representation may suggest a stronger linkage between NAEP and other data collections than actually exists.)
Step 4: Achievement-Level Setting
In the fourth step, judges would be impaneled (including grade-level educators, disciplinary experts, and policy makers) to set standards by reviewing three kinds of data: (1) distribution data showing the percentage of students scoring at or above each score increment (i.e., the percentiles displayed in Figure D-1), (2) the behavioral anchor descriptions developed in Step 2, and (3) comparative benchmark data such as that described in Step 3. These performance benchmarks help place NAEP results in a broader context and should include comparison data from other assessments when appropriate and when they are available. Raters with differing expertise and policy interests would be assembled to perform the setting of achievement levels. Ideally, this group would include members of the National Assessment Governing Board so that the discussions and decisions of the group of raters could be directly reflected in NAGB's decisions about the final achievement levels.
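The first kind of data, the percentage of students scoring at or above each score increment, is simple to tabulate. The sketch below assumes a plain list of invented scores with equal weight for every student; an operational analysis would need to account for the survey's sampling design, which this illustration ignores.

```python
def percent_at_or_above(scores, cut):
    """Percentage of scores greater than or equal to `cut` (unweighted)."""
    return 100.0 * sum(s >= cut for s in scores) / len(scores)


# Ten invented scores, for illustration only.
scores = [92, 118, 131, 140, 149, 153, 160, 171, 188, 204]

# Tabulate at 20-point increments from 80 to 260.
table = {cut: percent_at_or_above(scores, cut) for cut in range(80, 261, 20)}

for cut, pct in table.items():
    print(f"{cut:3d}: {pct:5.1f}% at or above")
```

A table of this kind lets raters see at once what share of students would be classified at or above a level for any candidate cutscore.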
A variety of strategies could be employed to help raters utilize information such as that displayed in Figure D-1 in setting achievement levels. We describe key steps of one possible strategy here.
Raters should first consider the behavioral anchor descriptions and determine which descriptions best represent basic, proficient, and advanced performance. Raters would be guided by the policy descriptions, the preliminary achievement-level descriptions, and their own judgments about what constitutes basic, proficient, and advanced performance. Once a general proficiency range has been determined (e.g., that the 120-140 anchor descriptions describe basic performance, but the 160 description is proficient, and the 100 description is below basic), raters would examine where individual items mapped to more narrowly determine the specific proficiency at which a cutscore would be set. The key feature of this strategy is that raters would first consider aggregate data, and then move to item data only after having determined the general features of what constitutes basic, proficient, and advanced performance.
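The two-stage logic of this strategy (aggregate judgments first, item-level data second) can be sketched as follows. Everything here is hypothetical: the ratings mirror the example in the paragraph above, the item locations are invented, and the sketch assumes that ratings rise monotonically along the scale and that at least one anchor lies on each side of every cutscore.

```python
LEVELS = ["below basic", "basic", "proficient", "advanced"]


def cutscore_bracket(anchor_labels, level):
    """Return (low, high): the highest anchor rated below `level` and the
    lowest anchor rated at or above it. The cutscore lies in this range."""
    rank = LEVELS.index(level)
    below = [a for a, lab in anchor_labels.items() if LEVELS.index(lab) < rank]
    at_or_above = [a for a, lab in anchor_labels.items() if LEVELS.index(lab) >= rank]
    return max(below), min(at_or_above)


def items_in_bracket(locations, bracket):
    """List (location, item) pairs that map inside the bracket, lowest first."""
    low, high = bracket
    return sorted((loc, item) for item, loc in locations.items() if low < loc <= high)


# Stage 1: hypothetical rater judgments about the anchor descriptions.
anchor_labels = {100: "below basic", 120: "basic", 140: "basic", 160: "proficient"}

# Stage 2: invented item locations that raters would examine within each bracket.
locations = {"A": 108.0, "B": 117.5, "C": 133.0, "D": 146.0, "E": 152.5, "F": 158.0}

basic_bracket = cutscore_bracket(anchor_labels, "basic")
proficient_bracket = cutscore_bracket(anchor_labels, "proficient")

print(basic_bracket, items_in_bracket(locations, basic_bracket))
print(proficient_bracket, items_in_bracket(locations, proficient_bracket))
```

The code only narrows the range of items to review; choosing the specific cutscore within a bracket remains a matter of rater judgment.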
After these initial achievement-level cutscores are determined, raters would then examine normative data (percentiles) and comparative benchmark data to evaluate the reasonableness of the cutscores, and to inform the magnitude of any adjustments in cutscores that might be deemed necessary based on that evaluation. All raters, including subject-matter experts, policy makers, and members of the National Assessment Governing Board, would evaluate the reasonableness of the cutscores jointly, and together agree on any needed adjustments.
Once raters finalize their cutscores, the results would be forwarded to the full NAGB for review and approval (or adjustment). In their review, NAGB would have the full array of data displayed in Figure D-1 available to inform their decision making, as well as the raters' rationale for any adjustments made to their initial cutscores based on the evaluation of normative and comparative data. The rationale for decisions made by NAGB to adjust achievement levels submitted to them by the raters should be clearly described in the reports of achievement-level results.
Step 5: Revising the Achievement-Level Descriptions
After the final achievement levels are approved by NAGB, the achievement-level descriptions would then be revised (using behavioral anchoring techniques) to match the knowledge and skills represented by the items that map on the NAEP proficiency scale within the range of proficiencies associated with each of the final achievement levels.
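One way to assemble the item sets on which the revised descriptions would be based is sketched below. The cutscores and item locations are invented placeholders, not approved NAEP values, and assigning an item that maps exactly at a cutscore to the higher level is an assumption.

```python
from bisect import bisect_right

NAMES = ("below basic", "basic", "proficient", "advanced")


def level_for(location, cuts):
    """Return the level whose score range contains `location`.

    `cuts` holds the basic, proficient, and advanced cutscores in ascending
    order; a location equal to a cutscore is placed in the higher level.
    """
    return NAMES[bisect_right(cuts, location)]


# Placeholder cutscores and item locations, for illustration only.
cuts = [120.0, 160.0, 200.0]
locations = {"A": 108.0, "B": 120.0, "C": 133.0, "D": 160.0, "E": 187.5, "F": 205.0}

items_by_level = {name: [] for name in NAMES}
for item, loc in sorted(locations.items(), key=lambda kv: kv[1]):
    items_by_level[level_for(loc, cuts)].append(item)

for name in NAMES:
    print(f"{name:12s} {items_by_level[name]}")
```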
This concept is not without its own challenges. For example, the processes of item mapping and developing behavioral anchor descriptions have been the subject of some controversy (Forsyth, 1991). In particular, there is no universally accepted rule regarding where on the NAEP proficiency scale an item should be mapped: at the point where 50 percent of the students respond correctly? 65 percent? 80 percent? Ongoing research, some of it conducted for NAEP, has not resolved this issue.
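To see why the choice of percentage matters, consider a simplified two-parameter logistic item response model, assumed here purely for illustration (operational models may include additional parameters, such as one for guessing). Under this assumption the probability of a correct response is P(theta) = 1 / (1 + exp(-a(theta - b))), so the point at which a proportion p of students respond correctly is theta = b + ln(p / (1 - p)) / a. The parameter values below are hypothetical, and locations are on the model's latent scale rather than the NAEP reporting scale.

```python
import math


def p_correct(theta, a, b):
    """Probability of a correct response under a two-parameter logistic model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def mapped_location(a, b, rp):
    """Latent-scale point at which the probability of success equals `rp`."""
    return b + math.log(rp / (1.0 - rp)) / a


# Hypothetical item: discrimination a = 1.0, difficulty b = 0.0.
a, b = 1.0, 0.0
for rp in (0.50, 0.65, 0.80):
    theta = mapped_location(a, b, rp)
    print(f"criterion {rp:.2f}: item maps at theta = {theta:.3f}")
```

The same item maps higher on the scale as the criterion rises, and the size of the shift depends on the item's discrimination, so different criteria can also change the order in which items are ranked.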
If the ideas presented here are explored further, they undoubtedly would undergo significant revision. We do believe further discussion of the features of this model is warranted, as this method relies on rater judgments about aggregates of achievement data, permits evaluation of reasonableness using normative and comparative data, fosters joint participation in standard setting by policy makers and educators, and may result in a more easily understood achievement-level-setting process.
REFERENCES
Beaton, Albert E., Michael O. Martin, Ina V. S. Mullis, Eugenio J. Gonzalez, Teresa A. Smith, and Dana L. Kelly 1996 Science Achievement in the Middle School Years. Chestnut Hill, MA: Center for the Study of Testing, Evaluation, and Educational Policy, Boston College.
Campbell, Jay R., Kristin E. Voelkl, and Patricia L. Donahue 1997 NAEP 1996 Trends in Academic Progress: Achievement of U.S. Students in Science, 1969 to 1996; Mathematics, 1973 to 1996; Reading, 1971 to 1996; and Writing, 1984 to 1996. NCES 97-985. Washington, DC: U.S. Department of Education.
Forsyth, R.A. 1991 Do NAEP scales yield valid criterion-referenced interpretations? Educational Measurement: Issues and Practice (Fall):3-9.
Johnson, Eugene G. 1997 A TIMSS-NAEP Link. Unpublished paper prepared for the U.S. Department of Education, Washington, DC.
National Research Council 1999 Uncommon Measures: Equivalence and Linkage of Educational Tests. Michael J. Feuer, Paul Holland, Meryl W. Bertenthal, F. Cadelle Hemphill, and Bert F. Green, eds. Committee on Equivalency and Linkage of Educational Tests, Board on Testing and Assessment. Washington, DC: National Academy Press.
O'Sullivan, Christine Y., Clyde M. Reese, and John Mazzeo 1997 NAEP 1996 Science Report Card for the Nation and the States. Washington, DC: U.S. Department of Education.
Reese, Clyde M., Karen E. Miller, John Mazzeo, and John Dossey 1997 NAEP 1996 Mathematics Report Card for the Nation and the States: Findings from the National Assessment of Educational Progress. Washington, DC: U.S. Department of Education.