Evaluation of the Voluntary National Tests, Year 2: Final Report
6 Reporting
In its statement on the purposes and uses of the VNT, NAGB responded to the congressional request that it include "a description of the achievement levels and reporting methods to be used in grading any national test." Given that the stated purpose of the VNT is to measure individual student achievement and the stated use is to provide information describing the achievement of individual students, NAGB offered the following statements on reporting VNT results (National Assessment Governing Board, 1999e:11):
. . . results of the voluntary national tests [should] be provided separately for each student. Parents, students, and authorized educators . . . should receive the test results report for the student. Test results should be reported according to performance standards for [NAEP]. These are the NAEP achievement levels: Basic, Proficient, and Advanced. All test questions, student answers, and an answer key should be returned with the test results; it will be clear which questions were answered correctly and which were not. The achievement levels should be explained and illustrated in light of the questions on the test. Also, based on the nationally representative sample of students who participated in the national tryout of the test the year before, the percent of students nationally at each achievement level should be provided with the report.
This chapter considers the process leading to NAGB's statements to Congress regarding the reporting of VNT results, and it evaluates the stated plans for reporting results, as outlined in the document submitted to Congress. The committee reviewed reporting procedures with two criteria in mind: (1) Would the current plans result in formats accessible to parents, teachers, and students? (2) Would the current plans report results using meaningful metrics?
To accomplish this, the committee reviewed the following documents:
The Voluntary National Test: Purpose, Intended Use, Definition of Voluntary and Reporting (National Assessment Governing Board, 1999e)
Overview: Determining the Purpose, Intended Use, Definition of the Term Voluntary and Reporting for the Proposed Voluntary National Test (National Assessment Governing Board, 1999a)
VNT: Issues Concerning Score Reporting for the Voluntary National Tests: Results of Parent and Teacher Focus Groups (American Institutes for Research, 1999p)
Selected Item Response Theory Scoring Options for Estimating Trait Values (from Wendy Yen to American Institutes for Research, 1999h)
Score Reporting, Scoring Examinees, and Technical Specifications: How Should These be Influenced by the Purposes and Intended Uses of the VNT (American Institutes for Research, 1999g)
VNT: Plans for Continuing Work in Score Reporting (American Institutes for Research, 1999q)
VNT: Proposed Score Reporting Metrics and Examinee Scoring Algorithms for the Voluntary National Tests (American Institutes for Research, 1999r)
SCORE COMPUTATION
One of the primary recommendations of the NRC's year 1 report was that decisions about how scores will be computed and reported should be made before the design of the VNT test forms can be fully evaluated (National Research Council, 1999b:51). NAGB and AIR are developing and evaluating options for test use and have conducted focus groups that include consideration of reporting options, but no further decisions about score computation and reporting have been made. We believe that decisions are needed soon.
Three options for computing student scores were identified in a paper prepared for the VNT developer's technical advisory committee (American Institutes for Research, 1999h). One option is pattern scoring, which assigns an overall score to each possible pattern of item scores, based on a model that relates examinee ability to the probability of achieving each observed item score. With pattern scoring, a student's score depends on the particular pattern of right and wrong answers. As a result, individuals with the same number of correct responses may get different scores, depending on which items were answered correctly.
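To illustrate why pattern scoring can give different scores to students with the same number correct, here is a minimal sketch. It assumes a two-parameter logistic item response model and a grid-search maximum-likelihood estimate; the paper cited above is not quoted on which model or estimator it had in mind, and the item parameters in `ITEMS` are invented for illustration only.

```python
import math

# Hypothetical (discrimination, difficulty) pairs for five items, ordered
# from easiest to hardest. These are illustrative values, not VNT parameters.
ITEMS = [(0.6, -1.0), (0.9, -0.5), (1.2, 0.0), (1.5, 0.5), (1.8, 1.0)]


def p_correct(theta, a, b):
    """Probability of a correct answer under an assumed 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def log_likelihood(theta, pattern):
    """Log-likelihood of a 0/1 response pattern at ability theta."""
    total = 0.0
    for (a, b), x in zip(ITEMS, pattern):
        p = p_correct(theta, a, b)
        total += math.log(p if x == 1 else 1.0 - p)
    return total


def pattern_score(pattern, lo=-4.0, hi=4.0, steps=801):
    """Ability estimate that maximizes the likelihood over a fixed grid."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda theta: log_likelihood(theta, pattern))


easy_right = (1, 1, 1, 0, 0)  # three easiest items correct
hard_right = (0, 0, 1, 1, 1)  # three hardest items correct

print(sum(easy_right), pattern_score(easy_right))
print(sum(hard_right), pattern_score(hard_right))
```

Both patterns have a raw score of 3, yet the estimate for `hard_right` is higher because the items answered correctly in that pattern have larger assumed discriminations. A parent comparing two answer sheets would see the same number correct and different reported scores.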
Under pattern scoring, the conversion of response strings to reported scores is complicated, and it is unlikely to be well understood or accepted by parents, teachers, and others who will have access to both item and total scores.
Another scoring approach is to use nonoptimal weights, with the weights determined according to either judgmental or statistical criteria. For example, easy items can be given less weight than more difficult items, multiple-choice items can be given less weight than open-ended items, or all items can be weighted in proportion to the item's correlation with a total score. Use of such a scoring procedure would make conversion from the number correct to the reported score less complex than with pattern scoring, but it is not as straightforward as a raw score approach.
The raw score approach is the most straightforward method: a total score is determined directly by summing the scores for all the individual items. Multiple-choice items are scored 1 for a correct answer and 0 for an incorrect answer or for no answer. Open-ended response items are scored on a 2-, 3-, 4-, or 5-point scale according to a defined scoring rubric. The particular subset of items that are responded to correctly, whether easier or harder, multiple-choice or open-ended, is not relevant. We agree with Yen's conclusion (American Institutes for Research, 1999h) that for tests with adequate numbers of items, different item weighting schemes have little effect on the reliability or validity of the total score. Given that test items and answer sheets will be available to students, parents, and teachers, we believe that this straightforward approach makes the most sense.
RECOMMENDATION 6.1 Given that test items and answer sheets will be provided to students, parents, and teachers, as well as made available to the general public, test forms should be designed to support scoring using a straightforward, total correct, raw score approach.
The committee also suggests that careful thought be given to the manner of awarding points for constructed-response (open-ended) items. The final scaled score for a student will be based on the student's raw score, which is simply the number of points awarded. Since each correct response to a multiple-choice item is worth 1 point, care must be taken to ensure that each partial-credit point awarded for a response to a constructed-response item represents a correct portion of a response. The idea is that a response may need to include several correct assertions in order to be fully correct, so each of those assertions can receive a point. For example, the scoring rubric proposed for one mathematics item that asked students to draw a particular type of geometric figure called for 1 point for an incorrect drawing (with no additional constraints) and 2 points for a correct drawing. This might more appropriately be a 1-point item. Or, since the drawn figure had to satisfy two conditions, it might be appropriate to award 1 point for a figure satisfying one of the conditions and 2 points for a response satisfying both.
In VNT: Proposed Score Reporting Metrics and Examinee Scoring Algorithms for the Voluntary National Tests (American Institutes for Research, 1999r), AIR notes a NAGB decision that "all VNT scoring rubrics for constructed-response items award a score of '1' only to those responses that are at least partially correct"; in contrast, NAEP awards a score of 1 to attempts at providing relevant responses. The committee agrees with this decision. Failure to consider the manner for awarding points could have a serious negative effect on the perceived validity of the scoring process among students, parents, and the public.
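The raw score approach of Recommendation 6.1, together with the suggestion that each partial-credit point correspond to a correct portion of a response, can be sketched as follows. The `Item` record, the form layout, and the `figure_points` helper are hypothetical names introduced for illustration; the rubric shown is the alternative the committee describes for the geometric-figure item (one point per condition satisfied), not an actual VNT rubric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    item_id: str
    max_points: int  # 1 for multiple-choice; rubric maximum otherwise


def raw_score(items, awarded):
    """Sum awarded points; a missing response counts as 0."""
    total = 0
    for item in items:
        points = awarded.get(item.item_id, 0)
        if not 0 <= points <= item.max_points:
            raise ValueError(f"{item.item_id}: {points} points is outside the rubric")
        total += points
    return total


def figure_points(conditions_met):
    """One point per satisfied condition, so no point rewards an incorrect drawing."""
    return sum(1 for met in conditions_met if met)


# Hypothetical form: three multiple-choice items and one 2-point drawing item.
FORM = [Item("MC1", 1), Item("MC2", 1), Item("MC3", 1), Item("CR1", 2)]

# MC3 was left blank; the drawing satisfied one of its two conditions.
awarded = {"MC1": 1, "MC2": 0, "CR1": figure_points([True, False])}
print(raw_score(FORM, awarded))  # 1 + 0 + 0 + 1 = 2
```

Because the total is a plain sum, anyone holding the returned booklet and answer key can reproduce it by hand, which is the property the committee is arguing for.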
Awarding points for erroneous work, or having too large a disparity in the value of a point given for different responses, will be noticed and will be hard to defend.
RECOMMENDATION 6.2 Special attention should be given to the work required for receiving partial credit for constructed-response items that have full scores of more than 1 point.
REPORTING SCALE
For NAEP, the current practice is to summarize performance in terms of achievement levels (basic, proficient, and advanced) by reporting the percentages of students scoring at each achievement level. To remain consistent with NAEP, the VNT will also report scores using these achievement-level categories. One shortcoming of this type of reporting is that it does not show where a student placed within the category. For instance, was the student close to the upper boundary of basic and nearly in the proficient category? Or did he or she only just make it over the lower boundary into the basic category? This shortcoming can be overcome if achievement-level reporting is supplemented with reporting using a standardized numeric scale, such as a scale score. This manner of reporting will show how close a student is to the next higher or next lower achievement-level boundary. This additional information will be particularly important in settings where a significant portion of the students score in the "below basic" category.
To ensure clear understanding by students, parents, and teachers, the released materials could include a table that allows mapping of the possible scores to specific points on the standardized scale. A separate table would be needed for each VNT form to account for minor differences in the difficulty of each form. In addition, the cutpoints for each achievement level could be fixed and prominently displayed on the standardized numeric scale.
Tests that report performance according to achievement-level categories might include a series of probabilistic statements regarding achievement-level classification errors. For example, a score report might say that of the students who performed at the basic level, 68 percent would actually be at the basic level of achievement, 23 percent at the proficient level, and 5 percent at the advanced level. Although provision of information regarding the precision with which performance is measured is essential, we do not believe these sorts of probabilistic statements are particularly helpful for parents and do not recommend their use for the VNT.
In conducting focus groups on reporting issues with parents and teachers, NAGB and AIR have developed a potential reporting format that combines NAEP achievement levels with a continuous score scale. In this format, error is communicated by a band around a point estimate on the underlying continuous scale. Reporting of confidence band information will allow parents to see the precision with which their children are placed into the various achievement levels without having to grapple with classification error probabilities. We believe that parents will find information about measurement uncertainty more useful and understandable if it is reported by means of confidence bands rather than as probabilistic statements about the achievement levels.
RECOMMENDATION 6.3 Achievement-level reporting should be supplemented with reporting using a standardized numeric scale. Confidence bands on this scale should be used to communicate measurement error.
As part of the reporting issues focus groups, NAGB asked parents and teachers to react to a sample score report (included as Appendix C in American Institutes for Research, 1999p). Two aspects of this sample report deserve further consideration.
First, in an effort to communicate normative information, the distance between the lower and upper boundaries for each NAEP achievement level was made proportional to the percentage of students scoring at the level. The underlying scale thus resembles a percentile scale. The midpoint of this scale tends to be in the basic range. We are concerned that this format may emphasize the status quo, what students can do now, rather than the rigorous standards behind the achievement levels. Furthermore, scaling according to the distribution of students, while conveying potentially useful normative information, may lead to misinterpretations of the range of content covered by each of the achievement levels. In particular, the advanced level is represented by a very narrow box since very few students are currently at this level. One might conclude that there is relatively little material to be mastered beyond the lower boundary of the advanced range, which is clearly not the case.
A second concern with the current NAGB/AIR scheme is the continued use of the NAEP score scale for labeling both the individual point estimates and the boundaries between achievement levels. The committee is concerned that the three-digit numbers used with the NAEP scale convey a level of accuracy that is appropriate for estimates of average scores for large groups, but that is inappropriate for communicating results for individual students.
Figure 6-1 shows an example of how measurement error and resulting score uncertainty might be reported using confidence bands. In this example, we created a 41-point standard score scale with 10, 20, and 30 defining the lower boundary of the basic, proficient, and advanced levels. While 41 points may be a reasonable number of score levels, the issue of an optimal score scale cannot be decided independent of decisions about the difficulty of each test form and the accuracy of scores at different levels of achievement. It is likely, for example, that the final VNT accuracy targets will not yield the same amount of information at each achievement level.
FIGURE 6-1 Sample reporting format combining achievement-level reporting with a standardized numeric scale. NOTES: The score of 11 on this test is in the basic range. The shaded area shows the range of probable scores if the test were taken again. Statistically, scores would be expected to fall within the shaded range (between 7 and 15) 95 percent of the time.
While the focus groups conducted by AIR have provided useful information on the type of information that parents and teachers will understand and find useful, there is a great deal of work yet to be done. Some of the topics yet to be addressed include:
reporting formats for the students themselves;
the form and format in which items and student responses will be returned to students, parents, and teachers;
what additional information about the test items (difficulty, content area, achievement level, etc.) will be provided to some or all of these audiences;
ways in which teachers might aggregate item-level results across students to identify areas of their curriculum that may need more emphasis; and
how any test accommodations will be noted on the score reports to teachers and parents.
The committee is particularly concerned about the absence of the student's perspective on reporting issues in the work conducted to date. Although focus groups may be useful in generating ideas, the committee is uncomfortable with basing final decisions about reporting formats only on the type of anecdotal information available from focus groups. More rigorous controlled experiments, possibly using cognitive laboratories to test understanding and use of alternative score reports, for example, might be conducted before decisions about reporting formats are reached.
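The reporting format illustrated in Figure 6-1 can be sketched in a few lines. The cutpoints 10, 20, and 30 come from the example scale described above; the scale endpoints (0 and 40), the standard error of 2 scale points, and the normal-theory multiplier of 1.96 are assumptions made here so that the example reproduces the band of 7 to 15 given in the figure notes. A single standard error across the scale is a simplification, since the chapter expects accuracy to vary by achievement level.

```python
# Lower boundaries on the example scale described for Figure 6-1.
CUTS = [(10, "basic"), (20, "proficient"), (30, "advanced")]
SCALE_MIN, SCALE_MAX = 0, 40  # assumed endpoints of the 41-point scale


def achievement_level(score):
    """Highest level whose lower boundary the score reaches."""
    level = "below basic"
    for lower, name in CUTS:
        if score >= lower:
            level = name
    return level


def points_to_next_level(score):
    """Distance to the next higher boundary, or None at the advanced level."""
    for lower, _ in CUTS:
        if score < lower:
            return lower - score
    return None


def confidence_band(score, sem, z=1.96):
    """Approximate 95 percent band, rounded and clipped to the scale."""
    half_width = z * sem
    low = max(SCALE_MIN, round(score - half_width))
    high = min(SCALE_MAX, round(score + half_width))
    return low, high


score = 11
print(achievement_level(score))         # basic
print(points_to_next_level(score))      # 9
print(confidence_band(score, sem=2.0))  # (7, 15)
```

The band for a score of 11 straddles the basic cutpoint of 10, which is the kind of uncertainty the committee suggests parents can read directly from a shaded range rather than from classification-error probabilities.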
The committee also realizes that preliminary decisions about reporting formats may need adjustment after the pilot test and again after the field test to reflect findings about the extent to which VNT scores can be compared with the NAEP achievement-level cutpoints. In addition, the field test should include a full operational test of reporting procedures. Results from the field test, including impact analyses, should be carefully reviewed to identify further reporting changes that might reduce sources of confusion and enhance understanding and use.
SUBSCORE REPORTING
Our call for a clearer statement of expectations for each mathematics content strand or reading stance is not meant to imply that separate scores for each strand or stance should be reported for each student. Test length considerations make it questionable whether subscores could be sufficiently reliable for individual students.
A conceptual problem with subscore reporting is that the NAEP achievement levels are set for each subject as a whole. There is no established view of what it means to be proficient (or basic or advanced) within individual content areas that would define the domain of potential subscores. Unless additional achievement-level setting work is performed, subscale reporting would have to rely on an arbitrary numeric scale, supplemented by normative rather than criterion-referenced interpretive data. This concern is in addition to limitations on the reliability of subscores.
RECOMMENDATION 6.4 Individual student performance on the VNT should not be reported at the subscore level.
RECOMMENDATION 6.5 NAGB and its contractor should undertake research on alternative ways for providing item-level feedback to students, parents, and teachers. The options explored should include provision of information on item content and targeted achievement level, as well as normative information, such as passing rates.
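One of the feedback options named in Recommendation 6.5 (item content, targeted achievement level, and passing rates) could be summarized for a single student roughly as follows. This is only a sketch: the item identifiers, strand labels, targeted levels, and passing rates are all invented, and the summary deliberately reports simple counts and percentages rather than a scaled subscore.

```python
from collections import defaultdict

# Hypothetical item records: (item id, content strand, targeted level,
# national passing rate, whether this student answered correctly).
ITEMS = [
    ("M01", "algebra", "basic", 0.72, True),
    ("M02", "algebra", "proficient", 0.41, False),
    ("M03", "geometry", "basic", 0.65, True),
    ("M04", "geometry", "advanced", 0.12, False),
    ("M05", "geometry", "proficient", 0.38, True),
]


def summarize_by_strand(items):
    """Number and percentage correct per strand, with the mean passing rate."""
    groups = defaultdict(list)
    for _, strand, _, passing_rate, correct in items:
        groups[strand].append((passing_rate, correct))
    summary = {}
    for strand, rows in groups.items():
        n = len(rows)
        n_correct = sum(1 for _, correct in rows if correct)
        summary[strand] = {
            "items": n,
            "correct": n_correct,
            "percent_correct": 100.0 * n_correct / n,
            "mean_passing_rate": sum(rate for rate, _ in rows) / n,
        }
    return summary


for strand, row in summarize_by_strand(ITEMS).items():
    print(strand, row)
```

With only a handful of items per strand, such counts are descriptive rather than reliable measures, which is consistent with the committee's position in Recommendation 6.4.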
ITEM-LEVEL INFORMATION
Perhaps the greatest distinction between the VNT and available commercial tests is the plan to release each test form immediately after operational use and to provide item-level information back to students, parents, and teachers. The developers have only begun to explore the potential value of the item-level information that will be provided. For the individual students, returning item-level information will provide a basis for diagnosing strengths and weaknesses within an overall subject domain that would be even better than subscores. For teachers, the item-level information will provide concrete illustrations of the types of problems that students should be able to solve in each content area.
There are many options for how item-level data will be provided. At one extreme, booklets and answer sheets could simply be returned as originally marked, with a separate answer key for the multiple-choice questions and scoring guides for each of the constructed-response questions. At the other extreme, items might be arranged by content area and achievement level and accompanied by normative and other psychometric data (e.g., average item scores for all students and for specific subgroups), with narrative discussions tailored for each student explaining why each incorrect option selected by the student was incorrect. There could even be summaries of the number of items in each group that the student answered correctly, perhaps with comparative normative figures. Perhaps this type of item-level information would fulfill the desire for subscores. Unlike subscores, however, this information would not require an arbitrary subscore scale or subscore standards, only the simple number and percentage of correct responses.
In general, the committee favors taking maximum advantage of plans to return item-level information. Information should be provided on the content area (stance or content strand) that each item was intended to measure, as well as the achievement level to which each item is linked. Since teachers and parents may try to infer what each item was intended to measure, providing this information may improve the accuracy of their inferences. Sorting the items by achievement level within content area would further aid in understanding the domain that is being measured and the expectations for student performance within this domain. Providing item difficulty information (such as the percentage of students who responded correctly) would also help parents, teachers, and students understand the student's overall score.
RECOMMENDATION 6.6 NAGB and its contractor should consider including students, particularly at the 8th-grade level, as well as parents and teachers in future focus groups on score reporting.
AGGREGATION
In its report to Congress on the purpose and uses of the VNT, NAGB adopts a stance that discourages but does not prohibit aggregation of individual student results. Specifically (National Assessment Governing Board, 1999e:11):
There should be no compilations of student results provided automatically by the program . . . However, it is virtually certain that compilations of student results will be desired and demanded by at least some of the state and district participants and possibly by private school participants. These participants should be permitted to obtain and compile the data at their own cost, but they will bear the full responsibility for using the data in appropriate ways and for ensuring that the uses they make of the data are valid.
The reasons for discouraging aggregation are not fully explicit. The primary concern seems to be that the aggregate results for states or large districts will not agree with NAEP results. It is possible that there is also a concern with preventing inappropriate accountability uses, since the report further states that "[NAGB] would develop and provide guidelines and criteria for use by states, districts, and schools for compiling and reporting [VNT data] in ways that are appropriate and valid" (National Assessment Governing Board, 1999e:11).
There is an apparent contradiction between the adoption of the public policy model (see Chapter 2) and this stance on aggregation. The public policy model gives the primary decision to adopt the VNT to states and districts, but they are cautioned not to aggregate the scores. Why would these entities choose to administer the VNT to their students if aggregation of information at the district and state level is discouraged?
At the committee's July meeting, it was suggested that the effort being put into preventing aggregation might be better spent explaining how the results will differ from NAEP and providing corresponding cautions in interpretation (Yen, 1999). Some of the reasons for differences, specifically differences in student motivation, actually suggest that aggregate VNT results might be a more valid indication of student accomplishments.
Complex statistical procedures are used with NAEP to remove measurement error from estimates of differences in scores over time or among different subgroups of students. In part, these procedures are necessary because of the matrix sampling used with NAEP. Measurement error is an essential consideration because no one student takes a large and completely representative sample of items. Measurement error will be less of an issue with the VNT because each student completes a representative sample of items, which is at least twice the size of the item samples completed by students in NAEP. Indeed, the VNT is designed to provide adequate score accuracy for individual students. Most commercial tests encourage aggregation of observed student scores without requiring complex statistical machinery to remove minor biases due to measurement error.
If VNT results provide a valid assessment of the proficiency of individual students, it is perfectly legitimate for schools, districts, and even states to ask how many of their students meet these standards, or, alternatively, to track changes in means on a standard scale over time. If aggregate results are not provided to answer such questions, there may be very little motivation for districts and states to participate in the VNT.
The committee recognizes that aggregated VNT results are likely to differ from NAEP results, but there could be strong advantages to explaining rather than suppressing these differences. NAGB's Linking Feasibility Team (Cizek et al., 1999) identified a number of differences between the VNT and NAEP specifications. Communicating these differences to VNT users might release the VNT from excessive requirements for "NAEP-likeness." For example, NAGB believes that the calculators used with the 8th-grade mathematics test should be as much like the calculators used in NAEP as possible, even though these calculators, which in NAEP are also used in the 12th grade, contain trigonometric functions not covered in the 8th-grade frameworks. Relaxing the "NAEP-like" requirements would mean that schools (or NAGB) could buy simple "four-function" calculators, which are much easier to explain to students unfamiliar with more advanced models and cost less than one-quarter as much. NAGB, for very good reason, has relaxed the NAEP-like requirements in the test specifications, limiting passage lengths in reading and increasing the proportion of machine-scorable items in mathematics. While these differences may limit the comparability of VNT and NAEP results, they should not be a reason to avoid supporting district- and state-level reporting of aggregated results.
RECOMMENDATION 6.7 NAGB should support aggregation of test results for participating districts and states, while discouraging inappropriate, high-stakes uses of aggregated results. NAGB should develop explicit and detailed guidelines and practices for the appropriate compilation and use of aggregate data from administration of the VNT and should explain limitations on the validity of comparisons of aggregate results on the VNT to results from NAEP.
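The two aggregate questions the committee considers legitimate (how many students meet each standard, and what the mean is on a standard scale) require only simple arithmetic on observed scores. The sketch below assumes the illustrative 41-point scale with cutpoints at 10, 20, and 30 from Figure 6-1, and the list of district scores is invented; it is not a description of any procedure NAGB has adopted.

```python
from collections import Counter
from statistics import mean

LEVELS = ["below basic", "basic", "proficient", "advanced"]
CUTS = [10, 20, 30]  # lower boundaries from the Figure 6-1 example scale


def level_for(score):
    """Achievement level: one step up for each cutpoint the score reaches."""
    return LEVELS[sum(score >= cut for cut in CUTS)]


def aggregate(scores):
    """Mean scale score and percentage of students at each level."""
    if not scores:
        raise ValueError("no scores to aggregate")
    counts = Counter(level_for(s) for s in scores)
    n = len(scores)
    return {
        "n": n,
        "mean_scale_score": mean(scores),
        "percent_at_level": {lvl: 100.0 * counts[lvl] / n for lvl in LEVELS},
    }


# Hypothetical scale scores for eight students in one district.
district_scores = [5, 11, 14, 22, 31, 19, 9, 25]
print(aggregate(district_scores))
```

Any such compilation would still need the interpretive cautions called for in Recommendation 6.7, since these observed-score summaries are not expected to match NAEP estimates for the same population.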