6
Using Innovations in Measurement and Reporting: Reporting Percent Correct Scores
A second aspect of the NAEP market basket is reporting results in a metric easily understood by the public. For some time, NAEP has summarized performance as scale scores ranging from 0 to 500. However, it is difficult to attach meaning to scores on this scale. What does a score of 250 mean? What are the skills of a student who scores a 250? In which areas are they competent? In which areas do they need improvement?
Achievement level reporting was introduced in 1990 to enhance the interpretation of performance on NAEP. NAEP's sponsors believe that public understanding could be further improved by releasing a large number of sample items, summarizing performance using percent correct scores, and tying percent correct scores to achievement level descriptions. Since nearly everyone who has passed through the American school system has at one time or another taken a test and received a percent-correct score, most people could be expected to understand scores like 90%, 70%, or 50%. Unlike the NAEP scale scores, the percent correct metric might have immediate meaning to the public.
PERCENT CORRECT METRIC: NOT AS SIMPLE AS IT SEEMS
At first blush, percent correct scores seem to be a simple, straightforward, and intuitively appealing way to increase public understanding of NAEP results. However, they present complexities of their own. First, NAEP contains a mix of multiple-choice and constructed response items.
In preliminary stages of scoring, multiple-choice items are awarded one point if answered correctly and zero points if answered incorrectly. Answers to constructed response items are also awarded points, but for some constructed response questions, six is the top score, and for others, three is the top score. For a given constructed response item, higher points are awarded to answers that demonstrate more proficiency in the particular area. Furthermore, a specific score cannot be interpreted, even at this preliminary stage, as meaning the same level of proficiency on different items (e.g., a four on one item would not represent the same level of proficiency as a four on another item). This situation becomes more complex at subsequent stages of IRT-based scoring and reporting, and the concept of “percent correct” becomes meaningless. Therefore, in order to come up with a simple sum of the number of correct responses to test items that include constructed response items, one would need to understand the judgment behind “correct answers.” What would it mean to get a “correct answer” on a constructed response item? What would be considered a correct answer? Receiving all points? Half of the points? Any score above zero?
As an alternative, the percent correct score might be based, not on the number of questions, but on the total number of points. This presents another complexity, however. Simply adding up the number of points would result in awarding more weight to the constructed response questions than to the multiple-choice questions. For example, suppose a constructed response question can receive between one and six points, with a two representing slightly more competence in the area than a one but clearly not enough competence to get a six. Compare a score of two out of six possible points on this item versus a multiple-choice item where the top score for a correct answer is one. A simple adding up of total points would give twice as much weight to the barely correct constructed response item as to an entirely correct multiple-choice item. This might be reasonable if the constructed response questions required a level of skill much higher than the multiple-choice questions, such that a score of two on the former actually represented twice as much skill as a score of one on the latter. Since this is not the case for NAEP questions, some type of weighting scheme is needed. Yet, weighting schemes also introduce complexity to the percent correct metric.
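The weighting problem can be made concrete with a minimal sketch. The item responses below are invented for illustration (this is not actual NAEP scoring); the sketch contrasts a raw point total with one simple equal-item-weight alternative, in which each item contributes only its fraction of its own maximum.

```python
# Hypothetical item responses: (points_earned, points_possible).
# Invented data -- illustrates the weighting issue, not NAEP's method.
items = [
    (1, 1),  # multiple-choice item answered correctly
    (0, 1),  # multiple-choice item answered incorrectly
    (2, 6),  # constructed-response item, barely-correct answer
]

# Naive percent correct: total points over total possible points.
# The barely-correct constructed-response answer (2 points) counts
# twice as much as the fully correct multiple-choice answer (1 point).
raw_pct = 100 * sum(e for e, _ in items) / sum(p for _, p in items)

# Equal-item-weight alternative: each item contributes the fraction of
# its own maximum, so no item dominates merely by carrying more points.
weighted_pct = 100 * sum(e / p for e, p in items) / len(items)

print(f"raw points:   {raw_pct:.1f}%")      # 3/8 of points -> 37.5%
print(f"equal weight: {weighted_pct:.1f}%") # (1 + 0 + 1/3)/3 -> 44.4%
```

The two definitions disagree even on this three-item test, which is the point: any percent correct score for a mixed-format assessment embeds a weighting decision that the bare number does not reveal.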
A number of workshop participants addressed the deceptive simplicity of percent correct scores. Several pointed out that the public already has difficulty understanding terms that psychometricians use, such as national percentile rank or grade-level equivalents. As a result, assessment directors spend a good deal of time trying to ensure that policymakers and the public make the proper inferences from test results. The danger of the percent correct score is that everyone might think they understand it due to their own life experience, when, in fact, they do not.
Still, it should be pointed out that the percent correct metric has much intuitive appeal. If used correctly, it might be of great benefit in increasing understanding of NAEP. Moreover, all statistics are susceptible to misuse, percent correct as well as more complex statistics. As Ronald Costello, assistant superintendent of public schools in Noblesville, Indiana, observed:
It doesn't matter what the statistic is, it still will be used for rank ordering when it gets out to the public. There are 269 school districts in Indiana. When test results come out, there's a 1 and a 269. The issue is why are we testing students and what do we want to do with the results.
Costello concluded by saying that more important than the statistic is the use of the results. Attention should be focused on making progress in educational achievement, and the statistic should enable evaluation of the extent to which students have progressed.
DISCONNECT WITH PUBLIC PERCEPTIONS OF “PROFICIENT”
One plan for the NAEP percent correct scores is to report them in association with the NAEP achievement levels. At the workshop, Roy Truby presented a document that showed how this might be accomplished based on results from the 1992 NAEP mathematics assessment (Johnson et al., 1997). An excerpt appears in Table 1. This table displays percent correct results for test takers in grades four, eight, and twelve. Column 2 presents the overall average percent correct for test-takers in each grade. Columns 3-5 show the percent correct scores for each achievement level category associated with the minimum score cutpoint for the category. For example, the cutpoint for the fourth grade advanced category (Column 3) would be associated with a score of 80 percent correct. A percent correct score of 33 percent would represent performance at the cutpoint for twelfth grade's basic category.
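The cutpoint logic described above can be sketched in a few lines: a score is classified into the highest achievement level whose minimum cutpoint it meets. The function is illustrative only; the cutpoint values are the grade 4 figures from Table 1.

```python
def achievement_level(pct: float, cuts: dict) -> str:
    """Classify a percent correct score using minimum-score cutpoints,
    checking the highest achievement level first."""
    for level in ("Advanced", "Proficient", "Basic"):
        if pct >= cuts[level]:
            return level
    return "Below Basic"

# Grade 4 cutpoints from Table 1 (1992 mathematics simulation).
grade4 = {"Advanced": 80, "Proficient": 58, "Basic": 34}

# The grade 4 average of 41 percent correct falls in the Basic range.
print(achievement_level(41, grade4))  # 'Basic'
```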
TABLE 1 Example of Market Basket Results

                                          Cut Points by Achievement Level
(1)        (2)                            (3)          (4)            (5)
           Average Percent
Grade      Correct Score^a                Advanced     Proficient     Basic

4          41%                            80%          58%            34%
8          42                             73           55             37
12         40                             75           57             33

^a In terms of total possible points.
NOTE: The information in Table 1 is based on simulations from the full NAEP assessment; results for a market basket might differ depending on its composition.

Speakers cautioned that the percent correct scale used in Table 1 is unlike that understood by the public. In their opinion, people typically regard 70% as a passing score; scores around 80% as indicating proficiency; and scores of 90% and above as advanced. What would members of the general public think when they saw that the average American student scored less than 50% on the test represented in the table? Would this scheme be an appropriate basis for the public's evaluation of the level of education in schools today? According to one speaker:
Most test directors would understand why this might be, but no teacher, parent, or member of the public would consider 55% proficient. They would consider that score as representing “clueless” perhaps, and would think even less of the test and the educators that would purport to pass off 55% as proficient.
CONVERSION TO GRADES
While most Americans have at one time or another taken a test and received a percent score, generally that percent score was converted to a letter grade. Although associating percent correct scores with an achievement level might increase public understanding of NAEP, many people would still be tempted to convert the scores to letter grades, and their conversions might not be accurate. Richard Colvin offered his perspective as an education reporter for the Los Angeles Times:
On its own, a percent correct score is only slightly more meaningful than a scale score. The reason is that, in school, percent correct is translated into a grade: 93% or above for an “A,” 85% to 93% for a “B,” and so forth. If you were to put out a percent correct score for the market basket of items, I assure you that journalists will push you to say what letter grade it represents. And, if you aren't willing to do that, journalists will do it for you.
Other participants echoed this concern, noting that the public would need a means for interpreting and evaluating percent correct scores.
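The conversion Colvin predicts is easy to mimic. The sketch below uses the “A” and “B” cutoffs he cites; the lower cutoffs are invented for illustration, since he does not give them.

```python
def letter_grade(pct: float) -> str:
    """Map a percent correct score to a school-style letter grade.
    The 93/85 cutoffs are those Colvin cites; the C and D cutoffs
    are hypothetical, added only to complete the scale."""
    if pct >= 93:
        return "A"
    if pct >= 85:
        return "B"
    if pct >= 75:
        return "C"  # hypothetical cutoff
    if pct >= 65:
        return "D"  # hypothetical cutoff
    return "F"

# Applied to Table 1's Proficient cutpoints (58, 55, and 57 percent),
# every grade level would earn an "F" -- exactly the misreading
# workshop participants warned about.
print([letter_grade(p) for p in (58, 55, 57)])  # ['F', 'F', 'F']
```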
ONE STEP FORWARD, TWO STEPS BACK
As described by Andrew Kolstad, senior technical advisor with NCES, the percent correct metric was used for reporting results in the first decade of NAEP. Use of item response theory (IRT), beginning in the early 1980s, solved many of the interpretation problems that stemmed from the practice of reporting percent correct scores for subsets of items. Therefore, some workshop discussants wondered why NAEP would want to return to the metric used in its early years. David Thissen, professor of psychology at the University of North Carolina, emphasized this point, observing that “NAEP's use of the IRT scale in the past two decades has done a great deal to legitimize such IRT procedures with the result that many other assessments now use IRT scales. . . . [A] potential unintended consequence of NAEP reporting on a percent correct scale might be to drive many other tests, such as state assessments, to imitation.”
NAEP uses some of the most sophisticated and high-quality analytic and reporting methods available. If NAEP moves away from such procedures to a simpler percent correct metric, others will surely follow suit. Many discussants maintained that they did not see the benefits of the simpler metric.
DOMAIN REFERENCED REPORTING
During his comments on technical and measurement considerations, Don McLaughlin, chief scientist for the American Institutes for Research, reminded participants that the desired inferences about student achievement are about the content domain, not about the set of questions on a particular test form. The interest is not in the percent of items or points correct on a form. Instead, the interest is in the percent of the domain that children have mastered.
Domain referenced reporting was cited as an alternative to market-basket reporting. Domain referenced reporting is based on large collections of items that probe the domain with more breadth and depth than is possible through a single administrable test form. As described by Darrell Bock, domain referenced reporting involves expressing scale scores in terms of the expected percent correct on a larger collection of items representative of the specified domain. The expected percent correct can be calculated for any given scale score using IRT methods and the estimated item parameters of the sample of test questions (see Bock et al., 1997). Bock further explained the concept of domain referenced reporting, saying:
[A] domain sample for mathematics might consist of 240 items by selecting 4 items to represent each of the 60 cells of the domain specification described by [John] Mazzeo. These items could be drawn from previously released items from the NAEP assessment or from state testing programs. Their parameters could be estimated by adding a small number of additional examinees in each school participating in the [NAEP] and administering them special test forms containing small subsets of the domain sample, similar to those proposed for the market basket.
The point is to publish the 240 items in a compendium organized by the content, process, and achievement level categories. . . . For graded open-ended items, the rating categories should also be described and the “satisfactory” and “unsatisfactory” categories identified. The objective of this approach is not only to provide sufficient items from which readers of the assessment report can infer the knowledge and skills involved in mathematics achievement, but also, by publishing the compendium well before the assessment takes place, to encourage its use as an aid to instruction and self-study and as a basis for comment and explication in the media. When the results finally appear, there will then exist a ready and well-informed audience for the assessment report.
Bock went on to offer as an example of such a compendium the procedures used by the Federal Aviation Administration (FAA) to license private pilots. All 915 items that could potentially appear on the exam are published. And all potential pilots receive this compendium so that they may study the necessary material.
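The expected-percent-correct calculation Bock describes can be sketched under a simple two-parameter logistic (2PL) IRT model: the expected percent correct at a given scale score is the average of the item response probabilities over the domain pool. The item parameters below are invented for illustration; NAEP's operational analyses use more elaborate models and estimated parameters for the actual item pool.

```python
import math

def p_correct(theta: float, a: float, b: float) -> float:
    """2PL item response function: probability of a correct answer at
    proficiency theta, for an item with discrimination a and
    difficulty b. (Illustrative; NAEP's operational models differ.)"""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def expected_percent_correct(theta: float, item_params) -> float:
    """Expected percent correct over a domain pool: the mean item
    response probability at theta, times 100."""
    probs = [p_correct(theta, a, b) for a, b in item_params]
    return 100.0 * sum(probs) / len(probs)

# Hypothetical (a, b) parameters -- in Bock's example this would be
# the 240-item compendium with parameters estimated from special
# test forms administered alongside the NAEP assessment.
pool = [(1.0, -1.0), (1.2, 0.0), (0.8, 0.5), (1.5, 1.0)]

print(round(expected_percent_correct(0.0, pool), 1))
```

Because the expectation is taken over the whole domain pool rather than any single administered form, the reported number does not depend on which subset of items a given student happened to see.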