The committee’s charge was to develop a set of criteria with which to evaluate NAEP achievement levels. We were asked to address the question: To what extent are the NAEP achievement levels reasonable, reliable, valid, and informative to the public? To respond to this charge, we developed a set of questions (see Box 1-2, in Chapter 1) and organized our data gathering and analysis around them:
- Why was achievement-level reporting implemented? What was it intended to accomplish?
- What validity evidence exists that demonstrates these interpretations and uses are supportable?
- Was the overall process for determining achievement levels—their descriptions, the designated levels (Basic, Proficient, Advanced), and cut scores—reasonable and sensible?
- Did the process yield a reasonable set of cut scores?
- What questions do stakeholders want and need NAEP to answer? Do achievement-level reports respond to these wants and needs better than reports on other metrics (e.g., summaries of scale scores)?
- What guidance is provided to help users interpret achievement-level results?
Our analysis considered both the Standards for Educational and Psychological Testing in place in 1992 (American Educational Research Association et al., 1985; hereafter, Standards) and the most current version (2014).
Process for Setting Achievement Levels
In setting standards for the 1992 reading and mathematics assessments, the National Assessment Governing Board (NAGB) broke new ground. Although standard setting has a long history, its use in 1992 in the area of educational achievement testing—and to set multiple standards for a given assessment with multiple response formats—was new.
In examining the process, we considered the ways in which panelists for the standard settings were selected and trained, how the method for setting the cut scores was selected and implemented, and how the achievement-level descriptors (ALDs) were developed. We note our key findings:
- The process for selecting standard setting panelists was extensive and, in our judgment, likely to have produced a set of panelists that represented a wide and appropriate array of views and perspectives.
- In selecting a cut-score setting method, NAGB and ACT chose one method for the multiple-choice and short-answer questions and another for the extended-response questions. This was novel at the time and is now widely recognized as a best practice.
- NAEP’s 1992 standard setting represented the first time that formal, written ALDs were developed to guide the standard setting panelists. This, too, was novel at the time and is now widely recognized as a best practice.
CONCLUSION 3-1 The procedures used by the National Assessment Governing Board for setting the achievement levels in 1992 are well documented. The documentation includes the kinds of evidence called for in the Standards for Educational and Psychological Testing in place at the time and currently and was in line with the research and knowledge base at the time.
Reliability of Achievement Levels
Three kinds of reliability evidence were reported by NAGB and ACT: interpanelist agreement, intrapanelist consistency across items of different types, and the stability of cut scores across occasions. The materials ACT prepared to document the standard settings were very detailed and included the kinds of reliability information one would expect to see. A considerable number of in-depth analyses were conducted and reported, representing the kinds of analyses suggested by the Standards and
best practice at the time (American Educational Research Association et al., 1985). Although the Standards indicate the types of data and analyses that should be examined and reported, they do not go so far as to specify acceptable values, both because acceptable values are context specific and because there is no consensus among measurement experts on what the values should be. Questions were raised about the findings, particularly about the indicators of intrapanelist consistency, but they were not answered before the 1992 mathematics results were reported.
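To make two of these reliability quantities concrete, the sketch below computes interpanelist variability and the standard error of a mean cut score, and shows how an adjustment of "1 standard error" (the adjustment later applied to the 1992 mathematics cut scores) would shift a cut score. All numbers are hypothetical, invented purely for illustration; this is not NAEP's actual analysis.

```python
# Illustrative sketch (hypothetical data): interpanelist agreement and the
# standard error of a cut score in a standard setting.
import statistics

# Hypothetical final-round cut-score judgments (scale-score units)
# from 10 panelists for one achievement level.
judgments = [281, 290, 276, 295, 284, 288, 279, 292, 286, 283]

n = len(judgments)
mean_cut = statistics.mean(judgments)   # the cut score: mean of judgments
sd = statistics.stdev(judgments)        # interpanelist variability
se_mean = sd / n ** 0.5                 # standard error of the mean cut score

print(f"mean cut score: {mean_cut:.1f}")
print(f"panelist SD:    {sd:.1f}")
print(f"SE of mean:     {se_mean:.1f}")

# Lowering a cut score "by 1 standard error" shifts it by se_mean:
adjusted_cut = mean_cut - se_mean
print(f"adjusted cut:   {adjusted_cut:.1f}")
```

In this framing, large panelist SDs (relative to the score scale) are the kind of variability flagged in Conclusion 4-1, and the size of the standard error determines how consequential a one-standard-error adjustment is.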
The documentation indicates that ACT and NAGB spent considerable time wrestling with these issues and solicited advice from their consultant teams; they concluded that more research was needed, and, as noted in Chapters 5 and 6, the evaluators from the National Academy of Education (NAEd) who studied this issue recommended additional studies before reporting achievement-level results.
NAGB chose not to conduct additional studies; it did lower the cut scores for mathematics by 1 standard error, but it accepted the cut scores for reading. Both sets of results were reported in 1993. These decisions were the subject of much debate, some of which is characterized in publicly available reports (see, e.g., Bourque, 2009; National Research Council, 1999; Shepard et al., 1993; Vinovskis, 1998). The committee attempted to recreate the discussions around these decisions, but ultimately decided that attempts to characterize events that occurred more than 25 years ago would not likely capture the complexity of the full discussions and deliberations. The committee was hesitant to make judgments about the rationale for decisions made long ago; at the same time, we acknowledge that some of these issues warranted—and still warrant—further investigation.
CONCLUSION 4-1 The available documentation of the 1992 standard settings in reading and mathematics includes the types of reliability analyses called for in the Standards for Educational and Psychological Testing that were in place at the time and those that are currently in place. The evidence that resulted from these analyses, however, showed considerable variability among panelists’ cut-score judgments: the expected pattern of decreasing variability among panelists across the rounds was not consistently achieved, and panelists’ cut-score estimates were not consistent over different item formats and different levels of item difficulty. These issues were not resolved before achievement-level results were released to the public.
Validity of Achievement Levels
To evaluate the validity of the achievement levels, the committee examined both content-related and criterion-related validity evidence. We note that validation is an ongoing process; it does not stop after one or two studies are completed. Moreover, it does not yield an unequivocal yes-or-no answer that a test is or is not valid. Collecting and evaluating validity evidence is continual: new sources with new kinds of data and new ways to analyze them produce new evidence. At any time, evidence about validity may be strengthened or refuted as new findings are reported. Therefore, we examined evidence that was collected in 1992 for the NAEP standard setting and the conclusions drawn about it; we also examined evidence that has been collected since then and considered how it might affect those conclusions.
ACT conducted several studies to obtain feedback from content-area experts regarding the validity of ALDs and exemplar items, and the content experts who participated in these studies suggested changes to both the descriptors and the items. The descriptors were further revised by NAGB, and those final, official versions are quite different from the ones used for setting the cut scores (see the Annex to Chapter 5). There were differences of opinion on the extent to which the final descriptions were aligned with the framework, the item pool, and contemporary thinking in mathematics and reading subject matter.
Since 1992, changes have been made to the mathematics and reading frameworks, the assessment tasks, and the ALDs, most notably, in 2005 and 2009. With the exception of grade-12 mathematics in 2005, no changes have been made in the cut scores. Moreover, the grade-12 descriptors for mathematics were changed in 2005 and 2009, but related changes were not made to those for grades 4 and 8, suggesting a break in the continuum of skills across the grades.1
On these issues, we draw two conclusions.
CONCLUSION 5-1 The studies conducted to assess content validity are in line with those called for in the Standards for Educational and Psychological Testing in place in 1992 and in the current version. The results of these studies suggested that changes in the achievement-level descriptors (ALDs) were needed, and they were subsequently made. These changes may have better aligned the descriptors to the framework and exemplar items, but as a consequence, the final ALDs were
not the ones used to set the cut scores. Since 1992, there have been additional changes to the frameworks, the item pools, the assessments, and studies to identify needed revisions to the ALDs. But, to date, there has been no effort to set new cut scores using the most current ALDs.2
CONCLUSION 5-2 Changes in the National Assessment of Educational Progress mathematics frameworks in 2005 led to new achievement-level descriptors and a new scale and cut scores for the achievement levels at the 12th grade, but not for the 4th and 8th grades. These changes create a perceived or actual break between 12th-grade mathematics and 4th- and 8th-grade mathematics. Such a break is at odds with contemporary thinking in mathematics education, which holds that school mathematics should be coherent across grades.
Criterion-related validity evidence usually consists of comparisons with other indicators of the content and skills measured by the assessment, in this case, other measures of achievement in reading and mathematics. The goal is to help evaluate the extent to which achievement levels are reasonable and set at an appropriate level.
It can be challenging to identify and collect the kinds of data that are needed to evaluate criterion-related validity. The ACT reports that document the validity of the achievement levels do not include results from any studies that compared NAEP achievement levels to external measures. It is not clear why NAGB did not pursue such studies. In contrast, the NAEd reports include a variety of such studies, such as comparisons with state assessments, international assessments, advanced placement (AP) tests, and college admissions tests. NAEd also conducted a special study in which 4th- and 8th-grade teachers classified their own students into the achievement-level categories by comparing ALDs with the students’ classwork. We examined evidence from similar sources for our evaluation and consider it to be of high value in judging the reasonableness of achievement levels.
CONCLUSION 5-3 The Standards for Educational and Psychological Testing in place in 1992 did not explicitly call for criterion-related validity evidence for achievement-level setting, but such evidence was routinely examined by testing programs. The National Assessment Governing Board did not
report information on criterion-related evidence to evaluate the reasonableness of the cut scores set in 1992. The National Academy of Education evaluators reported four kinds of criterion-related validity evidence, and they concluded that the cut scores were set very high. We were not able to determine whether this evidence was considered when the final cut scores were adopted for the National Assessment of Educational Progress.
More recently, there has been a focus on external validity and predictive validity: that is, setting benchmarks on NAEP that are related to concurrent or future performance on measures external to NAEP. Through these studies, NAGB has identified scale scores on the grade-12 mathematics and reading assessments that indicate academic preparedness for college. Some of the new research has investigated the relationships between NAEP scores and external criterion measures, such as college performance. The findings from this research can be used to evaluate the validity of new interpretations of the existing performance standards, suggest possible adjustments to the cut scores or descriptors, and enhance understanding and use of the achievement-level results. Such research can also help establish specific benchmarks that are separate from the existing achievement levels. This type of research is critical for full understanding of the meaning of the achievement levels.
CONCLUSION 5-4 Since the National Assessment of Educational Progress (NAEP) achievement levels were set, new research has investigated the relationships between NAEP scores and external measures, such as academic preparedness for college. The findings from this research can be used to evaluate the validity of new interpretations of the existing performance standards, suggest possible adjustments to the cut scores or descriptors, or enhance understanding and use of the achievement-level results. This research can also help establish specific benchmarks that are separate from the existing achievement levels. This type of research is critical for adding meaning to the achievement levels.
Interpretation and Use
Originally, NAEP was designed to measure and report what students in the United States actually know and are able to do. The adoption of achievement levels added an extra layer of reporting to reflect the nation’s aspirations for students. Achievement levels were designed to lay out what students should know and be able to do. Reporting NAEP results
as the percentage of students who score at each achievement level was intended to make NAEP results more understandable. This type of reporting was designed to clearly and succinctly highlight the extent to which U.S. students are meeting the nation’s expectations.
For this aspect of its evaluation, the committee looked for information on both the intended and the actual uses and users of NAEP achievement-level reports. We also considered the extent to which research evidence supports these uses.
On the first point, intended use, the committee was unable to find any official documents that provide guidance on the intended interpretations and uses of NAEP achievement levels, other than the brief statements in policy documents in 1990, 1993, and 1995. The committee was also unable to find documents that specified appropriate uses and the associated research to support those uses. Considerable information is provided to state and district personnel and members of the press in preparation for a release of NAEP results, and NAGB provided us with examples of these materials. But even this type of information is not readily accessible on the NAEP Website.
On the second point, actual use, we found that there are a variety of audiences for achievement-level reports, and these audiences use the information in a variety of ways. Achievement levels were initially developed to inform public discourse and policy decisions, and we found evidence that they are being used in this way.
There is a disconnect between the kind of validity evidence that has been collected about intended uses and the many actual interpretations and uses of NAEP data. The validity evidence documents the integrity and accuracy of the procedures used to set the achievement levels, but the evidence does not extend to the actual uses—the way that NAEP audiences use the results and the decisions they base on them.
Evidence suggests that the achievement levels are widely used by the public. However, interpretive guidance is provided to users in an inconsistent and piecemeal way. Some audiences receive considerable guidance just prior to a release of results. For audiences that obtain most of their information from the Website or hard-copy reports for the general public, interpretive guidance is difficult to locate. Without appropriate guidance, misuses are likely.
CONCLUSION 6-1 The National Assessment of Educational Progress achievement levels are widely disseminated to and used by many audiences, but the interpretive guidance about the meaning and appropriate uses of those levels provided to users is inconsistent and piecemeal. Without appropriate guidance, misuses are likely.
CONCLUSION 6-2 Insufficient information is available about the intended interpretations and uses of the achievement levels and the validity evidence that supports these interpretations and uses. There is also insufficient information on the actual interpretations and uses commonly made by the National Assessment of Educational Progress’s various audiences and little evidence to evaluate the validity of any of them.
CONCLUSION 6-3 The current achievement-level descriptors may not provide users with enough information about what students at a given level know and can do. The descriptors do not clearly provide accurate and specific information about the things that students at the cut score for each level know and can do.
In this report, we explore a number of important issues with the NAEP achievement levels established in 1992. Researchers and others have raised questions about the standard setting methodology, the validity of the ALDs, and the appropriateness of the cut scores. The committee finds some of these criticisms to be compelling and to bear on our judgments of the extent to which the achievement levels are valid, reliable, reasonable, and informative.
The committee recognizes that the achievement levels—although labeled as “trial”—are a well-established part of NAEP, with wide influence on state K-12 achievement tests. Making changes to something that has been in place for more than 24 years would likely have a multitude of consequences. We understand the difficulties that might be created by setting new standards, particularly the disruptions that would result from breaking trend-line information. We think this would be unwise at a time when so many other things are in flux—many states are transitioning to the Common Core State Standards and implementing the associated assessments; many other states are transitioning to Common Core–like standards and assessments; and NAEP is moving toward digital assessment with new item types. Congress recently reauthorized the Elementary and Secondary Education Act (the Every Student Succeeds Act); it is not yet known what effects this may have on NAEP’s role in relation to state achievement testing.
The committee considered several courses of action, ranging from
recommending no changes to recommending a completely new standard setting. The country has been tracking progress on three cut scores over the past 24 years. These points are labeled and defined through the ALDs, and those descriptors can be revised and updated without conducting a completely new standard setting. The descriptors for grade-12 mathematics and for grade-4, -8, and -12 reading were revised and updated in 2009. However, the “anchor studies” (discussed in Chapter 4) show that 27 percent of grade-4 reading items and 21 percent of grade-12 mathematics items did not “anchor” to an achievement-level category, suggesting some misalignment between items and ALDs. Moreover, no changes have been made to the descriptors for mathematics in grades 4 and 8 since 2004.
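To make the notion of "anchoring" concrete: in an anchor study, an item is said to anchor at an achievement level roughly when students scoring at that level are likely to answer it correctly while students at the next lower level are not. The sketch below illustrates this logic; the specific thresholds (65 percent correct at the level, at least a 30-percentage-point gap over the level below) and the item data are illustrative assumptions, not NAEP's actual criteria or results.

```python
# Hypothetical sketch of an anchoring check, loosely modeled on scale
# anchoring. Thresholds and data are invented for illustration.
def anchors(p_at_level: float, p_below: float,
            min_p: float = 0.65, min_gap: float = 0.30) -> bool:
    """An item 'anchors' at a level if students at that level are likely
    to answer it correctly and students at the level below are not."""
    return p_at_level >= min_p and (p_at_level - p_below) >= min_gap

# Proportions correct for three hypothetical items among students scoring
# at the Proficient cut and at the Basic cut.
items = [
    ("item A", 0.78, 0.40),  # anchors: high success, large gap
    ("item B", 0.70, 0.55),  # does not anchor: gap over Basic too small
    ("item C", 0.50, 0.20),  # does not anchor: success rate too low
]
for name, p_prof, p_basic in items:
    print(name, "anchors at Proficient:", anchors(p_prof, p_basic))
```

Under this kind of rule, a large share of non-anchoring items (such as the 27 and 21 percent figures cited above) signals that the ALDs do not cleanly describe what distinguishes performance at adjacent levels.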
It is important that the descriptors be well aligned with other parts of the system (framework, items, cut scores) and convey accurate information about the meaning of the achievement levels. For the time being, the disruption in the trend line could be avoided by continuing to use the same cut scores but revising their descriptions as needed. Weighing the options, we conclude that most of the significant arguments in favor of setting new standards can be addressed instead by revising the ALDs, most importantly for grade-4 and -8 mathematics. Additional work to evaluate the alignment of the items and the ALDs for grade-4 reading and grade-12 mathematics is also needed.
CONCLUSION 7-1 The cut scores for grades 4 and 8 in mathematics and all grades in reading were set more than 24 years ago. Since then, there have been many adjustments to the frameworks, item pools, assessments, and achievement-level descriptors, but there has been no effort to set new cut scores for these assessments. Although priority has been given to maintaining the trend lines, it is possible that there has been “drift” in the meaning of the cut scores such that the validity of inferences about trends is questionable. The situation for grade-12 mathematics is similar, although possibly to a lesser extent because the cut scores were set more recently (in 2005) and, thus far, only one round of adjustments has been made (in 2009).4
CONCLUSION 7-2 Although there is evidence to support conducting a new standard setting at this time for all grades in reading and mathematics, setting new cut scores would disrupt the National Assessment of Educational Progress trend line
at a time when many other contextual factors are changing. In the short term, the disruption in the trend line could be avoided by continuing to use the same cut scores but ensuring the descriptions are aligned with them. In particular, work is needed to ensure that the mathematics achievement-level descriptors (ALDs) for grades 4 and 8 are well aligned with the framework, cut scores, and item pools. Additional work to evaluate the alignment of the items and the ALDs for grade-4 reading and grade-12 mathematics is also needed. This work should not be done piecemeal, one grade at a time; rather, it should be done in a way that maintains the continuum of skills and knowledge across grades.5
NAEP and its achievement levels loom large in public understanding of critical debates about education, excellence, and opportunity. One can fairly argue that The Nation’s Report Card is a success for that reason alone. Through so many years of use, the NAEP achievement levels have acquired a “use validity” or reasonableness by virtue of familiarity.
In the long term, we recommend a thorough revision of the ALDs that is informed by a suite of educational, social, and economic outcomes important to key audiences. We envision a set of descriptions that correspond to a few salient outcomes, such as college readiness or international comparisons. The studies we recommend, however, would also offer ways to characterize other scale score points. This information should be available to the public along with test item exemplars. The more audiences understand the scale scores, the less likely they are to misuse the achievement levels.
Setting new cut scores at this time, when so many things are in flux, would likely create considerable confusion about their meaning. We do not encourage a new standard setting at this time. However, we note that at some point the balance of concerns will tip to favor new standard setting procedures. There will be evolution in the methodology, assessment frameworks, the technology of test administration and hence the nature of items, and more. We suggest that the U.S. Department of Education announce an intention to revisit this issue after a specified number of years. We offer specific recommendations below.
RECOMMENDATION 1 Alignment among the frameworks, item pools, achievement-level descriptors (ALDs), and the cut scores is fundamental to the validity of inferences about student achievement. In 2009, alignment was evaluated for all grades in reading and for grade 12 in mathematics, and changes were made to the ALDs, as needed. Similar research is needed to evaluate alignment for the grade-4 and grade-8 mathematics assessments and to revise them as needed to ensure that they represent the knowledge and skills of students at each achievement level. Moreover, additional work to verify alignment for grade-4 reading and grade-12 mathematics is needed.6
RECOMMENDATION 2 Once satisfactory alignment among the frameworks, the item pools, the achievement-level descriptors, and the cut scores in National Assessment of Educational Progress mathematics and reading has been demonstrated, their designation as trial should be discontinued. This work should be completed and the results evaluated as stipulated by law: 20 U.S. Code 9622: National Assessment of Educational Progress (https://www.law.cornell.edu/uscode/text/20/9622 [September 2016]).
RECOMMENDATION 3 To maintain the validity and usefulness of achievement levels, there should be regular recurring reviews of the achievement-level descriptors, with updates as needed, to ensure they reflect both the frameworks and the incorporation of those frameworks in National Assessment of Educational Progress assessments.7
The notion of Proficient (or Basic or Advanced) is abstract. It is defined through the ALDs and explicated through sample test questions. But this definition and explication connect its meaning only to the assessment and the framework, not to real-world measures that hold value to the public. When a doctor is licensed or an accountant is certified, it is understood that the person is ready to practice medicine or accounting. When someone is judged to be proficient in reading or mathematics, the obvious question is “for what?”
For NAEP, the answer tends to be defined by the user, and that answer
may not always be what was intended. Instead, it would be valuable for Proficient to be linked to some real-world measure, such as relating 12th-grade reading and mathematics performance to college readiness, which would provide concrete meaning and connect the results to something the public values.
RECOMMENDATION 4 Research is needed on the relationships between the National Assessment of Educational Progress (NAEP) achievement levels and concurrent or future performance on measures external to NAEP. Like the research that led to setting scale scores that represent academic preparedness for college, new research should focus on other measures of future performance, such as being on track for a college-ready high school diploma for 8th-grade students and readiness for middle school for 4th-grade students.8
Actions are needed to improve the interpretation and use of NAEP reports, maintain the validity and usefulness of NAEP data, and ensure the currency of the NAEP achievement levels. The first step is to develop more concrete guidance for users on appropriate and inappropriate interpretations of achievement levels, to avoid NAEP’s audiences attaching their own understandings to them. There needs to be a validity argument that connects actual uses to research documenting that those uses are appropriate. However, whatever interpretations and uses are intended, users will not make them without appropriate guidance, instructions, and caveats.
Users need to be reminded of the unique characteristics of NAEP and its achievement levels:
- NAEP does not report results for individual students.
- The achievement levels reflect performance, not students.
- The achievement levels make no mention of “at grade level” performance; hence, the Proficient level neither reflects “at grade” performance nor is synonymous with “proficiency” in the subject.
- The Basic level is less than full mastery but more than minimal competency.
- Even the best students may not meet the specified requirements for the Advanced level.
These caveats do not routinely appear in either hard-copy or electronic reports of NAEP achievement-level results. Reporting on the percentage of students who are proficient or advanced on 4th-grade reading, for instance, invites a grade-level interpretation of the results. Interpretive guidance and reporting strategies are needed to help users better
understand the results. Such guidance includes increasing the number of exemplar items to explicate what performance at a given level means, as well as the use of new reporting mechanisms.
RECOMMENDATION 5 Research is needed to articulate the intended interpretations and uses of the achievement levels and to collect validity evidence to support these interpretations and uses. In addition, research is needed to identify the actual interpretations and uses commonly made by the National Assessment of Educational Progress’s various audiences and to evaluate the validity of each of them. This information should be communicated to users with clear guidance on substantiated and unsubstantiated interpretations.9
Since 1983, NAEP results have been reported using statistics based on the scale score metric, including the mean, median, mode, standard deviation, and percentiles. Achievement levels were adopted later as an additional reporting device to serve specific communication and policy purposes. Reporting the percentage of students who score at each achievement level (or the percentage above a certain cut score) was intended to make the results more understandable to all audiences. Reporting results as the percentage at or above a given cut score is suitable for some inferences; for others, it can be misleading. Users need guidance to help them decide on the metric to use, that is, when to use the percentages at or above an achievement level and when to use statistics based on the scale score.
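The need for such guidance can be illustrated with a small, entirely hypothetical example: for the same data, the mean scale score and the percentage at or above a cut score can move in opposite directions, so the two metrics can support different inferences about change. The cut score and score distributions below are invented for illustration only.

```python
# Illustrative sketch (hypothetical data): mean scale score vs. percentage
# at or above a cut score can tell different stories about change.
import statistics

cut = 250  # hypothetical cut score for one achievement level

year1 = [230, 240, 248, 252, 260, 270, 280, 240, 249, 251]
year2 = [225, 235, 251, 253, 255, 265, 275, 252, 254, 250]

def pct_at_or_above(scores, cut):
    """Percentage of scores at or above the cut score."""
    return 100 * sum(s >= cut for s in scores) / len(scores)

for label, scores in [("year 1", year1), ("year 2", year2)]:
    print(label,
          f"mean = {statistics.mean(scores):.1f},",
          f"% at/above cut = {pct_at_or_above(scores, cut):.0f}%")
```

In this contrived example the mean declines slightly from year 1 to year 2 while the percentage at or above the cut rises sharply, because scores just below the cut moved just above it. Which metric is the right one depends on the inference being made, which is the point of the guidance called for in Recommendation 6.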
RECOMMENDATION 6 Guidance is needed to help users determine inferences that are best made with achievement levels and those best made with scale score statistics. Such guidance should be incorporated in every report that includes achievement levels.10
Finally, a number of aspects of the NAEP reading and mathematics assessments have changed since 1992. There have been changes in constructs and frameworks; new types of items have been used, including more constructed-response questions; the ways of reporting results have changed; and innovative Web-based data tools have been designed.
In addition, NAEP data have been used in new ways, such as reporting results for urban districts, including NAEP in federal accountability provisions (such as the No Child Left Behind Act), and establishing college readiness benchmarks. New linking studies have interpreted NAEP results in terms of the results of international assessments, with possibilities for linking NAEP 4th- and 8th-grade results to indicate being on track for future learning. Major national initiatives external to NAEP, but connected to it, have significantly altered state standards in reading and mathematics.
These and other factors imply a changing context for NAEP. Staying current with contemporary practices and issues and maintaining the trend line for NAEP results are competing goals. NAGB and the National Center for Education Statistics should periodically evaluate the advantages and disadvantages of each.
RECOMMENDATION 7 The National Assessment of Educational Progress (NAEP) should implement a regular cycle for considering the desirability of conducting a new standard setting. Factors to consider include, but are not limited to, substantive changes in the constructs, item types, or frameworks; innovations in the modality for administering assessments; advances in standard setting methodologies; and changes in the policy environment for using NAEP results. These factors should be weighed against the downsides of interrupting the trend data and information.11