The National Assessment of Educational Progress (NAEP) has been providing policy makers, educators, and the public with reports on the academic performance and progress of the nation’s students since 1969. The assessment is given periodically in a variety of subjects: mathematics, reading, writing, science, the arts, civics, economics, geography, U.S. history, and technology and engineering literacy. NAEP is often referred to as The Nation’s Report Card because it reports on the educational progress of the nation as a whole. The assessment is not given to all students in the country, and scores are not reported for individual students. Instead, the assessment is given to representative samples of students across the United States, and results are reported for the nation and for specific groups of students.
Since 1983, the results have been reported as average scores on a scale ranging from 0 to 500. Until 1992, results were reported on this scale for the nation as a whole and for students grouped by sociodemographic characteristics, such as gender, race and ethnicity, and socioeconomic status. Beginning in 1993, results were reported separately by state, and, beginning in 2002, also for some urban school districts.
Over time, there has been growing interest in comparing educational progress across the states. At the same time, there has been increasing interest in having the results reported in a way that policy makers and the public could understand and that could be used to examine students’ achievement in relation to high, world-class standards. By 1989, there was considerable support for changes in the way that NAEP results were reported.
In part in response to these interests, the idea of reporting NAEP results using achievement levels was first raised in the late 1980s. The Elementary and Secondary Education Act of 1988, which authorized the formation of the National Assessment Governing Board (NAGB), delegated to NAGB the responsibility of “identifying appropriate achievement goals” (P.L. 100-297, Part C, Sec. 3403(6)(A)). The decision to report NAEP results in terms of achievement levels was based on NAGB’s interpretation of this legislation.
In a 1990 policy statement, NAGB established three “achievement levels”: Basic, Proficient, and Advanced. The NAEP results would henceforth report the percentage of test takers at each achievement level. The percentage of test takers who scored below the Basic level would also be reported. These new reports would be in addition to summary statistics based on the score scale.
After a major standard setting process in 1992, NAEP began reporting results in relation to the three achievement levels. However, the use of achievement levels has provoked controversy and disagreement, and evaluators have identified numerous concerns. When NAEP was reauthorized in 1994, Congress stipulated that until an evaluation determined that the achievement levels are reasonable, reliable, valid, and informative to the public, they were to be designated as trial—a provisional status that still remains, 22 years later.
In 2014, the U.S. Department of Education, through its Institute of Education Sciences, sought to reexamine the need for this provisional status and contracted with the National Academy of Sciences to appoint a committee of experts to carry out that examination. The committee’s charge was to determine whether the NAEP achievement levels in reading and mathematics are reasonable, reliable, valid, and informative to the public. More specifically, it was to evaluate the student achievement levels used in reporting NAEP results, the procedures for setting those levels, and how they are used (see Chapter 1 for the complete charge).
In addressing its charge, the committee focused on process, outcomes, and uses. That is, we evaluated (1) the process for conducting the standard setting; (2) the technical properties of the outcomes of standard setting (the cut scores and the achievement-level descriptors or ALDs); and (3) the interpretations and uses of the achievement levels.
In developing achievement levels, NAGB first needed to set standards, a process that involves determining “how good is good enough”
in relation to one or more criterion measures. In education, for instance, a common criterion is the level of performance needed to earn an A; in employment settings, it is the minimum test score needed to become certified or licensed to practice in a given field (whether plumbing or medicine).
To set achievement standards for NAEP, two questions had to be answered: What skills and knowledge do students need in order to be considered Basic, Proficient, and Advanced in each subject area? What scores on the test indicate that a student has attained one of those achievement levels?
All standard setting is based on judgment. For a course grade, it is the judgment of the classroom teacher. For a licensure or certification test, it is the judgment of professionals in the field. For NAEP, it is more complicated. As a measure of achievement for a cross-section of U.S. students, NAEP’s achievement levels need to reflect common goals for student learning, despite the fact that students are taught according to curricula that vary across states and districts. To accommodate these differences, NAEP’s standards need to reflect a wide spectrum of judgments. Hence, NAGB sought feedback from a wide range of experts and stakeholders in setting the standards: educators, administrators, subject-matter specialists, policy makers, parent groups, and professional organizations, as well as the general public.
Through the standard setting process, NAGB adopted a set of achievement levels for each subject area and grade. The achievement levels include a description of the knowledge and skills necessary to perform at the Basic, Proficient, and Advanced levels, as well as the “cut score,” the minimum score needed to attain each achievement level.
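The mechanics of this kind of reporting can be sketched in a few lines of code: a scale score is assigned to the highest achievement level whose cut score it meets, and results are summarized as the percentage of test takers at each level. The cut scores below are invented placeholders for illustration, not actual NAEP values.

```python
from collections import Counter

# Hypothetical cut scores on a 0-500 scale (NOT actual NAEP values).
# Listed in ascending order: Basic, Proficient, Advanced.
CUT_SCORES = {"Basic": 250, "Proficient": 300, "Advanced": 350}

def classify(scale_score):
    """Assign a scale score to the highest achievement level attained."""
    level = "Below Basic"
    for name, cut in CUT_SCORES.items():  # ascending order of cut scores
        if scale_score >= cut:
            level = name
    return level

def percent_at_each_level(scores):
    """Percentage of test takers at each achievement level."""
    counts = Counter(classify(s) for s in scores)
    return {lvl: 100 * counts[lvl] / len(scores)
            for lvl in ["Below Basic", "Basic", "Proficient", "Advanced"]}
```

A score exactly at a cut score attains that level, which is why the cut score is described as the minimum score needed for each level.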
FINDINGS AND CONCLUSIONS
In setting standards for the 1992 reading and mathematics assessments, NAGB broke new ground. Although standard setting has a long history, its use in the area of educational achievement testing—and to set multiple standards for a given assessment—was new. While the Standards for Educational and Psychological Testing in place at the time provided guidance for some aspects of the 1992 standard setting, many of the procedures used were novel and untested in the context of achievement testing for kindergarten through 12th grade (K-12).
In carrying out the process, NAGB sought advice and assistance from many measurement and subject-matter experts, including staff of the standard setting contractor, an advisory group of individuals with extensive
standard setting expertise, and NAGB’s own advisers. In addition, a panel of members of the National Academy of Education (NAEd) evaluated the work being done.
The NAEd evaluators raised questions about the integrity and validity of the process. Perhaps most importantly, they criticized the ALDs, arguing that they were not valid representations of performance at the specified levels. They also criticized the specific method for setting the cut scores, arguing that it was too cognitively complex, thus limiting the validity of the results.
In spite of the NAEd evaluators’ concerns, NAGB moved forward with achievement-level reporting for the 1992 assessments of mathematics and reading. Since then, NAGB and the National Center for Education Statistics (NCES) have sponsored research conferences, sought advice from experts in standard setting, commissioned research, formed standing advisory groups, held training workshops, and published materials on standard setting.
For its review, the committee considered the Standards for Educational and Psychological Testing and guidance available in 1992, along with what is known now. In examining the process, we considered the ways in which panelists were selected and trained, how the method for setting the cut scores was selected and implemented, and how the ALDs were developed. Our key findings are as follows:
- The process for selecting standard setting panelists was extensive and, in our judgment, likely to have produced a set of panelists that represented a wide array of views and perspectives.
- In selecting a cut-score setting method, NAGB and ACT chose one method for the multiple-choice and short-answer questions and another for the extended-response questions. This was novel at the time and is now widely recognized as a best practice.
- NAEP’s 1992 standard setting represented the first time that formal, written ALDs were developed to guide the standard setting panelists. This, too, was novel at the time and is now widely recognized as a best practice.
CONCLUSION 3-1 The procedures used by the National Assessment Governing Board for setting the achievement levels in 1992 are well documented. The documentation includes the kinds of evidence called for in the Standards for Educational and Psychological Testing in place at the time and currently and was in line with the research and knowledge base at the time.
The standard setting process used for NAEP began with the frameworks (the blueprints for the mathematics and reading assessments), a general policy description of what each level is intended to represent (e.g., mastery over challenging subject matter), and a set of items that have been used to assess the knowledge and skills elaborated in the assessment frameworks. The standard setting process produces two key outcomes. The first outcome is a set of detailed ALDs, specifying the knowledge and skills required at each of the achievement levels. The second outcome is the cut score that indicates the minimum scale-score value for each achievement level. The achievement levels defined by these cut scores provide the basis for using and interpreting test results, and thus, the validity of test score interpretations hinges on the appropriateness of the cut scores.
In evaluating these outcomes, the committee examined evidence of their reliability and validity. In the context of standard setting, reliability is a measure of the consistency, generalizability, and stability of the judgments (i.e., cut scores). Reliability estimates indicate the extent to which the cut-score judgments are likely to be consistent across replications of the standard setting, such as repeating the standard setting with different panelists, different test questions, on different occasions, or with different methods.
NAGB conducted studies to collect three kinds of reliability evidence: interpanelist agreement; intrapanelist consistency across items of different types; and the stability of cut scores across occasions. The actual values of the estimates of consistency suggest a considerable amount of variability in cut-score judgments. The available documentation notes that this issue received considerable attention, but the sources and the effects of the variability were not fully addressed before achievement-level results were reported. We are hesitant to make judgments about the rationale for decisions made long ago; at the same time, we acknowledge that some of these issues warranted further investigation.
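The interpanelist agreement evidence described above can be illustrated with a small sketch. The cut-score judgments below are invented for illustration only, not real NAEP data; the quantities computed (mean cut score, standard deviation across panelists, and standard error of the mean) are the basic statistics such reliability analyses report, and variability is typically expected to shrink across rounds as panelists discuss and converge.

```python
from statistics import mean, stdev
from math import sqrt

# Invented cut-score judgments from six panelists across three rounds
# (NOT real NAEP data).
rounds = {
    1: [278, 295, 310, 262, 301, 288],
    2: [284, 292, 303, 275, 298, 290],
    3: [286, 291, 299, 282, 295, 289],
}

for r, judgments in rounds.items():
    m = mean(judgments)
    sd = stdev(judgments)           # spread among panelists' judgments
    se = sd / sqrt(len(judgments))  # standard error of the mean cut score
    print(f"round {r}: mean={m:.1f}, sd={sd:.1f}, se={se:.1f}")
```

In these invented data the standard deviation falls from round to round, the expected pattern; the 1992 evidence, by contrast, did not consistently show such convergence.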
CONCLUSION 4-1 The available documentation of the 1992 standard settings in reading and mathematics includes the types of reliability analyses called for in the Standards for Educational and Psychological Testing that were in place at the time and those that are currently in place. The evidence that resulted from these analyses, however, showed considerable variability among panelists’ cut-score judgments: the expected pattern of decreasing variability among panelists across the rounds was not consistently achieved, and panelists’ cut-score estimates were not consistent over different item formats and different levels of item difficulty. These issues were not resolved before achievement-level results were released to the public.
Validation in the context of standard setting usually consists of demonstrating that the proposed cut score for each achievement level corresponds to the ALD and that the achievement levels are set at a reasonable level, not too low or too high. Accordingly, studies were conducted to provide evidence of validity related to test content and relationships with external criteria.
Content-Related Validity Evidence
With regard to content-related validity evidence, the studies focused on the alignment between the ALDs and cut scores, the frameworks, and the test questions. For these studies, a second and sometimes a third group of panelists was asked to review the ALDs and cut scores produced by the initial standard setting. As a result of these reviews, changes were made to the ALDs: some were suggested to NAGB by the panelists; others were made by NAGB.
Since 1992, changes have been made to the mathematics and reading frameworks, the assessment tasks, and the ALDs—most notably, in 2005 and 2009. With the exception of grade-12 mathematics in 2005, no changes have been made in the cut scores. Moreover, the grade-12 descriptors for mathematics were changed in 2005 and 2009, but related changes were not made to those for grades 4 and 8. Consequently, the final descriptors were not the ones that panelists used to set the cut scores.
CONCLUSION 5-1 The studies conducted to assess content validity are in line with those called for in the Standards for Educational and Psychological Testing in place in 1992 and currently in 2016. The results of these studies suggested that changes in the achievement-level descriptors (ALDs) were needed, and they were subsequently made. These changes may have better aligned the descriptors to the framework and exemplar items, but as a consequence, the final ALDs were not the ones used to set the cut scores. Since 1992, there have been additional changes to the frameworks, item pools, the assessments, and studies to identify needed revisions to the ALDs. But, to date, there has been no effort to set new cut scores using the most current ALDs.1
CONCLUSION 5-2 Changes in the National Assessment of Educational Progress mathematics frameworks in 2005 led to new achievement-level descriptors and a new scale and cut scores for the achievement levels at the 12th grade, but not for the 4th and 8th grades. These changes create a perceived or actual break between 12th-grade mathematics and 4th- and 8th-grade mathematics. Such a break is at odds with contemporary thinking in mathematics education, which holds that school mathematics should be coherent across grades.
Criterion-Related Validity Evidence
Criterion-related validity evidence usually consists of comparisons with other indicators of the content and skills measured by an assessment, in this case, other measures of achievement in reading and mathematics. The goal is to help to evaluate the extent to which achievement levels are reasonable and set at an appropriate level.
It can be challenging to identify and collect the kinds of data that are needed to provide evidence of criterion-related validity. ACT reports that document the validity of the achievement levels do not include results from any studies that compared NAEP achievement levels with external measures. It is not clear why NAGB did not pursue such studies. In contrast, the NAEd reports include a variety of studies, such as comparisons with state assessments, international assessments, advanced placement tests, and college admissions tests. NAEd further conducted a special study in which 4th- and 8th-grade teachers classified their own students into the achievement-level categories by comparing the ALDs with the students’ classwork. We examined evidence from similar sources for our evaluation and consider it to be of high value in judging the reasonableness of the achievement levels.
Our comparisons reveal considerable correspondence between the percentages of students at NAEP achievement levels and the percentages on other assessments. These studies show that the NAEP achievement-level results (the percentage of students at the Advanced level) are generally consistent with the percentage of U.S. students scoring at the reading and mathematics benchmarks on the Programme for International Student Assessment (PISA), the mathematics benchmarks on the Trends in International Mathematics and Science Study (TIMSS), and at the higher levels for AP exams. These studies also show that significant numbers of students in other countries score at the equivalent of the NAEP Advanced level.
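At its simplest, a correspondence check of this kind reduces to asking whether the share of students at or above comparable benchmarks is similar across assessments. The figures below are invented placeholders, not actual NAEP, PISA, or TIMSS results; the point is only the shape of the comparison.

```python
# Invented percentages at/above a top benchmark on each assessment
# (NOT actual results).
percent_at_top_benchmark = {
    "NAEP Advanced (hypothetical)": 8.0,
    "TIMSS advanced benchmark (hypothetical)": 9.0,
    "PISA top levels (hypothetical)": 8.5,
}

def max_gap(results):
    """Largest pairwise difference, a crude index of correspondence."""
    values = list(results.values())
    return max(values) - min(values)

# A small gap is consistent with the benchmarks sitting at comparable
# points on the different assessments' scales.
print(f"largest gap: {max_gap(percent_at_top_benchmark):.1f} percentage points")
```

Real criterion studies also require linking the assessments' samples and scales, which is substantially harder than this comparison of marginal percentages.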
CONCLUSION 5-3 The Standards for Educational and Psychological Testing in place in 1992 did not explicitly call for
criterion-related validity evidence for achievement-level setting, but such evidence was routinely examined by testing programs. The National Assessment Governing Board did not report information on criterion-related evidence to evaluate the reasonableness of the cut scores set in 1992. The National Academy of Education evaluators reported four kinds of criterion-related validity evidence, and they concluded that the cut scores were set very high. We were not able to determine whether this evidence was considered when the final cut scores were adopted for the National Assessment of Educational Progress.
Recent research has focused on validity evidence based on relationships with external variables, that is, setting benchmarks on NAEP that are related to concurrent or future performance on measures external to NAEP. The findings from this research can be used to evaluate the validity of new interpretations of the existing achievement levels, suggest possible adjustments to the cut scores or descriptors, or enhance understanding and use of the achievement-level results. This research can also help establish specific benchmarks that are separate from the existing achievement levels, such as college readiness.
CONCLUSION 5-4 Since the National Assessment of Educational Progress (NAEP) achievement levels were set, new research has investigated the relationships between NAEP scores and external measures, such as academic preparedness for college. The findings from this research can be used to evaluate the validity of new interpretations of the existing performance standards, suggest possible adjustments to the cut scores or descriptors, or enhance understanding and use of the achievement-level results. This research can also help establish specific benchmarks that are separate from the existing achievement levels. This type of research is critical for adding meaning to the achievement levels.
Interpretation and Use
Originally, NAEP was designed to measure and report what U.S. students actually know and are able to do. However, the achievement levels were designed to lay out what U.S. students should know and be able to do. That is, the adoption of achievement levels added an extra layer of reporting to reflect the nation’s aspirations for students. Reporting NAEP results as the percentage of students who scored at each achievement level was intended to make NAEP results more understandable. This type of
reporting was designed to clearly and succinctly highlight the extent to which U.S. students are meeting expectations.
The committee was unable to find any official documents that provide guidance on the intended interpretations and uses of NAEP achievement levels, beyond the brief statements in two policy documents. The committee was also unable to find documents that specifically lay out appropriate uses and the associated research to support these uses. We found a disconnect between the kind of validity evidence that has been collected and the kinds of interpretations and uses that are made of NAEP’s reported results. That is, although the committee found evidence for the integrity and accuracy of the procedures used to set the achievement levels, the evidence does not extend to the uses of the achievement levels—the way that NAEP audiences use the results and the decisions they base on them.
The committee found that considerable information is provided to state and district personnel and the media in preparation for a release of NAEP results, and NAGB provided us with examples of these materials. However, this type of information was not easy to find on the NAEP Website.
The many audiences for NAEP achievement levels use them in a variety of ways, including to inform public discourse and policy decisions, as was the original intention. However, interpretive guidance provided to users is inconsistent and fragmented. Some audiences receive considerable guidance just prior to a release of results. For audiences that obtain most of their information from the Website or hard-copy reports for the general public, interpretive guidance is difficult to locate. Without appropriate guidance, misuses are likely. The committee found numerous types of inappropriate inferences.
CONCLUSION 6-1 The National Assessment of Educational Progress achievement levels are widely disseminated to and used by many audiences, but the interpretive guidance about the meaning and appropriate uses of those levels provided to users is inconsistent and piecemeal. Without appropriate guidance, misuses are likely.
CONCLUSION 6-2 Insufficient information is available about the intended interpretations and uses of the achievement levels and the validity evidence that support these interpretations and uses. There is also insufficient information on the actual interpretations and uses commonly made by the National Assessment of Educational Progress’s various audiences and little evidence to evaluate the validity of any of them.
CONCLUSION 6-3 The current achievement-level descriptors may not provide users with enough information about what students at a given level know and can do. The descriptors do not clearly provide accurate and specific information about the things that students at the cut score for each level know and can do.
Setting New Standards: Considerations
The committee recognizes that the achievement levels are a well-established part of NAEP, with wide influence on state K-12 achievement tests. Making changes to something that has been in place for more than 24 years would likely have a range of consequences that cannot be anticipated. We also recognize the difficulties that might be created by setting new standards, particularly the disruptions that would result from breaking users’ interpretations of the trends. We also note that during their 24 years they have acquired meaning for NAEP’s various audiences and stakeholders: they serve as stable benchmarks for monitoring achievement trends, and they are widely used to inform public discourse and policy decisions. Users regard them as a regular, permanent feature of NAEP reports.
To date, the descriptors for grade-12 mathematics and grade-4, -8, and -12 reading have been revised and updated as recently as 2009, but no changes have been made to the descriptors for mathematics in grades 4 and 8 since 2004.
We considered several courses of action, ranging from recommending no changes to recommending a new standard setting. We concluded that most of the strongest criticisms of the current standards—and the argument for completely new standards—can be addressed instead by revision of the ALDs.
CONCLUSION 7-1 The cut scores for grades 4 and 8 in mathematics and all grades in reading were set more than 24 years ago. Since then, there have been many adjustments to the frameworks, item pools, assessments, and achievement-level descriptors, but there has been no effort to set new cut scores for these assessments. Although priority has been given to maintaining the trend lines, it is possible that there has been “drift” in the meaning of the cut scores such that the validity of inferences about trends is questionable. The situation for grade-12 mathematics is similar, although possibly to a lesser extent since the
cut scores were set more recently (in 2005) and, thus far, only one round of adjustments has been made (in 2009).2
CONCLUSION 7-2 Although there is evidence to support conducting a new standard setting at this time for all grades in reading and mathematics, setting new cut scores would disrupt the National Assessment of Educational Progress trend line at a time when many other contextual factors are changing. In the short term, the disruption in the trend line could be avoided by continuing to follow the same cut scores but ensuring the descriptions are aligned with them. In particular, work is needed to ensure that the mathematics achievement-level descriptors (ALDs) for grades 4 and 8 are well aligned with the framework, cut scores, and item pools. Additional work to evaluate the alignment of the items and the ALDs for grade-4 reading and grade-12 mathematics is also needed. This work should not be done piecemeal, one grade at a time; rather, it should be done in a way that maintains the continuum of skills and knowledge across grades.3
The committee’s charge included the question of reasonableness. NAEP and its achievement levels loom large in public understanding of critical debates about education, excellence, and opportunity. One can fairly argue that The Nation’s Report Card is a success for that reason alone. Through 25 years of use, the NAEP achievement levels have acquired a “use validity” or reasonableness by virtue of familiarity.
In the long term, we recommend a thorough revision of the ALDs that is informed by a suite of education, social, and economic outcomes important to key audiences. We envision a set of descriptions that correspond to a few salient outcomes, such as college readiness or international comparisons. The studies we recommend, however, would also offer ways to characterize other scale score points. This information should be available to the public along with test item exemplars. The more audiences understand the scale scores, the less likely they are to misuse the achievement levels.
Setting new cut scores at this time, when so many things are in flux,
would likely create considerable confusion about their meaning. We do not encourage a new standard setting at this time. However, we note that at some point the balance of concerns will tip in favor of a new standard setting. There will be evolution in the methodology, the assessment frameworks, the technology of test administration and hence the nature of items, and more. We suggest that the U.S. Department of Education announce an intention to revisit this issue after a specified number of years. We offer specific recommendations below.
RECOMMENDATION 1 Alignment among the frameworks, item pools, achievement-level descriptors, and cut scores is fundamental to the validity of inferences about student achievement. In 2009, alignment was evaluated for all grades in reading and for grade 12 in mathematics, and changes were made to the achievement-level descriptors, as needed. Similar research is needed to evaluate alignment for the grade-4 and grade-8 mathematics assessments and to revise them as needed to ensure that they represent the knowledge and skills of students at each achievement level. Moreover, additional work to verify alignment for grade-4 reading and grade-12 mathematics is needed.4
RECOMMENDATION 2 Once satisfactory alignment among the frameworks, item pools, achievement-level descriptors, and cut scores in National Assessment of Educational Progress mathematics and reading has been demonstrated, their designation as trial should be discontinued. This work should be completed and the results evaluated as stipulated by law5 (20 U.S. Code 9622, National Assessment of Educational Progress: https://www.law.cornell.edu/uscode/text/20/9622 [November 2016]).
RECOMMENDATION 3 To maintain the validity and usefulness of achievement levels, there should be regular recurring reviews of the achievement-level descriptors, with updates as needed, to ensure that they reflect both the frameworks and the incorporation of those frameworks in National Assessment of Educational Progress assessments.6
RECOMMENDATION 4 Research is needed on the relationships between the National Assessment of Educational Progress (NAEP) achievement levels and concurrent or future performance on measures external to NAEP. Like the research that led to setting scale scores that represent academic preparedness for college, new research should focus on other measures of future performance, such as being on track for a college-ready high school diploma for 8th-grade students and readiness for middle school for 4th-grade students.7
RECOMMENDATION 5 Research is needed to articulate the intended interpretations and uses of the achievement levels and to collect validity evidence to support these interpretations and uses. In addition, research is needed to identify the actual interpretations and uses commonly made by the National Assessment of Educational Progress’s various audiences and to evaluate the validity of each of them. This information should be communicated to users with clear guidance on substantiated and unsubstantiated interpretations.8
RECOMMENDATION 6 Guidance is needed to help users determine inferences that are best made with achievement levels and those best made with scale score statistics. Such guidance should be incorporated in every report that includes achievement levels.9
A number of aspects of the NAEP reading and mathematics assessments have changed since 1992: the constructs and frameworks; the types
of items, including more constructed-response questions; the ways of reporting results; and the addition of innovative Web-based data tools. NAEP data have also been used in new ways over the past 24 years, such as reporting results for urban districts, including NAEP in federal accountability provisions, and setting academic preparedness scores. New linking studies have made it possible to interpret NAEP results in terms of the results of international assessments, and there are possibilities for linking NAEP 4th- and 8th-grade results to indicators of being on track for future learning. Although external to NAEP, major national initiatives have significantly altered state standards in reading and mathematics.
These and other factors imply a changing context for NAEP. Staying current with contemporary practices and issues and maintaining the trend line for NAEP results are competing goals.
RECOMMENDATION 7 The National Assessment of Educational Progress (NAEP) should implement a regular cycle for considering the desirability of conducting a new standard setting. Factors to consider include, but are not limited to, substantive changes in the constructs, item types, or frameworks; innovations in the modality for administering assessments; advances in standard setting methodologies; and changes in the policy environment for using NAEP results. These factors should be weighed against the downsides of interrupting the trend data and information.10