Beaton (1990, p. 165) coined the phrase “When measuring change, do not change the measure.” This maxim is a guiding rule for NAEP. Since NAEP’s primary purpose is to track student progress over time, maintaining the integrity of the trend line is essential. But education is not static: content areas mature, curricula and teaching strategies evolve, assessment methods adapt, and the policy contexts for using assessments change as priorities change.
Staying current, while also maintaining the stability of a trend line, is difficult. In recognition of this tension, NAEP established two different programs, main NAEP and long-term trend NAEP, as described in Chapter 1. Main NAEP adjusts as needed so that it reflects current thinking about content areas, assessment strategies, and policy priorities, but efforts are made to incorporate these changes in ways that do not disrupt the trend line. In contrast, long-term trend NAEP provides a way to examine achievement trends with a measure that does not change; these assessments do not incorporate advances in content areas or assessment strategies.
There have been numerous changes in main NAEP since the reading and mathematics achievement levels were first set in 1992. The frameworks for main NAEP have been adjusted to reflect a variety of developments: new understandings of the content areas that NAEP measures; changing thinking in the field about the content; modifications in the proportions of questions that assess certain strands or specific skills; and changes in response formats and in how students demonstrate what they know and can do (e.g., from choosing among provided options in a selected-response format to composing one’s own answer in a constructed-response format).1
At the same time, NAEP data have been used in new ways, such as reporting results for urban districts; including NAEP in federal accountability provisions such as the No Child Left Behind Act (NCLB); and establishing college readiness benchmarks. New linking studies allow NAEP results to be interpreted in terms of the results of international assessments, with possibilities for linking NAEP 4th- and 8th-grade results to determine whether students’ progress is on track for future learning (see Chapter 5). Major national initiatives external to (but connected with) NAEP have significantly altered states’ standards in reading and mathematics: these include NCLB and the not-yet-settled transition to the Common Core State Standards in mathematics and English-language arts. The ways of reporting results have changed, and innovative Web-based data tools have been designed.
NAGB and NCES are now exploring a transition to computer-based assessment for all of NAEP. As part of this process, they are conducting research to evaluate the extent to which the mode of presentation changes the constructs being measured.
Changes in the constructs of reading or mathematics represent a threat to one of NAEP’s primary purposes: to track trends in achievement over time. NAEP has always worked hard to maintain the constructs so that changes in NAEP scores can be attributed to changes in achievement and not to changes in content or construct. When changes seem necessary, NAEP has to make choices, balancing options that will allow the assessment to remain current with contemporary thinking while minimizing the disruptions to the trend line. Since the initial standard settings for the 1992 assessments, there have been three instances when changes to the frameworks have prompted research and discussion about the integrity of the trend line.2 There is also a pending decision about digitally based assessments.
In this chapter, we discuss these changes and how they were handled. We review the decisions that were made, their rationale, and their consequences. We consider this information in relation to our overall recommendations for achievement-level reporting.
1 NAEP frameworks reflect the ideas of many individuals and organizations involved in reading or mathematics education, including researchers, policy makers, teachers, business representatives, and representatives of the public. See Chapter 1 for a more detailed discussion of the framework.
2 Those changes were made to grade-12 mathematics in 2005, to grade-12 mathematics in 2009, and all grades in reading in 2009.
When the framework or mode of administration changes, there are three possible options for the main NAEP trend line:
- Establish a new trend line: The content and construct change so much that continuing the trend line is impossible. In this scenario, the old trend line would be discontinued at the next operational assessment and a new trend line begun.
- Gradually transition to a new trend line: There is a change in the construct being measured, but the old and new constructs are fairly well correlated. In this case, one option is to separate the trend into two lines for a few operational assessments: one driven more by the traditional items and formats and one driven more by the new items, formats, and content. After a few assessments, the old line could be retired, and the new trend line would take over. Short-term comparisons on either trend line would make sense in the overlap years, but longer-term comparisons crossing the overlap years would be more tenuous.
- Maintain the existing trend line: There is little to no change from the old construct to the new construct. In this scenario, the trend line is continued without concern.
Every time substantive changes are made to the framework for main NAEP, the National Assessment Governing Board (NAGB) must investigate whether a new standard setting is called for. For each assessment, the achievement levels were set with particular content, represented by a construct and a pool of items. Thus, in parallel with the discussion of maintaining trend lines, each of the possible scenarios has implications for maintaining the achievement levels:
- Conduct a new standard setting: If the content and constructs change so much that a trend is broken, then the old achievement levels will be meaningless at best and misleading at worst on the new scale, and new achievement levels will be needed.
- Maintain the cut scores, but revise the achievement-level descriptors (ALDs): If the old and new constructs are different but highly correlated, changing the achievement levels may not be urgent, but it is still necessary. To the extent that the new items assess content and skills that were not considered when the previous achievement levels were set, the old achievement levels will no longer reflect performance as represented by the new item and task pool. In this scenario, it may be possible to rework the ALDs without resetting the cut scores.
- Make no changes: If the old and new constructs are different but statistically indistinguishable, it may be possible to make no changes.
Revisions to the NAEP framework for mathematics reflected changes in curriculum and goals for mathematics education. Even though the NAEP framework is an assessment framework and not a curriculum framework, changes in curriculum and goals necessitate changes in the assessment framework.3 The curricular changes and goals related to mathematical proficiency are described in various publications, including Adding It Up (National Research Council, 2001) and the 1989 and 2000 standards of the National Council of Teachers of Mathematics (NCTM).4,5 To understand the reasoning behind these curricular changes, we first present some history; we then turn to the framework changes and how they were handled.
Efforts to reform U.S. K-12 mathematics go back many decades. They accelerated in 1957, partly in response to the U.S.S.R. launch of Sputnik. As recounted much later (Conference Board of the Mathematical Sciences, 2001, p. 4):
The school reform efforts of the 1960s and 1970s had a number of long-lasting influences, such as broadening elementary school mathematics beyond arithmetic to include some geometry and elements of algebra, and refocusing high school mathematics by downplaying analytic geometry and trigonometry and giving more attention to functions and providing an introduction to calculus. However, this period is most remembered for the New Math movement’s theoretical approach. This approach was widely rejected, leaving school mathematics reform efforts on the defensive for many years to come.

3 An assessment framework is the guide to an assessment. A framework delineates the aspects of a given construct or content area (e.g., mathematics, reading) to be assessed and the relative emphasis to be placed on each topic at each grade level. That is, the NAEP framework suggests the mix of items in each content strand for each grade. It also suggests the proportional mix of item formats—multiple choice, short-answer constructed response, and extended-answer constructed response—to be included at each grade level. The framework relies on definitions of the constructs as they exist in the field: for example, how is reading comprehension defined by reading researchers, curriculum specialists, and educators? How is mathematical reasoning defined by mathematics researchers, curriculum specialists, and educators? (see Chapter 1 for a related discussion).
4 See http://www.nctm.org/Standards-and-Positions/Principles-and-Standards/ [January 2016].
The “new math” attempted to build mathematics learning from first principles. For example, arithmetic was taught by beginning with the axioms and properties of set theory. Two circumstances are cited as contributing to the lack of success of the new math. First, parents found the material unfamiliar and unlike anything from their own schooling. Second, teachers were asked to teach something they did not understand (see Hayden, 1981; Kline, 1973; Phillips, 2015).
The rejection of the new math prompted the rise of the “back-to-basics” movement that dominated during the 1970s and early 1980s. Back-to-basics in mathematics emphasized arithmetic computation and rote memorization of algorithms and basic arithmetic facts.
The next change in the field resulted from the desire for more conceptually based content. This was a prime motivation for the standards movement that resulted in the 1989 NCTM Curriculum and Evaluation Standards for School Mathematics, known as the NCTM Standards.6 Even though the movement away from an emphasis on computation and memorization to an emphasis on conceptual understanding and reasoning brought back images of the new math, research on student learning as reflected later in Adding It Up (National Research Council, 2001) supported the change.
The framework for mathematics used to develop the NAEP assessments between 1990 and 2003 was influenced by the 1989 NCTM Standards. In developing the framework, state-, district-, and school-level objectives were considered, as well as the frameworks on which previous NAEP mathematics assessments had been based and a draft version of the NCTM Standards. The result was a “content by mathematical ability” matrix design that was used to guide both the 1990 and 1992 mathematics assessments conducted by NAEP at the national and state levels.7
This framework consisted of five broad strands of mathematics content, three types of mathematical ability, and three concepts of mathematical power.

Content Strands
- numbers and operations,
- measurement,
- geometry,
- data analysis, statistics, and probability, and
- algebra and functions.

Mathematical Abilities
- conceptual understanding,
- procedural knowledge, and
- problem solving.

Concepts of Mathematical Power
- reasoning,
- connections, and
- communication.

6 See http://www.education.com/reference/article/history-mathematics-educationNCTM/ [January 2016].
7 For details of the design, see National Assessment Governing Board (2004).
Beginning in the 1990s, NAEP mathematics assessments placed increasing emphasis on mathematical power. This change was reflected in the 1996, 2000, and 2003 assessments by focusing on reasoning and communication and requiring students to connect their learning across mathematical strands.8
The 1989 NCTM Standards were updated in 2000 and released as Principles and Standards for School Mathematics. Both the 1989 and 2000 documents reflected a view of K-12 mathematics as consisting of content and processes. These two components are incorporated into the 2010 Common Core State Standards as mathematical practices and content standards.
The Common Core Standards are more rigorous than some of the state standards they replaced, reflecting the more rigorous and successful state standards and the standards of the higher-performing countries on international assessments. The Common Core Standards call for more algebra and more probability and statistics in grades 7 and 8. Coherence across grades is emphasized: development of algebraic thinking is designed to begin in the early grades, and preparation for postsecondary study begins well before 12th grade. Reasoning is spread over all content areas rather than being concentrated in geometry.
Framework for Mathematics: 2005
A new framework was adopted for the 2005 NAEP mathematics assessment. Two major curricular changes were reflected in the new framework: increased emphasis on data analysis, statistics, and probability for the 12th-grade assessment and increased emphasis on algebra and functions for both the 8th and 12th grades. In addition, the construct of reasoning was added to the framework for all three grades (4, 8, and 12).

8 See https://nces.ed.gov/nationsreportcard/mathematics/previousframework.aspx [January 2016].
The 2005 framework changed the cognitive dimension used to classify mathematics items for grades 4 and 8. This involved replacing the dimensions of mathematical ability and power (which require making inferences about the student responding to the item) with the dimension of mathematical complexity (which describes the mathematical knowledge expectations with respect to an item). Achievement levels, content areas, overall item types (multiple choice, short-answer constructed response, and extended-answer constructed response), the use of manipulatives, and the calculator policy did not change for grades 4 and 8 in 2005. The changes to the framework for grades 4 and 8 were judged to be minimal, and as a result, the trend line that began in 1990 was not interrupted.10
For 12th grade, the new framework reduced the five broad content strands to four, combining measurement and geometry into one strand. There was also a shift in the relative emphasis on each of the content areas. Measurement and geometry, previously 15 and 20 percent, respectively, became 30 percent for the combined strand; number properties and operations was reduced from 20 to 10 percent; data analysis and probability was increased from 20 to 25 percent; and algebra was increased from 25 to 35 percent (see Table 7-1).
The changes to the 12th-grade framework were judged to be substantive enough that a break in the trend line was deemed necessary, and a new standard setting was warranted. A new standard setting was done using the Mapmark method rather than the Angoff method (see Chapter 3).11 At this time, the ALDs were revised, and the scale score range was changed. The maximum scale score was lowered from 500 to 300, rendering the grade-12 scale different from the scale for grades 4 and 8.
One of the major goals in the creation of the Common Core State Standards was for mathematics to become substantially more focused and coherent across grades as is recommended by many in the field of mathematics education (see Daro et al., 2011; Schmidt et al., 2002, 2005; Watanabe, 2005).12 Toward this end, it is useful to report results in a way that reinforces this conception, such as using the same score scale for each grade and/or having ALDs that build on each other across grades. This was not done in 2005.
10 See https://nces.ed.gov/nationsreportcard/mathematics/frameworkcomparison.aspx [October 2016].
11 Mapmark is a variation of the Bookmark method: for a description, see Schulz and Mitzel (n.d.).
TABLE 7-1 Changes to the Grade-12 NAEP Mathematics Assessment in 2005a

| | 2005 Mathematics Assessment | Previous Mathematics Assessment |
| --- | --- | --- |
| Content areas | Four content areas, with measurement and geometry combined because the majority of grade-12 measurement topics are geometric in nature | Five content areas |
| Distribution of questions across content areas (measurement and geometry) | 30% (combined) | 15% and 20% |
| Reporting scale | 0-300 single-grade scale | 0-500 cross-grade scale |
| Calculators | Students are given the option to bring their own graphing or scientific calculator | Students are provided with a standard model scientific calculator |

aThis table was added after the report was initially transmitted to the U.S. Department of Education; see Chapter 1 (“Data Sources”).
Framework for Mathematics: 2009
The 2005 framework was revisited in 2009, and additional changes were made:
- Objectives for grades 4 and 8 remained the same.
- The new topic of mathematical reasoning was added at grades 4, 8, and 12.
- New objectives for grade 12 were introduced.
- New clarifications and new examples describing the levels of mathematical complexity were added to the framework.
For all grades, there was an increased emphasis on reasoning, especially in content areas other than geometry. For grade 12, the framework was changed to enable NAEP to report on academic preparedness for college. As characterized by NAGB (National Assessment Governing Board, 2015, p. 2):
The goal for this 12th-grade initiative is to enable NAEP to report on how well 12th-grade students are prepared for postsecondary education and training. The challenge was to find the essential mathematics that can form the foundation for these postsecondary paths. Analysis of the 2005 mathematics framework revealed that some revisions would be necessary to meet this challenge.
A study was conducted at grade 12 to compare results based on the 2009 and 2005 mathematics assessment instruments. In this case, the trend line was not disrupted, but the ALDs were revised based on results from an anchor study.
For grades 4 and 8, the revisions were again judged to be minimal (see footnote 10), and only small changes were made to the ALDs: “reason” or “reasoning” appears twice in the expanded explanation of “Proficient” at grade 8 and once in the final sentence of the expanded explanation for “Advanced” at grade 8 (see National Assessment Governing Board, 2014, pp. 72-73). Anchor studies were not conducted for grades 4 and 8.
The framework first adopted for reading in 1992 was in place through 2007. In line with evolving understanding in the field of reading, the framework was changed for the 2009 reading assessment, and that version remains in place.
The current framework conceptualizes reading as an active and complex process that involves
- understanding written text,
- developing and interpreting meaning, and
- using meaning as appropriate to type of text, purpose, and situation.
The last bullet of this definition reflects the changing understanding of reading. Earlier conceptions treated comprehension as an endpoint. It is now conceptualized to include not only a reader’s act of constructing meaning, but also using the meaning that is constructed through reading. That is, one reads both to comprehend and to use what is comprehended for further understanding.
The work of three different groups influenced the definition of reading embodied in the framework. One was from RAND Reading Study Group (Snow, 2002, p. 11):
Reading comprehension [is] the process of simultaneously extracting and constructing meaning through interaction and involvement with written language. It consists of three elements: the reader, the text, and the activity or purpose for reading.
Another was from the Progress in International Reading Literacy Study (Ogle et al., 2003, p. 3):
The ability to understand and use those written forms required by society and/or valued by the individual. Young readers can construct meaning from a variety of texts. They read to learn, to participate in communities of readers, and for enjoyment.
The third was from the Programme for International Student Assessment (OECD, 2009, p. 23):
[Reading literacy is] understanding, using, and reflecting on written texts, in order to achieve one’s goals, to develop one’s knowledge and potential, and to participate in society.
Among the changes to the 2009 framework were the following:
- Its design was based on current scientific research in reading.
- It was adapted to be consistent with NCLB.
- The content and preliminary achievement standards at grade 12 embody reading and analytical skills judged to be needed for rigorous college-level courses and other productive postsecondary endeavors.
- In preparing the framework, extensive use was made of international reading assessments and exemplary state standards.
- For the first time in NAEP, vocabulary was measured explicitly.
- Poetry was assessed in grade 4 as well as in grades 8 and 12.
- Multiple-choice and constructed-response items (both short and extended) were included at all grades. In grades 8 and 12, about 60 percent of the assessment time was focused on constructed-response questions; at grade 4, about 50 percent.
The changes to the framework were foundational and required decisions about the extent to which a new standard setting, revisions to the ALDs, or other changes were needed. To inform these decisions, NAGB conducted an anchor study to evaluate alignment between the item pool and the ALDs. NCES conducted two types of studies: a content alignment study that compared the two reading frameworks and item pools and an analysis that examined the extent to which the score distribution would likely change with the use of the new framework and new items. The results from these studies led to the conclusion that the scores were sufficiently similar to continue using the same trend line. The decision probably also reflected the critical importance of maintaining the trend line for reading. Reading is perceived as the most powerful indicator of other achievement-related outcomes, including school success and completion and postsecondary success, and it is used by many states as a critical marker to identify areas for educational policy and practice. The anchor study led to changes in the ALDs but did not trigger a new standard setting.
NAEP is in the process of transitioning all subjects to digitally based (computer-based) assessment. In mathematics and reading, field testing began in 2015, with the goal of administering digitally based assessments in those subjects in 2017.
Although moving old and new items to a digital format is not intended to change the content or skills being measured, it is possible that some changes will occur. A compelling use of the computer-based format is to develop new item types that allow for more authentic task performance, richer data collection, and access to aspects of performance, such as problem solving, that are a part of the NAEP frameworks but are difficult to measure with traditional items. New item types—such as scenario-based tasks and hybrid hands-on tasks—reveal the process by which students solve problems, recording the sequence of students’ interactions with real or simulated environments. Being able to see the process can be particularly illuminating about students’ skills in areas that are not well measured by paper-and-pencil tests. For example, the digitally based science assessment administered in 2009 revealed serious deficits in students’ abilities to design experiments, reason from data, and perform other problem-solving tasks in science.15
With new item types that are intended to measure different content, there is concern that the construct being assessed will shift: indeed, to the extent that the content represents aspects of the assessment framework that have previously been difficult to measure, the construct should be expected to shift. For example, digitally based assessment of reading alters the act of reading in ways ranging from keyboarding rather than turning pages to navigating electronic texts, which may (or may not) place different demands on readers’ memory and attention systems.

15 See https://www.nagb.org/newsroom/naep-releases/science-hots-icts.html [November 2016].
Research on differences between reading in traditional and electronic forms is not conclusive, nor is research on differences between testing in traditional and electronic forms. Thus, ongoing scrutiny of the evolving relationship among constructs, ALDs, and NAEP assessments as experienced by students is warranted.
Since 1992, there have been many changes to the assessments of mathematics and reading in the main NAEP. For reading, changes in the framework for the 2009 assessment led to adjustments in ALDs, but not in the cut scores.
For mathematics, the move to measure reasoning skills led to adoption of a new framework for the 2005 assessment. For the grade-12 assessment, disrupting the trend line was deemed unavoidable: new ALDs were developed, and new cut scores were set. The range for the score scale was also changed, from a top score of 500 to a top score of 300. However, commensurate changes in the score scales for 4th- and 8th-grade mathematics were not made, and the ALDs were not revised. Four years later, the decision to measure academic preparedness led to adjustments in the 12th-grade framework for both reading and mathematics. The ALDs were updated but new cut scores were not set.
The changes to 12th-grade mathematics occurred in the context of the push for “21st-century skills” and the desire for students to graduate from high school ready for college or the workplace. The changes in the reading framework were, in part, designed to be responsive to the conception of reading specified in NCLB. The transition to digitally based testing allows for deeper assessment of skills that are increasingly valued, such as problem solving, critical thinking, and reasoning from data.
Changes in the policy context and priorities for student learning have changed NAEP from being a thermometer used to describe progress, then to a role model to set aspirations for progress, and then to an accountability lever to ensure that states adopt its aspirations and make progress toward them.
The number and extent of changes since 1992 would certainly support a recommendation to conduct new standard settings to reset the cut scores and revise the ALDs. Thus far, changes have been handled in a piecemeal fashion: new cut scores and ALDs for grade-12 mathematics in 2005; revised ALDs for grade-12 mathematics in 2009, but no changes for grades 4 and 8 since 1992; revised ALDs for all grades in reading in 2009, but no resetting of cut scores since 1992.
The committee recognizes that the achievement levels—although labeled as developmental and provisional—are a well-established part of NAEP with wide influence on state K-12 achievement tests. Making changes to something that has been in place for more than 24 years would likely have many consequences. We understand the difficulties that might be created by setting new standards, particularly the disruptions that would result from breaking trend-line information. We think this would be unwise at a time when so many other things are in flux—many states are transitioning to the Common Core State Standards and implementing the associated assessments; many other states are transitioning to similar standards and assessments; and NAEP is moving toward digital assessment with new item types. Moreover, Congress recently reauthorized the Elementary and Secondary Education Act (the Every Student Succeeds Act); it is not yet known what effects this may have on NAEP’s role in relation to state achievement tests.
The committee considered several courses of action, ranging from making no changes to conducting a completely new standard setting. For the past 24 years, the country has been tracking progress on three cut scores that are labeled and defined through the ALDs. The descriptors associated with these points can be revised and updated without conducting a completely new standard setting, as was done for reading and grade-12 mathematics in 2009. It is important that the descriptors be well aligned with other parts of the system—the framework, items, and cut scores. Disruption in the trend line could be avoided by continuing to follow the same cut scores but revising the descriptions of them.
Weighing the options, we conclude that most of the significant arguments in favor of setting new standards can instead be addressed by revising the ALDs. Furthermore, we echo the recommendations in the report NAEP: Looking Ahead, Leading Assessment to the Future (National Center for Education Statistics, 2012), which calls for achievement levels to be shifted “to the background” (p. 33) and supplemented by other methods for enhancing public understanding of score scale results.
The committee does not believe that the major effort required for a completely new standard setting would be productive at this time, when there are other important lines of research to pursue. In particular, we encourage research to define cut points or benchmarks that are linked to external criteria, such as gradations of readiness for college and the workplace. Increasingly, policy makers and the public look to NAEP to define and track performance on specific benchmarks, such as college and workplace readiness or global competitiveness. Some of these benchmarks have been designed, and some are under way.
Other benchmarks that the committee judges would be valued in this policy context include one for the 8th-grade assessments that flags the likelihood of a college- and career-ready high school diploma; one for the 4th-grade assessments that measures readiness for 5th grade; and one that measures progress toward being among the top 5 or 10 countries on the Trends in International Mathematics and Science Study (TIMSS) or the Programme for International Student Assessment (PISA). These benchmarks can be established independently of the achievement levels.
CONCLUSION 7-1 The cut scores for grades 4 and 8 in mathematics and all grades in reading were set more than 24 years ago. Since then, there have been many adjustments to the frameworks, item pools, assessments, and achievement-level descriptors, but there has been no effort to set new cut scores for these assessments. Although priority has been given to maintaining the trend lines, it is possible that there has been “drift” in the meaning of the cut scores such that the validity of inferences about trends is questionable. The situation for grade-12 mathematics is similar, although possibly to a lesser extent because the cut scores were set more recently (in 2005) and, thus far, only one round of adjustments has been made (in 2009).17
CONCLUSION 7-2 Although there is evidence to support conducting a new standard setting at this time for all grades in reading and mathematics, setting new cut scores would disrupt the NAEP trend line at a time when many other contextual factors are changing. In the short term, the disruption in the trend line could be avoided by continuing to follow the same cut scores but ensuring the descriptions are aligned with them. In particular, work is needed to ensure that the mathematics achievement-level descriptors (ALDs) for grades 4 and 8 are well aligned with the framework, cut scores, and item pools.
Additional work to evaluate the alignment of the items and the ALDs for grade-4 reading and grade-12 mathematics is also needed. This work should not be done piecemeal, one grade at a time; rather, it should be done in a way that maintains the continuum of skills and knowledge across grades.18