| Copyright © 2009. National Academy of Sciences. All rights reserved. Terms of Use and Privacy Statement |
7
Issues in Phasing Out Trend NAEP
Michael J. Kolen
This paper considers ways in which the long-term trend National Assess-
ment of Educational Progress (NAEP) can be phased out and replaced by the
main NAEP while still maintaining a long-term trend line. Relevant history of
NAEP is presented with a focus on those aspects that led to separating long-term
trend NAEP and main NAEP. Differences between the two assessments are
discussed, including differences in content, operational procedures, examinee
subgroup definitions, analysis procedures, and results. Four designs for assessing
long-term trends with NAEP are considered. Evaluation of these designs addresses
how their implementation would affect main NAEP and the assessment of long-
term trends. The paper concludes with recommendations for research and recom-
mendations for the designs that should receive further consideration.
The recommendations focus on two designs. In one promising design, long-
term trends are monitored with main NAEP, and overlapping main NAEP assess-
ments are used whenever an assessment is modified. Implementation of this
design requires extensive research. Because long-term trends are assessed with
main NAEP in this design, modifications of main NAEP to reflect curricular
changes must be tightly constrained. In another promising design, a separate
long-term trend assessment is used that is periodically updated. This design can
continue to provide long-term trends without an extensive research program. It
also allows for main NAEP to change, as necessary, to reflect curricular changes.
Drawbacks of this second design are that it requires continuing both the main
NAEP and the long-term trend NAEP programs and it allows for only small
changes in long-term trend NAEP.
INTRODUCTION
NAEP "is mandated by Congress to survey the educational accomplishments
of U.S. students and to monitor changes in those accomplishments" (Ballator,
1996:1). Originally, NAEP surveyed educational accomplishments and long-
term trends with a single assessment. Because of continual changes in the assess-
ments, NAEP has evolved into a collection of state and national assessments.
The main NAEP is designed to be flexible enough to adapt to changes in assess-
ment approaches. The long-term trend NAEP is intentionally constructed and
administered to be stable so that trends in student performance can be examined
over time. Whereas both main NAEP and long-term trend NAEP focus on
assessing achievement for the nation and for various subgroups of students, state
NAEP, which is the most recent addition to NAEP, focuses on achievement of
students by state. Main NAEP and long-term trend NAEP have distinct assess-
ment exercises and administration procedures.
The National Assessment Governing Board (NAGB) oversees policy for the
NAEP program and has called for NAEP to be redesigned (NAGB, 1996). One
of NAGB's concerns involves the apparent inefficiency in continuing to maintain
both main NAEP and long-term trend NAEP. To address this concern, NAGB
(1996:10) has stated that "it may be impractical and unnecessary to operate two
separate assessment programs. . . . A carefully planned transition shall be
developed to enable 'the main National Assessment' to become the primary way
to measure trends in reading, writing, mathematics, and science in the National
Assessment program." Many individuals and committees have expressed con-
cern that the transition suggested by NAGB might result in losing the currently
available long-term trends (e.g., Jones, 1996; Glaser et al., 1996, 1997; National
Research Council, 1996). In response to this concern, NAGB no longer plans to
use main NAEP as the primary way to measure trends; however, there might be
inefficiencies in having two programs.
This paper was commissioned by the National Research Council to discuss
ways in which long-term trend NAEP could be phased out and replaced by the
main NAEP assessments while still maintaining a long-term trend line. One
significant question to be addressed is the following: How can a single assess-
ment be developed that is stable enough to provide long-term trends while still
being flexible enough to adapt to changes in assessment approaches? Another
significant question is: How can such an assessment be implemented without
losing the current long-term trend line?
The history of NAEP is considered with a focus on those aspects that led to
separating long-term trend NAEP and main NAEP. Those aspects include
changes to the NAEP purpose with the first redesign in the mid-1980s and prob-
lems that were encountered in measuring trends with the redesigned assessment,
such as those involving the NAEP reading anomaly (Beaton and Zwick, 1990;
Zwick, 1991). Relevant components of the current redesign effort are summarized,
and characteristics of the current main NAEP and long-term trend NAEP
assessments are compared on their content and administration procedures. This
comparison facilitates a discussion of how the two assessments might be replaced
by a single assessment.
Different designs for assessing long-term trends with NAEP are discussed
next. These designs include ones that involve overlapping trend lines, such as
those suggested by Glaser et al. (1997) and Forsyth et al. (1996). The evaluation
of these designs includes considering how their implementation would affect the
measurement of long-term trends as well as the effect on the main NAEP assess-
ments. The paper concludes with recommendations for research to further evalu-
ate the different design possibilities, along with recommendations about which
designs should receive further consideration.
RELEVANT NAEP HISTORY
Jones (1996) presented a history of NAEP with a focus on procedural changes
that occurred at various stages of its evolution. These stages include the original
development of NAEP in the early 1960s, the first operational NAEP in 1969, the
first redesign in the early 1980s, and the current redesign effort. The portions of
Jones's discussion that are relevant to the relationship between main NAEP and
long-term trend NAEP are summarized here. The original "goals of NAEP were
to report what the nation's citizens know and can do and then to monitor changes
over time" (Jones, 1996:15). From the beginning, NAEP was intended to be a
group-level assessment in which scores were not reported for individuals. How-
ever, significant changes have occurred in the assessment over time.
Originally, performance was reported exercise by exercise, but by the time of
the first redesign it was being reported on groups of exercises, often by objective.
Following the first redesign, exercises were scaled using item response theory
(IRT) procedures, and average scale scores were reported instead of percentages
correct by exercise or groups of exercises.
In the initial assessments, matrix sampling procedures were used in which
different sets of exercises were given to different examinees. With these proce-
dures, exercises were read aloud using tape-recorded presentations that minimized
the effects of reading ability and served to pace the presentation of exercises to
examinees. With the first redesign, a more efficient sampling design was used in
which examinees in a given room were administered different sets of exercises,
which resulted in elimination of tape-recorded presentations.
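The room-level design described above, in which examinees sitting together receive different sets of exercises, can be illustrated with a minimal sketch. It assumes a simple rotation through the available booklets; the function name, roster, and booklet labels are hypothetical, and the sketch is not a description of the actual NAEP assignment procedure.

```python
from collections import Counter


def rotate_booklets(student_ids, booklet_ids):
    """Assign booklets to students in rotating order (hypothetical helper).

    Adjacent students in the roster receive different booklets, and each
    booklet is used about equally often.
    """
    return {
        student: booklet_ids[i % len(booklet_ids)]
        for i, student in enumerate(student_ids)
    }


# Hypothetical roster for one room and hypothetical booklet labels.
room = [f"student_{i}" for i in range(1, 11)]
booklets = ["B1", "B2", "B3"]

assignment = rotate_booklets(room, booklets)
counts = Counter(assignment.values())

for student, booklet in assignment.items():
    print(student, booklet)
print(dict(counts))
```

Because no two neighboring examinees work on the same exercises at the same pace, a single tape-recorded presentation for the whole room is no longer possible, which is consistent with its elimination in the redesign.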
In the initial assessments, nearly everyone of ages 9, 13, and 17 was included
in the sampling frame, but by the time of the redesign only individuals who were
in school and of ages 9, 13, and 17 were assessed. Following the redesign, school
grades 4, 8, and 12 replaced ages 9, 13, and 17 as the primary basis for sampling
and reporting. Also, the procedures used for classifying individuals into popula-
tion groups differed considerably over time (Barron and Koretz, 1996).
Jones (1996:17) reported that the content of the assessments began to change
after the redesign, and "as curricular reform took center stage, NAEP began to be
viewed as an agent for change. Exercises began to focus on desired curricula
rather than on curricula already in place." In addition, Jones speculated that the
use of the IRT scaling procedures following the redesign affected the content of
the assessments. Following the redesign, fewer extremely easy or extremely
difficult exercises were chosen for the assessment, so that booklets did not neces-
sarily contain some very easy and very difficult exercises. A greater proportion
of exercises were multiple choice. Also, there was pressure to restrict exercises
to those with unidimensional properties to meet the assumptions of IRT.
To help maintain trend lines, "special 'bridge samples' were maintained
when operational changes were introduced [to NAEP in the 1982, 1984, and 1986
assessments]. For bridge samples, conditions deliberately were kept similar to
those of earlier assessments to appraise change in achievement from earlier assess-
ments" (Jones, 1996:17). With the 1985-1986 assessment the reading achieve-
ment of 9- and 17-year-olds appeared to decline more than a plausible amount
from 1984 to 1986. Upon further study it was found that several changes in
NAEP procedures, rather than actual changes in reading achievement, were
responsible for the decline. The apparent decline in reading achievement is now
known as the NAEP reading anomaly (Beaton and Zwick, 1990; Zwick, 1991).
Because of these problems, the main NAEP and long-term trend NAEP
programs were separated following the 1985-1986 NAEP assessment. Main
NAEP is allowed to adapt to changes in assessment approaches. Attempts are
made to track short-term trends with main NAEP only when procedures are
comparable from one assessment to the next. Since 1985-1986, long-term trend
NAEP has been similar to the assessments of earlier years, using the same booklets, admin-
istration procedures, and definitions of examinee groups. Long-term trend NAEP
has allowed for tracking of important trends by "studiously maintaining condi-
tions of assessment that are sufficiently comparable over time to provide valid
evidence about achievement change" (Jones, 1996:18).
The main NAEP and long-term trend NAEP assessments are not designed to
produce state-level data. In 1990, 1992, and 1994 voluntary trial state NAEP
assessments were conducted that produced state-level data to compare states to
one another and to the nation. These assessments were considered to be trial
assessments because of concerns about their usefulness. Potential benefits of
state-level NAEP data have been summarized by Phillips (1991) and potential
problems by Koretz (1991). The National Academy of Education (1993) panel
that evaluated trial state NAEP recommended that it be continued but with
ongoing evaluation and congressional oversight. In 1996 the term trial was
removed from the title, and the assessments are now referred to as state NAEP.
The state NAEP assessments use representative subsets of main NAEP book-
lets. The two programs differ in administration procedures and other operational
procedures, such as who is included in the assessments. In addition, state NAEP
assesses only fourth- and eighth-grade students. Although NAGB (1996) has
considered combining the state NAEP and main NAEP assessments to increase
efficiency, these differences make combining them challenging (Mullis, 1997;
Rust, 1996; Spencer, 1996). The use of state NAEP likely will increase pressure
for changing the assessment's content because a wider group of people in states
and school districts have a stake in NAEP. For this reason and because of
operational complexities, a decision to combine state NAEP and main NAEP
would complicate combining main NAEP and long-term trend NAEP.
DIFFERENCES BETWEEN MAIN NAEP
AND LONG-TERM TREND NAEP
The main and long-term trend assessments administered between 1986 and
1997 and that are planned to be administered after 1997 are summarized in Table
7-1. As is evident from this table, main NAEP covers many more subject areas
than long-term trend NAEP. From 1988 until the present, long-term trend NAEP
has used nearly the same procedures and exercises in each assessment. In addition,
there has been sufficient stability in content frameworks and procedures for
long-term trend NAEP to allow for reporting long-term trends as far back as 1970
(Campbell et al., 1997). In contrast, main NAEP assessments have been allowed
to differ from administration to administration so that results on one administra-
tion of main NAEP often are not comparable to those from previous administra-
tions. In addition, the main NAEP and long-term trend NAEP assessments in the
same subject-matter areas differ considerably in assessment content and opera-
tional procedures. Thus, results from the main NAEP and long-term trend NAEP
assessments for the same subject area are not directly comparable.

TABLE 7-1 Main NAEP and Long-Term Trend NAEP Assessments by Year
Since 1986a

Year  Main NAEP                                 Long-Term Trend NAEP
1986  Reading, Mathematics, Science,            Reading, Mathematics, Science
      Computer Competence
1988  Reading, Writing, Civics, U.S. History    Reading, Writing, Mathematics,
                                                Science, Civics (ages 13 and 17 only)
1990  Reading, Mathematics, Science             Reading, Writing, Mathematics, Science
1992  Reading, Writing, Mathematics             Reading, Writing, Mathematics, Science
1994  Reading, U.S. History, Geography          Reading, Writing, Mathematics, Science
1996  Mathematics, Science                      Reading, Writing, Mathematics, Science
1997  Arts (grade 8 only)
1998  Reading, Writing, Civics
1999                                            Reading, Writing, Mathematics, Science
2000  Mathematics, Science
2001  U.S. History, Geography
2002  Reading, Writing
2003  Civics, Foreign Language (grade 12 only)  Reading, Writing, Mathematics, Science
2004  Mathematics, Science
2005  World History, Economics
2006  Reading, Writing
2007  Arts                                      Reading, Writing, Mathematics, Science
2008  Mathematics, Science

aAssessments administered from 1986 to 1994 were adapted from Allen et al. (1996); small special-
interest assessments are not shown. Assessments administered from 1996 to 2008 are from NAGB
(1997). Future assessments reflect plans.
Barron and Koretz (1996) have summarized many of the differences be-
tween main NAEP and long-term trend NAEP in content, operational procedures,
examinee subgroup definitions, analysis procedures, and results. Some of their
major findings are discussed here. They reported that the content of the two
assessments is different:
The trend assessments are based on content frameworks that were developed
for the 1983-84 assessments in reading and writing or the 1985-86 assessments
in mathematics and science. Since the development of these frameworks, sub-
stantial changes have occurred in the objectives that content experts believe
teachers should emphasize. The current practice is to make the changes in the
main NAEP assessment called for by content experts and supported by the
National Assessment Governing Board, but to leave the trend assessment frame-
works undisturbed. (Barron and Koretz, 1996:215)
They also reported that the exercise formats for long-term trend NAEP are mainly
multiple choice, whereas main NAEP includes a much larger proportion of con-
structed-response exercises.
They reported that there are also many differences between the two assess-
ments in operational procedures and definitions of examinee subgroups. Main
NAEP oversamples minority populations to allow for relatively precise subgroup
comparisons. Oversampling of minorities is not done with long-term trend NAEP,
which leads to "insufficiently precise" assessment of trends in minority-group
performance (Barron and Koretz, 1996:214). In addition, main NAEP primarily
uses grade-based sampling and reporting at grades 4, 8, and 12, whereas long-
term trend NAEP primarily uses age-based sampling and reporting at ages 9, 13,
and 17. Procedures for identifying minority groups differ in the two assessments.
For example, for race "the variable used in the main assessment, called derived
race because it combines information from multiple sources, gives priority to
student-reported information about race and ethnicity.... The variable used in
the trend assessment, called observed race . . . is simply the exercise admin-
istrator's judgment as to the racial-ethnic background of each student" (Barron
and Koretz, 1996:226). In addition, the main NAEP assessments use a focused
design, in which an examinee is administered exercises from a single subject
area. In long-term trend NAEP an examinee is administered exercises from more
than one subject-matter area. This difference in administration design leads to
each student spending less time on a particular subject area in long-term trend
NAEP than in main NAEP.
Similar analysis procedures are used for the two programs; however, "in
recent years, the main assessment has used a far greater number of background
variables in its conditioning" (Barron and Koretz, 1996:220). Furthermore, differ-
ent score scales are used, which can create difficulties in comparing the two
assessments. Performance levels are used in reporting performance for main
NAEP but not for long-term trend NAEP.
The many differences between the two assessments could influence conclu-
sions about student achievement in the United States, both at a given time and in
trends over time. For example, Barron and Koretz (1996:241-242) have specu-
lated that "trends likely would have been somewhat different if the trend assess-
ment had more closely resembled the current main assessment [in content]," that
"the use of age-defined rather than grade-defined samples appears to be influenc-
ing both the overall trend line and the trends for specific population groups," that
"differences in the method for grouping students into population groups . . . had
major effects on the classification of Hispanic students," and that "overall trends
for populations as a whole might be different if the trend assessment had a mix of
formats more similar to that of the main NAEP assessment."
In certain situations main NAEP has given different results than long-term
trend NAEP. In an example provided by Barron and Koretz (1996), the main
NAEP assessments indicated greater relative gains in writing achievement in
high school than did the long-term trend NAEP writing assessment.
To provide a more recent example, the difference between males' and
females' scores on the 1996 main NAEP science assessment is compared to the
difference on the 1996 long-term trend NAEP science assessment at selected
percentiles. Tables 7-2 and 7-3 provide the results used to make this comparison.
Because the two assessments are reported on different metrics, the differences
were standardized using the semi-interquartile range for the total group,
Q = (P75 - P25)/2. (The standard deviation could not be used to standardize the
differences because it was not reported by O'Sullivan et al. (1997). In addition,
Q may be preferable to the standard deviation for standardizing percentiles
because it is a percentile-based statistic.) As shown in Figure 7-1, the standard-
ized differences are larger on long-term trend NAEP than on main NAEP at all
percentiles and grades. Although it is difficult to determine the cause of this
difference, it is possible that the greater use of multiple-choice exercises on the
long-term trend NAEP assessment than on the main NAEP assessment is partly
responsible.
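The standardization described above can be sketched in a few lines. The example below uses the Grade 8 main NAEP science percentiles printed in Table 7-2 and divides each male-female difference by Q = (P75 - P25)/2 for the total group. The function and variable names are illustrative assumptions, not part of the original analysis.

```python
def semi_interquartile_range(p25, p75):
    """Q = (P75 - P25) / 2 for the total group."""
    return (p75 - p25) / 2.0


def standardized_differences(total, male, female):
    """Male minus female difference at each percentile, divided by Q."""
    q = semi_interquartile_range(total["P25"], total["P75"])
    return {p: (male[p] - female[p]) / q for p in male}


# Grade 8 main NAEP science percentiles as printed in Table 7-2.
grade8_all = {"P10": 104, "P25": 128, "P50": 153, "P75": 174, "P90": 192}
grade8_male = {"P10": 103, "P25": 128, "P50": 154, "P75": 175, "P90": 194}
grade8_female = {"P10": 104, "P25": 128, "P50": 151, "P75": 172, "P90": 190}

result = standardized_differences(grade8_all, grade8_male, grade8_female)
for percentile, value in result.items():
    print(percentile, round(value, 3))
```

Dividing by a statistic of the total group puts the two assessments on a common, unit-free footing, which is what allows differences from the two score scales to be plotted together in Figure 7-1.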
In summary, main NAEP and long-term trend NAEP differ in content, exer-
cise types, subgroup definitions, operational procedures, and analysis procedures.
Although these differences likely affect assessment results, it is difficult to tell
exactly how.
TABLE 7-2 Differences in Selected Percentiles Between Males and Females
in Main NAEP Sciencea

            All    Male   Female   Difference   Difference/Q
Grade 4
  P10       105    105    105          0            .000
  P25       130    130    129          1            .046
  P50       153    154    152          2            .093
  P75       173    175    172          3            .140
  P90       190    191    188          3            .140
Grade 8
  P10       104    103    104         -1           -.043
  P25       128    128    128          0            .000
  P50       153    154    151          3            .130
  P75       174    175    172          3            .130
  P90       192    194    190          4            .174
Grade 12
  P10       104    103    105         -2
  P25       128    129    127          2
  P50       152    155    150          5
  P75       174    178    171          7
  P90       192    196    187          9

aPercentiles are from O'Sullivan et al. (1997); Q = (P75 - P25)/2.
TABLE 7-3 Differences in Selected Percentiles Between Males and Females
in Long-Term Trend NAEP Sciencea

            All      Male     Female   Difference   Difference/Q
Age 9
  P10       174.5    176.5    172.3        4.2
  P25       201.3    202.9    200.0        2.9
  P50       231.0    232.7    229.6        3.1
  P75       258.9    262.1    256.2        5.9
  P90       283.6    286.9    279.0        7.9
Age 13
  P10       205.3    208.9    202.4        6.5          .248
  P25       230.4    233.9    227.6        6.3          .240
  P50       257.7    262.4    253.6        8.8          .335
  P75       282.9    288.6    277.3       11.3          .430
  P90       304.4    309.3    298.4       10.9          .415
Age 17
  P10       235.1    234.0    235.8       -1.8         -.059
  P25       265.9    268.9    263.3        5.6          .182
  P50       298.2    303.9    293.3       10.6          .345
  P75       327.3    333.2    321.7       11.5          .375
  P90       351.7    358.6    344.1       14.5          .472

aPercentiles are from Campbell et al. (1997); Q = (P75 - P25)/2.
[Figure 7-1: three panels (Age 9 or Grade 4; Age 13 or Grade 8; Age 17 or Grade 12).
Horizontal axis: Selected Percentiles (10, 25, 50, 75, 90). Vertical axis: Standardized
Male - Female Difference (-0.10 to 0.50). Series: Long-Term Trend; Main.]
FIGURE 7-1 Standardized male-female differences in 1996 long-term trend NAEP and
main NAEP selected percentiles.
ALTERNATIVE DESIGNS FOR MAIN NAEP AND
LONG-TERM TREND
The design for NAEP involves using main NAEP to assess current achieve-
ment and long-term trend NAEP to monitor trends. Main NAEP is allowed to
change to reflect current thinking in education. Long-term trend NAEP has
remained the same since the mid-1980s; even the same exercises are used from
one long-term trend NAEP assessment to the next. In this section, alternative
designs are discussed for the main NAEP and long-term trend NAEP assessments.
Design 1: Keep the Current Design
One possibility is to continue with the current design, which for long-term
trend NAEP uses the same exercises and operational procedures from one assess-
ment to the next. Even this tightly constrained design runs the risk, over time,
that certain exercises will change in how they function. When such changes
occur, the assessment of long-term trends in proficiency is threatened. As Zwick
(1992:207-208) has pointed out, one "pitfall of preserving portions of the assess-
ment is that, in the case of some items, the relation of item performance to overall
proficiency . . . may be altered because of curricular and societal changes." She
discussed an example from the NAEP science assessment on acid rain that was
included on the 1978, 1982, 1986, and 1988 assessments. Presumably, because
of the increased exposure of the problem of acid rain in the news media, rather
than increases in science proficiency, this exercise became easier. Situations
might also occur in which the content of certain exercises becomes dated, result-
ing in exercises becoming more difficult over time, even though the proficiency
being measured by the assessment does not decrease. Zwick (1992:208) con-
cluded that "an item that remains the same across assessments in a superficial
sense may nevertheless function differently as a measure of proficiency."
In addition, the content of an assessment can become less relevant as a
measure of achievement in current curricula. As curricula change, certain aspects
that are reflected in a particular assessment might come to be emphasized less or
not at all. In addition, new aspects may be introduced that could not possibly
have been included in an earlier assessment. Presumably, these sorts of changes
in curricular emphasis have been behind the frequent changes that have occurred
in main NAEP, which often have made it difficult to measure even short-term
trends with main NAEP.
Goldstein (1983) concluded that it is difficult to separate changes in particular
exercises from changes in the proficiency being measured. He reasoned that, if
certain exercises become easier over time and other exercises more difficult
(which is likely to be the case with almost any assessment over a long enough
period), measuring absolute trends in achievement might not be useful. Due to
these difficulties, Goldstein concluded that, over time, focusing on relative
comparisons would be more useful than focusing on absolute comparisons. For
example, the differences between males and females in science achievement
might be examined to ascertain if the gap is narrowing. Such a comparison could
be made, even if the assessments given at different times are not directly compa-
rable in their content.
Despite the concerns discussed by Goldstein, NAEP has continued to track
what he refers to as absolute trends in achievement. Jones (1996:20) concluded
that "the primary worth of NAEP has been as a monitor of changes in achieve-
ment for the nation." To maintain trend lines with long-term trend NAEP, the
exercises have remained the same. However, for each assessment, analyses are
conducted to ascertain whether the exercises are functioning in the same way as
in previous assessments. Exercises are excluded from long-term trend assess-
ments for reasons that include being very difficult, having poor fit to the IRT
model, and showing large changes in parameter estimates from previous assess-
ments (Allen et al., 1996). Although these procedures can help maintain trend
lines, it can become difficult to separate actual changes in proficiency from
changes in the functioning of particular exercises. Also, as stated previously,
curricula might change so much that the relevance of the long-term trend assess-
ment to current curricula becomes questionable. For these reasons, some changes
in the long-term trend assessments are inevitable if the assessments are to provide
educationally relevant information.
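One of the screening criteria mentioned above, large changes in parameter estimates from previous assessments, can be illustrated with a minimal sketch. It assumes each exercise has a single difficulty estimate in each assessment, treats the median shift across exercises as the overall change, and flags exercises that depart from it by more than a cutoff. The function name, the cutoff of 0.5, and the difficulty values are hypothetical; the source does not state the criterion actually used.

```python
from statistics import median


def flag_drifting_exercises(old_difficulty, new_difficulty, cutoff=0.5):
    """Return exercises whose difficulty shift departs from the typical shift.

    old_difficulty, new_difficulty: dicts mapping exercise id to a difficulty
    estimate from the earlier and later assessment (hypothetical inputs).
    cutoff: hypothetical tolerance, in the same units as the estimates.
    """
    common = sorted(set(old_difficulty) & set(new_difficulty))
    shifts = {k: new_difficulty[k] - old_difficulty[k] for k in common}
    typical_shift = median(shifts.values())
    return [k for k in common if abs(shifts[k] - typical_shift) > cutoff]


# Invented estimates: ex3 becomes much easier, as in the acid rain example.
old_b = {"ex1": -0.50, "ex2": 0.10, "ex3": 0.80, "ex4": 1.20}
new_b = {"ex1": -0.45, "ex2": 0.15, "ex3": -0.20, "ex4": 1.25}

flagged = flag_drifting_exercises(old_b, new_b)
print(flagged)
```

A screen of this kind can only show that an exercise moved relative to the others. It cannot show whether the remaining exercises drifted together, which is the difficulty in separating changes in proficiency from changes in exercise functioning.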
One other concern about long-term trend NAEP is that it does not take into
account recent advances in data analysis procedures, such as the extensive use of
conditioning variables and updated subgroup definitions. To maintain stability,
long-term trend NAEP continues to use procedures developed in the 1980s.
Zwick (1992:206) asked, "How can NAEP maintain continuity while staying
current?" As suggested in this section, addressing Zwick's question should take
into account the content of the assessments, the operational procedures for admin-
istering the assessments, and the societal context in which the assessments are
made. This paper now explores alternative designs that might be used.
Design 2: Periodically Update Long-Term Trend NAEP While
Maintaining Main NAEP
One possible change in the design of NAEP allows for relatively small
modifications in the content of long-term trend NAEP while still maintaining
both long-term trend NAEP and the main NAEP. With this design, main NAEP
could continue to evolve to reflect curricular trends, unimpeded by the necessity
to maintain long-term trends. However, unlike the current design, periodic modest
changes are allowed in the content of long-term trend NAEP in an attempt to
avoid problems associated with "the relation of item performance to overall
proficiency . . . [being] altered because of curricular and societal changes"
(Zwick, 1992:208). In this design the current long-term trend NAEP would
continue to be used, with small modifications allowed, and the operational condi-
tions of the long-term trend NAEP assessment would remain consistent over
time. However, this design allows for replacement of some of the exercises used
in the long-term trend NAEP assessment to avoid many of the problems identi-
fied by Zwick.
This design for long-term NAEP has many similarities to the designs of other
large-scale assessment programs that use alternate forms of assessments for
reasons of security. The ACT Assessment (ACT, 1997) and SAT (Donlon, 1984),
which are used for college admissions purposes, are among the many assess-
ments that use alternate forms. In these assessments, different exercises are used
on each administration. Careful development procedures involving tight specifi-
cations are used to ensure that the alternate forms each measure the same con-
structs in similar ways. Although efforts are made to build alternate forms to be
approximately equal in difficulty, equating procedures are used to adjust for the
small differences in difficulty that are present (Kolen and Brennan, 1995). The
procedures in these assessment programs are used to ensure that scores on the
alternate forms can be used interchangeably regardless of the time at which the
examinee is assessed or the particular alternate form that is administered. Used in
tandem, the assessment development and equating procedures allow for compar-
ing scores and assessing trends, even when completely different assessment
exercises are used at different times. The general concepts of developing alter-
nate forms of an assessment and equating could be used in a new long-term trend
NAEP assessment design.
One difference between NAEP and assessments that routinely use equating
processes is that NAEP uses a set of booklets, with different students adminis-
tered different booklets. This type of design is made possible because group-
level scores are reported, with no scores being reported to individual examinees.
To consider equating processes with NAEP, an alternate form of NAEP assess-
ment is defined as the set of booklets that are administered to examinees in an
assessment. Using this idea, assessment specifications for NAEP are defined at
the level of the set of booklets. To use an equating process with alternate NAEP
forms (i.e., alternate sets of NAEP booklets), content specifications need to be
developed and defined at the level of a set of NAEP booklets. Such specifica-
tions present the content, skills, and exercise types to be included in sufficient
detail to ensure that the alternate NAEP forms measure the same educational
constructs in the same way. Statistical specifications need to be developed so that
the alternate NAEP forms are of nearly the same difficulty.
An equating process for long-term NAEP could involve randomly assigning
students to take old and new assessments. Alternatively, a set of exercises from
a previous assessment could be used as part of the new assessment. If used, this
set of common exercises fully represents the content of the total assessment so
that it serves to link one assessment to the next. By treating sets of NAEP
OCR for page 144
44
ISSUES IN PHASING OUT TREND NAEP
booklets as alternate forms, the procedures for designing equating studies dis-
cussed in Kolen and Brennan (1995) apply.
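The first option described above, randomly assigning students to the old and new assessments, can be illustrated with a minimal sketch of linear equating, one of the standard methods for that design. Because the two groups are randomly equivalent, a new-form score is mapped to the old-form scale by matching means and standard deviations. The scores below are invented, and individual-level scores are used only for illustration; the source does not specify an equating method, and operational NAEP scaling relies on IRT procedures rather than this calculation.

```python
from statistics import mean, pstdev


def linear_equating(new_form_scores, old_form_scores):
    """Return a function mapping new-form scores to the old-form scale.

    Assumes the two score sets come from randomly equivalent groups, so
    differences in mean and spread reflect the forms, not the examinees.
    """
    mean_new, sd_new = mean(new_form_scores), pstdev(new_form_scores)
    mean_old, sd_old = mean(old_form_scores), pstdev(old_form_scores)
    slope = sd_old / sd_new
    intercept = mean_old - slope * mean_new
    return lambda score: slope * score + intercept


# Invented scores for two randomly assigned groups.
old_group = [10, 12, 14, 16, 18]
new_group = [4, 5, 6, 7, 8]

to_old_scale = linear_equating(new_group, old_group)
print(to_old_scale(6))
print(to_old_scale(8))
```

The common-exercise alternative replaces random assignment with a set of shared exercises that carries the link, which is why that set needs to represent the content of the total assessment.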
An equating process could accommodate removing exercises from a long-
term trend NAEP assessment when they become dated or when, as Zwick (1992)
has pointed out, the relationship of exercise performance to overall achievement
changes over time. Also, exercises could be removed if security concerns arise
pertaining to particular exercises on a NAEP assessment. An equating process can
tolerate periodic updating of content as long as the updating does not affect the
constructs being measured. For example, with the ACT assessment, "curriculum
study is ongoing . . . . ACT assessment tests are reviewed on a periodic basis"
(ACT, 1997:4). ACT accommodates some changes to the content of the assess-
ments within the context of the process of equating alternate forms.
The measurement of long-term trends in NAEP using an equating process
depends on developing tight assessment specifications that allow for the develop-
ment of alternate forms of long-term trend NAEP. The specifications should
remain stable over time, with only modest updating of the specifications allowed.
The context in which the common exercises appear needs to be constant from one
assessment to the next, and the operational procedures used for the assessment
need to be preserved from one assessment to the next. In addition, with this
design, sample sizes for minorities should be increased to address the concern
expressed by Barron and Koretz (1996) that assessment of trends for minorities is
not sufficiently precise. One major limitation is that this design cannot directly
accommodate major changes in specifications or frameworks. For example, if
the frameworks for long-term trend NAEP were updated to be much more similar
to those for the current main NAEP, it would not be possible to equate the
resulting long-term trend NAEP to the previous one. In this event, special studies
would be required to link the two assessments if long-term trends were to be
followed from one long-term trend assessment to another.
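The precision concern raised above for subgroup trends can be made concrete with a small calculation. The sketch below assumes simple random sampling and independent samples at the two time points; NAEP actually uses a complex clustered sample, so real standard errors would be larger. The function name `se_of_change` and all numbers are hypothetical.

```python
import math

def se_of_change(sd1, n1, sd2, n2):
    """Standard error of the difference between two independent subgroup means.

    Simplifying assumption: simple random samples, with no design effect
    from clustering or weighting.
    """
    return math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)

# Hypothetical subgroup with a score SD of 30 at both time points.
se_small = se_of_change(30, 400, 30, 400)    # 400 examinees per assessment
se_large = se_of_change(30, 1600, 30, 1600)  # 1,600 examinees per assessment
```

Under these assumptions, quadrupling the subgroup sample size halves the standard error of the estimated change, which is why larger subgroup samples translate directly into more precise trend estimates.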
Design 3: Eliminate Long-Term Trend NAEP and
Use Main NAEP for Trend Assessment
NAEP faces two formidable challenges if long-term trend NAEP is elimi-
nated. First, the existing long-term trend comparisons for NAEP need to be
preserved. As described earlier, main NAEP has evolved substantially and now
is quite different from long-term trend NAEP. A study that links main NAEP to
long-term trend NAEP might be used to preserve trends. Second, if main NAEP
is used to assess trends in NAEP, it needs to be more stable than it has been in the
past. For long-term trends to be preserved when substantial revisions are made to
main NAEP, the revised assessment needs to be linked to the previous ones.
These linking studies are much more challenging to conduct than equating studies
because the assessments differ. In the Mislevy (1992) and Linn (1993) terminol-
ogy, the processes of projection or statistical moderation would be used in these
linking studies. Suggestions for how the data might be collected to conduct these
linkages are described later in this section. The linkages that result from these
processes are considerably weaker than equatings because of the substantial dif-
ferences in the content of the assessments.
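One way to see why such linkages are weaker than equatings is to look at projection in its simplest form, a least-squares prediction of scores on one assessment from scores on another. The sketch below is an illustration under stated assumptions, not the method proposed in this paper: it assumes the same examinees took both assessments, and the function name `projection` and the paired scores are hypothetical.

```python
from statistics import mean

def projection(from_scores, to_scores):
    """Least-squares prediction of `to_scores` from paired `from_scores`."""
    m_from, m_to = mean(from_scores), mean(to_scores)
    cross = sum((a - m_from) * (b - m_to)
                for a, b in zip(from_scores, to_scores))
    spread = sum((a - m_from) ** 2 for a in from_scores)
    slope = cross / spread
    return lambda score: m_to + slope * (score - m_from)

# Hypothetical paired scores on two assessments with different content.
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]

x_to_y = projection(x, y)
y_to_x = projection(y, x)

# Projecting to Y and back does not recover the starting score.
round_trip = y_to_x(x_to_y(5))
```

An equating function is symmetric, so converting a score to the other form and back returns the original value. Here a score of 5 comes back as roughly 4.62, because each projection pulls predictions toward the mean whenever the two assessments are less than perfectly correlated.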
The major differences between the current long-term trend NAEP and main
NAEP assessments that were described by Barron and Koretz (1996) and summa-
rized earlier in this paper present significant challenges to linking these assess-
ments. Along with differences in content and exercise types, these include
differences between the two assessments in operational and analysis procedures.
For example, as discussed earlier, the main NAEP assessment uses "derived
race," whereas the long-term trend assessment uses "observed race." Barron and
Koretz (1996) suggested that differences in subgroup definitions could affect the
classification of examinees to subgroups. Another related issue is that main
NAEP assesses students at fourth, eighth, and twelfth grades, whereas long-term
trend NAEP assesses individuals at ages 9, 13, and 17.
The first step in eliminating long-term trend NAEP using this design is to
estimate the effect of subgroup and age/grade definitions on long-term trend
NAEP. In a single year, long-term trend NAEP would be conducted using both
the current long-term trend NAEP subgroup definitions and the current main
NAEP subgroup definitions. Independent examinee samples could be used for
this study. This linking study estimates the effect of changes in subgroup defini-
tions on long-term NAEP trends. For example, this study estimates what the
long-term trends would have been had "derived race" been used instead of
"observed race" with long-term trend NAEP. Similar estimates are made of long-
term trends for grade groups instead of for the age groups typically used with
long-term trend NAEP.
In a second study, long-term trend NAEP is linked to main NAEP. In the
same year as the first study, the main NAEP assessment could be conducted using
a group of examinees that is independent of the group used in the long-term trend
linking study. This study would be used to adjust for the effects of content
differences, differences in exercise types, and differences in administration con-
ditions (e.g., tape-recorded, paced administrations in long-term trend NAEP as
compared to main NAEP conditions that are self-paced).
The results of these studies could be analyzed in two ways. In one, main
NAEP is placed on the long-term trend NAEP scale, with trends continuing to be
reported on the long-term trend NAEP scale. Following this process, main NAEP
is reported on two scales: the main NAEP scale to report current NAEP perfor-
mance and the long-term trend NAEP scale to report long-term trends. The other
possibility is to place previous NAEP trend assessments on the current main
NAEP scale. This second possibility involves reporting both long-term trends
and current proficiency on a single scale, which might cause less confusion in
assessment interpretation. This design is summarized in Figure 7-2, with the
arrows going from left to right to suggest that the long-term trend NAEP is placed
on the main NAEP scale.
FIGURE 7-2 Studies for linking long-term trend NAEP to main NAEP.
[Figure labels: Linking Study 1; Long-Term Trend NAEP, Original Subgroup
Definitions; Long-Term Trend NAEP, Main NAEP Subgroup Definitions; Linking
Study 2; Main NAEP.]
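The chaining of the two linking studies can be sketched as a composition of transformations. The paper does not specify the functional form of either link, so treating each study's result as a linear transformation is an assumption made only for illustration. The class `LinearLink`, its `then` method, and all coefficients below are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LinearLink:
    """A hypothetical linear link: linked = slope * score + intercept."""
    slope: float
    intercept: float

    def __call__(self, score: float) -> float:
        return self.slope * score + self.intercept

    def then(self, other: "LinearLink") -> "LinearLink":
        """Compose two links: apply this one first, then `other`."""
        return LinearLink(other.slope * self.slope,
                          other.slope * self.intercept + other.intercept)

# Made-up results: study 1 adjusts for the change in subgroup definitions,
# study 2 moves the adjusted value onto the main NAEP scale.
study_1 = LinearLink(slope=1.0, intercept=-2.0)
study_2 = LinearLink(slope=0.5, intercept=30.0)

trend_to_main = study_1.then(study_2)
linked = trend_to_main(250.0)
```

Because the two links are applied in sequence, any error in the first study carries through the second, which is one reason the combined linkage rests on stronger assumptions than a single equating.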
Even if these studies were conducted, certain conceptual issues need resolu-
tion. For example, effects of content differences between the two assessments
are estimated using linking study 2. Implicit in this study is an assumption that
the effects of content differences estimated in the year the study is conducted also
hold for previous years (at least after controlling for year-to-year differences in
distributions of examinees within subgroups). It is possible that substantive
changes in education that occur between assessment cycles could affect the results
of the linking. This assumption could be assessed only by repeating the design
over multiple years. A decision needs to be made about whether interest is in
estimating subgroup differences on main NAEP or subgroup differences on the
previous long-term trend NAEP. If, as implied by Figure 7-2, all NAEP data are
reported on the scale of the current main NAEP assessment, the trends are esti-
mated for the current main NAEP assessment. Clearly, estimating such trends
entails strong statistical assumptions, since main NAEP was not administered in
previous years. As suggested by the results of Barron and Koretz (1996) and the
NAEP science data results presented in Figure 7-1 here, the decisions that are
made could affect the trends reported for various subgroups.
The analyses associated with these designs are complicated, methodology
for analyzing the data and estimating trends needs to be developed, and an exten-
sive research program is required. The research program might be initiated using
data that already exist from years in which main NAEP and long-term trend
NAEP were administered in the same subject-matter area in the same year. How-
ever, only preliminary studies of methodology could be conducted, unless data
exist that allow for assessing the effects of changes in subgroup definitions, as
would be investigated in linking study 1 of Figure 7-2.
For this design, linking studies 1 and 2 are conducted once. According to
NAGB (1996:15), "test frameworks and test specifications developed for the
National Assessment generally shall remain stable for at least ten years." When
major changes are made, however, a linking study, similar to linking study 2 in
Figure 7-2, is needed to link the new main assessment to the previous one.
NAGB (1996:15) also stated that "in rare circumstances, such as where signifi-
cant changes in curricula have occurred, the National Assessment Governing
Board may consider making changes to test frameworks and specifications before
ten years have elapsed." In such circumstances, linking studies are needed more
often than every 10 years.
Linkings such as those described above are much weaker than equatings.
Similar linkings have produced useful results in other assessment programs. For
example, when ACT revised the ACT assessment in 1989, the new version was
linked to the previous one (Brennan, 1989) for the English, mathematics, and
composite scores. The linking was used to maintain trend lines and to help
colleges update cut scores. However, linking studies require strong statistical
assumptions, and it is always possible that the tracking of long-term trends could
be disrupted if the assumptions fail to hold. An extensive research program that
involves development of methodology and empirical research is needed before
NAEP adopts this linking design.
Design 4: Eliminate Long-Term Trend NAEP and
Maintain Two Main NAEPs for Trend Assessment
Zwick (1992) discussed maintaining an old and new main NAEP assessment
for some time whenever the NAEP was substantially revised. Forsyth et al.
(1996) and Glaser et al. (1997) expanded on Zwick's idea and suggested that at
least two main assessments be used at a time so as not to lose trends developed
with the previous assessment. In the design suggested by Forsyth et al., the
different main assessments are linked in some way to help maintain long-term
trends, although they did not describe how to conduct the linking. Compared to
the previous design that uses a single main NAEP assessment, the use of over-
lapping assessments with overlapping trends provides some insurance against
problems with links. If the linking methodology does not work properly, a few
administrations could be used to establish the linkages.
In most other respects, however, this design has the same problems as the
previous one. Main NAEP is still linked to long-term trend NAEP. New main
assessments are still linked to previous main assessments whenever there is a
major change in the assessments. The same sorts of conceptual issues remain,
such as the reporting metric for trends, and how to estimate subgroup trends. As
with the previous design, an extensive research program is needed to study proce-
dures for conducting the linking. Unlike the previous design, this one requires
that multiple assessments be maintained, and it has the potential to create confu-
sion because multiple reporting metrics will be used at any given time.
CONCLUSIONS AND RECOMMENDATIONS
Regardless of which design is used, changes in the context of NAEP con-
tinue to threaten any long-term trend NAEP assessment. For example, if NAEP
were to become a high-stakes assessment, widespread teaching to NAEP might
threaten long-term trend assessment (Zwick, 1992). In a similar vein, Jones
(1996:19) expressed concern that, with the adoption of state NAEP, "if NAEP
materials were to be used for high-stakes assessment at the level of districts or
schools within states, [could] threaten not only the comparability of national and
state results with earlier findings, but also the integrity of findings from any
current assessment." Jones also expressed concern that measurement of NAEP
trends could be made impossible if ways were found to increase student motiva-
tion on NAEP. The proposed Voluntary National Test could have similar effects.
These sorts of changes in the context of NAEP would directly affect main NAEP,
but might not affect a separate long-term trend assessment. Therefore, designs
that assess trends using a separate long-term trend NAEP (Designs 1 and 2
presented here) could be more robust to changes in the context of NAEP than are
the designs that use main NAEP to assess long-term trends (Designs 3 and 4
presented here).
Two questions were posed in the first section of this paper: How can a single
assessment be developed that is stable enough to provide long-term trends while
still being flexible enough to adapt to changes in assessment approaches? How
can such an assessment be implemented without losing the current long-term
trend line? Only two of the four designs discussed address both of these ques-
tions: Design 3: eliminate long-term trend NAEP and use main NAEP for trend
assessment, and Design 4: eliminate long-term trend NAEP, and maintain two
main NAEPs for trend assessment.
Both designs require conducting complex linking studies, making strong
statistical assumptions, and being supported by an extensive research program for
developing linking procedures that work in the NAEP context. The outcome of
this research program is difficult to predict. Possibly, procedures could be devel-
oped that allow for linking assessments as different as long-term trend NAEP and
main NAEP or as different as new and old main NAEP. However, it is also
possible that the results of the research will indicate that changes in main NAEP
assessments need to be much more tightly constrained than is presently the case.
A research program could begin with existing long-term trend NAEP data and
main NAEP data for those years in which the two assessments were administered
during the same year in the same subject areas. However, special data collections
certainly are needed in the process of developing the necessary linking proce-
dures. Although safer than Design 3 in that trends will not be lost as easily
because of the use of overlapping assessments, Design 4 requires that two assess-
ments be maintained.
Another potentially useful alternative is: Design 2: periodically update
long-term trend NAEP while maintaining main NAEP. In this design, main
NAEP is allowed to change, in small ways, to better reflect current curricula. The
design requires that assessment specifications be developed to ensure that the
alternate forms of long-term trend NAEP measure the same constructs in similar
ways. It improves on current procedures by allowing for the introduction of new
exercises but still provides stable estimation of long-term trends. No extensive
research program is required to develop and evaluate new linking methodology.
Instead, equating designs that have been used extensively in a variety of assess-
ment programs are used to ensure that long-term trends can be maintained. As
suggested earlier in this section, this design might be more robust than Designs 3
or 4 to changes in the context of NAEP assessments. For these reasons, even
though it does not eliminate the separate long-term trend NAEP and though it
requires maintaining the current long-term trend NAEP, Design 2 deserves fur-
ther consideration.
ACKNOWLEDGMENTS
The author thanks Roderick Little and two anonymous reviewers for their
comments on a draft of this paper.
REFERENCES
ACT
1997 ACT Assessment Technical Manual. Iowa City, Iowa: ACT.
Allen, N.L., D.L. Kline, and C.A. Zelenak
1996 The NAEP 1994 Technical Report. Washington, D.C.: National Center for Education
Statistics.
Ballator, N.
1996 The NAEP Guide, Revised Edition. Washington, D.C.: National Center for Education
Statistics.
Barron, S.I., and D.M. Koretz
1996 An evaluation of the robustness of the National Assessment of Educational Progress trend
estimates for racial ethnic subgroups. Educational Assessment 3(3):209-248.
Beaton, A.E., and R. Zwick
1990 The Effect of Changes in the National Assessment: Disentangling the NAEP 1985-86
Reading Anomaly. No. 17-TR-21. Princeton, N.J.: Educational Testing Service.
Brennan, R.L., ed.
1989 Methodology Used in Scaling the ACT Assessment and P-ACT+. Iowa City, Iowa: ACT.
Campbell, J.R., K.E. Voelkl, and P.L. Donahue
1997 NAEP 1996 Trends in Academic Progress. Washington, D.C.: National Center for Edu-
cation Statistics.
Donlon, T., ed.
1984 The College Board Technical Handbook for the Scholastic Aptitude Test and Achieve-
ment Tests. New York: College Entrance Examination Board.
Forsyth, R., R. Hambleton, R. Linn, R. Mislevy, and W. Yen
1996 Design Feasibility Team Report to the National Assessment Governing Board. Washing-
ton, D.C.: National Assessment Governing Board.
Glaser, R., R. Linn, and G. Bohrnstedt
1996 Letter to Roy Truby from the National Academy of Education panel on the evaluation of
the NAEP trial state assessment project. February 23.
1997 Assessment in Transition: Monitoring the Nation's Educational Progress. Stanford,
Calif.: National Academy of Education.
Goldstein, H.
1983 Measuring changes in educational attainment over time: Problems and possibilities. Jour-
nal of Educational Measurement 20(4):369-377.
Jones, L.V.
1996 A history of the National Assessment of Educational Progress and some questions about
its future. Educational Researcher 25(7):15-22.
Kolen, M.J., and R.L. Brennan
1995 Test Equating: Methods and Practices. New York: Springer-Verlag.
Koretz, D.M.
1991 State comparisons using NAEP: Large costs, disappointing benefits. Educational Re-
searcher 20(3):19-21.
Linn, R.L.
1993 Linking results of distinct assessments. Applied Measurement in Education 6:83-102.
Mislevy, R.J.
1992 Linking Educational Assessments: Concepts, Issues, Methods, and Prospects. Princeton,
N.J.: ETS Policy Information Center.
Mullis, I.V.S.
1997 Optimizing State NAEP: Issues and Possible Improvements. Report commissioned by
the NAEP Validity Studies Panel. Palo Alto, Calif.: American Institutes for Research.
National Academy of Education
1993 The Trial State Assessment: Prospects and Realities. Stanford, Calif.: National Acad-
emy of Education.
National Assessment Governing Board (NAGB)
1996 Policy Statement on Redesigning the National Assessment of Educational Progress.
Washington, D.C.: NAGB.
1997 Schedule for the National Assessment of Educational Progress. Washington, D.C.:
NAGB.
National Research Council
1996 Evaluation of "Redesigning the National Assessment of Educational Progress." Com-
mittee on Evaluation of National and State Assessments of Educational Progress, Board
on Testing and Assessment. Washington, D.C.: National Academy Press.
O'Sullivan, C.Y., C.M. Reese, and J. Mazzeo
1997 NAEP 1996 Science Report Card for the Nation and the States. Washington, D.C.:
National Center for Education Statistics.
Phillips, G.W.
1991 Benefits of state-by-state comparisons. Educational Researcher 20(3):17-19.
Rust, K.
1996 Sampling issues for redesign. Memorandum to Mary Lyn Bourque, NAGB, May 9.
Spencer, B.
1996 Combining State and National NAEP. Paper prepared for the evaluation of state NAEP
conducted by the National Academy of Education.
Zwick, R.
1991 Effects of item order and context on estimation of NAEP reading proficiency. Educa-
tional Measurement: Issues and Practice 10:10-16.
1992 Statistical and psychometric issues in the measurement of educational achievement trends:
Examples from the National Assessment of Educational Progress. Journal of Educa-
tional Statistics 17(2):205-218.