Appraising the Dimensionality of the 1996
Grade 8 NAEP Science Assessment Data
Stephen G. Sireci, H. Jane Rogers,
Hariharan Swaminathan, Kevin Meara, and
Frédéric Robin
The science assessment of the 1996 National Assessment of Educational
Progress (NAEP) represents significant advances in large-scale assessment. In
particular, this assessment featured carefully constructed "hands-on" performance
tasks considered to better measure real-world science knowledge and skills.
Furthermore, like the other subject tests in the NAEP battery, these assessments
used comprehensive and innovative sampling, scoring, and scaling procedures.
To document the science knowledge and skills of our nation's students, great care
was taken in operationally defining the science domains to be measured on the
assessment. For the 1996 grade 8 science assessment, which is the focus of this
paper, three separate score scales were derived for three separate fields of
science: earth science, life science, and physical science.
The purposes of the research presented here were to evaluate the structure of
the item response data gathered in the 1996 NAEP science assessment and to
compare this structure to the one specified in the framework that governed the
test development process. The dimensions composing this framework are described
in detail by the National Assessment Governing Board (1996) as well as by Sireci
et al. (Chapter 4, this volume). In brief, the framework specified four dimensions:
"fields of science" (a content dimension), "ways of knowing and doing
science" (a cognitive skill dimension), "themes of science," and "nature of
science." The first dimension is particularly important for evaluating the structure
of the assessment data because each item in the assessment was linked to one of
the three fields of science, and separate score scales were derived for each field.
Thus, this first dimension was influential in determining how the test booklets
were constructed, how the booklets were spiraled during test administration, and
how the scores were derived to report the results.
The word dimension as used in the NAEP frameworks refers to theoretical
components that provide a structure for describing what NAEP tests and items
measure. However, dimension has several different meanings in the psychometric
literature (Brennan, 1998). For example, a dimension could be defined
statistically as a latent variable that best accounts for covariation among test items.
Sireci (1997) points out that these two different conceptualizations of test
dimensionality should be related to one another. In summarizing dimensionality
issues related to NAEP, he concludes that "there is an absence of research relating
the theoretical dimensions specified in the content frameworks to the empirical
dimensions arising from analysis of item response data" (p. i).
The present study assessed the dimensionality of the 1996 grade 8 NAEP
science assessment data. The purposes motivating this research are straightforward
and specific. In scaling the data for this assessment, the contractor (Educational
Testing Service, ETS) fit unidimensional item response theory (IRT) models
separately to each of the three fields of science. Thus, the intended structure of
this assessment comprised three unidimensional scales, one each for the earth,
life, and physical sciences. The analyses carried out here were aimed at evaluating
whether the observed item responses conformed to this intended structure. A
further purpose was to determine whether systematic sources of multidimensionality
(which would threaten the validity of the IRT scaling procedure) were present in
these data. These analyses were aimed at gathering critical evidence for evaluating
the validity of inferences derived from NAEP scores.
METHOD
Data
A comprehensive set of analyses was performed on the data obtained from
the 1996 grade 8 NAEP science assessment. Item response data were available
for 11,273 students. The item pool comprised 189 items partitioned into 15
blocks. Each student responded to three blocks of test items, one of which
comprised items associated with one of the four hands-on tasks. These data were
provided by ETS and were the same data used for scoring, scaling, and reporting
the results. A description of these blocks in terms of item types, item content
and cognitive specifications, and sample sizes is presented in Table 5-1.
The results for the 1996 grade 8 NAEP science assessment were reported on
a composite score scale, which was a weighted composite of the three field-of-science
scales. Thus, there are four score scales of interest in evaluating the
assessment: the composite scale and the earth, physical, and life science scales.
TABLE 5-1 Composition of Item Blocks on Grade 8 NAEP Science Assessment

Number of Items Categorized as
Block / Field of Science (Earth, Life, Physical) / Ways of Knowing (CU, PR, SI) / Item Format (MC, CR) / Hands-on? / N
S3 6 1 5 6 Yes 2,961
S4 5 1 3 3 1 5 3 6 Yes 2,739
S5 7 4 3 7 Yes 2,711
S6 2 4 4 2 6 Yes 2,861
S7 12 7 2 3 2 10 No 2,401
S8 10 9 1 5 5 No 2,424
S9 13 10 3 3 10 No 2,401
S10 6 6 4 10 3 3 8 8 No 1,784
S11 7 2 7 8 6 2 8 8 No 1,797
S12 3 7 6 11 2 3 8 8 No 1,806
S13 6 4 5 7 4 4 8 7 No 1,947
S14 3 5 8 13 3 7 9 No 2,412
S15 4 5 6 8 3 4 6 9 No 1,836
S20 6 4 6 8 7 1 8 8 No 1,939
S21 3 6 7 10 5 1 7 9 No 1,797
Total: 62 65 62 108 43 36 73 116
Correlational Analyses
Raw Score Correlations
For each test booklet, correlations were computed among the raw scores for
the field of science content areas. These raw scores were computed by summing
the item scores for those items in a booklet that corresponded to the same content
area. These correlations, computed on the raw score metric, are not equivalent
to correlations among the IRT-derived scaled scores for each field of science.
However, they do provide a preliminary and straightforward indication of the
similarities among the three fields of science. High correlations among these
subscores (e.g., .9 or higher) would provide evidence that the same proficiencies
are being measured by the respective fields of science. On the other hand,
moderate correlations would suggest that more unique proficiencies were being
measured. Both raw correlations and correlations corrected for unreliability were
examined. To obtain disattenuated correlations among the subscores, coefficient
alpha reliabilities were used.
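The correction for attenuation used here has a simple closed form: the observed correlation divided by the square root of the product of the two reliability estimates. The sketch below uses hypothetical reliability values rather than the chapter's actual estimates, and truncates corrected values at 1.0 (the treatment the chapter applies later):

```python
import numpy as np

def disattenuate(r_xy, rel_x, rel_y):
    """Correct an observed correlation for unreliability.

    Divides the observed correlation by the square root of the product of
    the two reliability estimates; values above 1.0 are truncated to 1.0.
    """
    r_true = r_xy / np.sqrt(rel_x * rel_y)
    return min(r_true, 1.0)

# Hypothetical example: observed subscore correlation .69, alphas .70 and .72.
r_corrected = disattenuate(0.69, 0.70, 0.72)
```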
IRT-Derived Theta Correlations
While correlations among the raw subscores provide useful information
regarding the structure of the data, correlations among the IRT-derived ability
(theta) scores may be more appropriate for examining the dimensionality of the
scaled scores, since NAEP analyses are based on these derived scores. To determine
these ability scores, the items comprising each block were calibrated separately
using IRT (see description below). The resulting item parameters were then used
to compute a "block proficiency estimate" (block theta estimate) for each student.
Because each student responded to three item blocks, three separate theta
estimates were computed for each student. The correlations among these theta
estimates were compared with the content composition of each block. The logic
motivating this analysis was as follows: if high (disattenuated) correlations were
observed among blocks that measured more than one field of science, this would
be evidence that the three fields were measuring one general dimension of science
proficiency. On the other hand, if the correlations between blocks measuring
different fields of science were substantially lower than those between blocks
measuring the same field, this would be evidence of relatively unique dimensions
measured by each field. Thus, we were interested in both the magnitude and the
pattern of these correlations. The theta correlations were disattenuated (corrected
for unreliability)
using the marginal reliabilities estimated in the calibration of each block. As
described below, the sample sizes per block were sufficient for estimating
individual student thetas. However, our block-level scaling treated each block as
if it measured a single latent trait, thus ignoring the explicit scaling structure used
in the operational NAEP scaling. Another potential limitation of this analysis is
that there may be too few items per field within a block to provide unique variance
associated with that field. Nevertheless, inspecting these correlations against the
expectations described above provided a different lens through which to view the
idea of composite and separate science proficiency scales.
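Because of the BIB design, any block-pair correlation can use only the students who took both blocks. The sketch below simulates that situation with entirely hypothetical values (5 blocks, 3 administered per student, an assumed common marginal reliability of .75; none of these are the chapter's figures) and computes one pairwise disattenuated theta correlation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 5 blocks, each student takes only 3 (NaN = not taken).
n_students, n_blocks = 1000, 5
true_theta = rng.normal(size=n_students)
thetas = true_theta[:, None] + rng.normal(scale=0.6, size=(n_students, n_blocks))
for i in range(n_students):
    not_taken = rng.choice(n_blocks, size=n_blocks - 3, replace=False)
    thetas[i, not_taken] = np.nan

marginal_rel = np.full(n_blocks, 0.75)  # assumed marginal reliabilities

def disattenuated_block_corr(x, y, rel_x, rel_y):
    """Correlate two blocks' theta estimates over students who took both."""
    both = ~np.isnan(x) & ~np.isnan(y)
    r = np.corrcoef(x[both], y[both])[0, 1]
    return min(r / np.sqrt(rel_x * rel_y), 1.0)

r01 = disattenuated_block_corr(thetas[:, 0], thetas[:, 1],
                               marginal_rel[0], marginal_rel[1])
```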
Principal Components Analysis
As a preliminary check on dimensionality, data from four test booklets were
analyzed using principal components analysis (PCA). PCA could not be used to
evaluate the dimensionality of the whole set of 189 items simultaneously because
of the balanced incomplete block (BIB) spiral design. The four booklets chosen
(numbers 209, 210, 231, and 232) involved 12 of the 15 blocks (152 of the 189
items) and included all four hands-on tasks. Separate PCAs were conducted on
each booklet. The eigenvalues associated with the extracted components, and the
percentages of variance in the item data accounted for by these components, were
used to evaluate the dimensionality of each booklet.
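The eigenvalue criterion just described can be computed directly from the inter-item correlation matrix. A sketch on simulated, essentially unidimensional item scores (the item counts and loadings are illustrative only, not the NAEP data):

```python
import numpy as np

def pca_eigen_summary(scores):
    """Return PCA eigenvalues and proportions of variance for item scores.

    `scores` is an (n_students, n_items) array; components are extracted
    from the inter-item correlation matrix.
    """
    corr = np.corrcoef(scores, rowvar=False)
    eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]  # descending order
    return eigvals, eigvals / eigvals.sum()

rng = np.random.default_rng(1)
ability = rng.normal(size=(300, 1))           # one dominant latent trait
items = ability + rng.normal(size=(300, 20))  # 20 noisy indicators of it
eigvals, props = pca_eigen_summary(items)     # first eigenvalue dominates
```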
IRT Residual Analyses
The fit of IRT models to the data was evaluated directly by calibrating each
block using a unidimensional IRT model. The decision to calibrate each block
separately was motivated by sample size considerations (i.e., the booklet-level
sample sizes were too small for IRT scaling) and by the presence of large blocks of
incomplete data in the student-by-item matrix (11,273 students by 189 items).
Because each student responded to only about 36 items on average, the entire
pool could not be calibrated concurrently due to the inability to properly estimate
the interitem covariance matrix. In the operational scaling of NAEP, this problem
is overcome by using the plausible values methodology (i.e., by conditioning the
calibration on a comprehensive vector of covariates derived from student background
variables; see Mislevy et al., 1992, for more complete details of the
NAEP scaling methodology). This conditioning was not possible given the time
and software limitations of this study. Thus, these block-specific calibrations
evaluated model-data fit in a manner independent of the plausible values
methodology. If the data comprising a block are essentially unidimensional, these
IRT calibrations should exhibit good fit. As can be seen from Table 5-1, the
sample sizes were appropriate for calibrating each block using an IRT model: the
smallest sample size was 1,784, and the largest number of parameters estimated
in any of the calibrations was 49.
All IRT calibrations were conducted using the computer program MULTILOG,
version 6.1 (Thissen, 1991). The multiple-choice items were calibrated using a
three-parameter IRT model (3P),1 and short constructed-response items that were
scored dichotomously were calibrated using a two-parameter IRT model (2P).
These models were identical to those used by ETS in calibrating these same
items. For the constructed-response items that were scored polytomously (i.e., a
student could earn a score greater than one), Samejima's (1969) graded response
(GR) model was used. The GR model is similar but not equivalent to the
Generalized Partial Credit (GPC) model (Muraki, 1992) used by ETS to calibrate
these items. In both the GPC and the GR models, a common slope (discrimination)
parameter is assumed for the response functions of each item score category,
while separate threshold (location) parameters are assumed for each score
category. However, because of the dependency that exists among the threshold
parameters (i.e., the choice of the first k − 1 categories determines whether an
examinee chooses the last category), the number of location parameters for an
item is one less than the number of response categories. For example, a
constructed-response item scored from zero to three (i.e., four response categories)
is modeled using four parameters: one common slope parameter and three
location parameters.

1 For the 3P models, priors were used on the c parameters, where each prior was equivalent to the
reciprocal of the number of response options for the item. The effect of these priors was evaluated
by also calibrating the items without the priors. The results were very similar, which was not
surprising given the relatively large sample sizes.
Although calibrating the polytomously scored items with the GR model
using MULTILOG differs from fitting the GPC model using PARSCALE (which
was used by ETS), the effects of this difference are considered minimal
given the purpose of the analyses (i.e., to determine departures of the response
data from unidimensionality). MULTILOG was used in this study because the
modified version of PARSCALE used by ETS to calibrate the NAEP items was
not publicly available.
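To make the parameter counting above concrete, here is a sketch of category probabilities under Samejima's graded response model. The slope and thresholds below are purely illustrative, not estimates from the NAEP calibration; the point is that an item scored 0-3 needs one common slope and three ordered thresholds:

```python
import numpy as np

def graded_response_probs(theta, a, b):
    """Category probabilities under Samejima's (1969) graded response model.

    a : common slope (discrimination); b : increasing thresholds b_1..b_m
    for an item scored 0..m.  P(X >= k) = logistic(a * (theta - b_k)), and
    each category probability is a difference of adjacent cumulatives.
    """
    b = np.asarray(b, dtype=float)
    p_star = 1.0 / (1.0 + np.exp(-a * (theta - b)))  # P(X >= k), k = 1..m
    cum = np.concatenate(([1.0], p_star, [0.0]))     # P(X >= 0) = 1
    return cum[:-1] - cum[1:]

# Illustrative item scored 0-3: four parameters (1 slope + 3 thresholds).
probs = graded_response_probs(theta=0.0, a=1.2, b=[-1.0, 0.0, 1.0])
```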
To evaluate IRT model-data fit, a residual analysis was performed using the
program POLYFIT (Rogers, 1996). The POLYFIT program uses the item and
person parameter estimates obtained from MULTILOG to compute the expected
score for examinees at a given proficiency (theta) level. These expected scores
are compared with the corresponding average observed scores, and residuals are
computed. Specifically, the group of examinees is divided into 12 equal theta
intervals constructed in the range (mean theta ± 3 standard deviations), with
interval width equal to .5 standard deviations. The midpoint of each interval is
used to calculate the expected score in that interval, Σk kP(k), where k is the
category score and P(k) is the probability that an individual with the given theta
will score in category k. The difference between the average observed score and
the expected score in each interval is computed. This residual is then standardized
by dividing by the standard error, which is obtained from the standard deviation
of the discrete random variable,

√( Σk k²P(k) − [Σk kP(k)]² ),
where k and P(k) are defined above. The standardized residuals computed at the
score level are analogous to those routinely computed for dichotomous IRT
models by comparing observed and expected proportions correct. Standardized
residuals are reported only for cells with a frequency of 10 or more. The
standardized residuals may be examined for each item to assess the fit of individual
items. In addition, a frequency distribution of the standardized residuals over
items is provided as a summary of the overall fit of the model.
When the model fits the data, the distribution of standardized residuals should
be symmetric with a mean close to zero. While there is no theory to show that the
residuals are normally distributed when the model fits the data, it is reasonable to
expect a roughly normal distribution, with few standardized residuals greater
than three in absolute value.
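The interval-residual procedure described above can be paraphrased in code. This is a sketch of the described logic, not POLYFIT itself: it assumes the standard error is the discrete-score standard deviation divided by the square root of the interval frequency, and it uses a simple dichotomous logistic item for illustration:

```python
import numpy as np

def standardized_residuals(thetas, observed, probs_fn, n_intervals=12, min_n=10):
    """Interval-level standardized residuals for one item.

    thetas   : theta estimate per examinee
    observed : observed item score per examinee
    probs_fn : maps a theta value to category probabilities P(k)
    Examinees are grouped into equal-width intervals spanning
    mean(theta) +/- 3 SD; intervals with fewer than min_n examinees skipped.
    """
    mean, sd = thetas.mean(), thetas.std()
    edges = np.linspace(mean - 3 * sd, mean + 3 * sd, n_intervals + 1)
    residuals = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_cell = (thetas >= lo) & (thetas < hi)
        n = in_cell.sum()
        if n < min_n:
            continue
        p = probs_fn((lo + hi) / 2.0)       # P(k) at the interval midpoint
        k = np.arange(len(p))
        expected = (k * p).sum()            # expected score, sum_k k * P(k)
        var = (k ** 2 * p).sum() - expected ** 2
        se = np.sqrt(var / n)               # assumed: SD / sqrt(n)
        residuals.append((observed[in_cell].mean() - expected) / se)
    return np.array(residuals)

# Illustration with a dichotomous logistic item that truly fits the data.
rng = np.random.default_rng(2)
theta = rng.normal(size=5000)
p_correct = 1.0 / (1.0 + np.exp(-theta))
scores = (rng.random(5000) < p_correct).astype(float)
res = standardized_residuals(
    theta, scores,
    lambda t: np.array([1.0 - 1.0 / (1.0 + np.exp(-t)),
                        1.0 / (1.0 + np.exp(-t))]))
```

With a correctly specified model, the residuals should hover near zero, mirroring the symmetric, roughly normal distribution the text describes.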
A chi-square statistic is also calculated using observed and expected
frequencies of examinees in each score category. Expected frequencies are obtained
by calculating the probability that an examinee at the midpoint of each theta
interval would score in each response category. Results of the chi-square analysis
should be interpreted with caution. The statistic is at best only approximately
distributed as a chi-square; it has the usual failings of IRT chi-square fit statistics
in that it is sensitive to sample size, to the arbitrary nature of the theta intervals,
and to heterogeneity in the theta levels of examinees grouped in the same interval.
Hence, it should be used only descriptively and the significance level ignored.
It was not hypothesized that all 15 blocks could be fit adequately using
a single unidimensional scale. In fact, such a hypothesis is contrary to the scaling
models used to score the assessment. As seen in Table 5-1, not all blocks
comprise items from a single field of science. Those blocks that do comprise
items from a single field of science (blocks S3, S5, S7, S8, and S9) should conform
to a unidimensional scale (i.e., exhibit relatively small, normally distributed
standardized residuals). Conversely, those blocks containing items from more than
one field may, unsurprisingly, depart from unidimensionality (i.e., exhibit relatively
larger, nonnormally distributed standardized residuals). Thus, the hypotheses
motivating our block calibration and residual analyses involved comparing the
results of the residual analyses with the a priori expectations of dimensionality
given the content-area designations of the items composing a given block. More
specifically, if blocks containing items from only one field of science exhibited
small residuals and blocks containing items from two or three fields exhibited
larger residuals, this would be evidence of three separate scales corresponding
to the three fields of science specified in the framework.
Factor Analyses
Factor analyses (FAs) were also conducted to evaluate the dimensionality of
each block. Evaluation of the dimensionality of each block using FA provides an
independent assessment of dimensionality from that obtained by assessing the fit
of a unidimensional IRT model to the data. It should be pointed out that FA is
appropriate only when the relationship between item responses and the underlying
trait is linear. When the relationships between item responses and the underlying
traits are nonlinear, procedures based on nonlinear factor models are
necessary. Item response theory is an example of a nonlinear factor analysis
procedure and is the procedure of choice for evaluating the dimensionality of
nonlinear data. The problem is that currently only unidimensional IRT models
(for dichotomous and polytomous responses) have been implemented in
commercially available software. Multidimensional IRT models have been proposed,
but the software necessary for analyzing data with them does not yet exist. One exception is the
nonlinear factor analysis procedure developed by McDonald (1967) in which
nonlinear trace lines are approximated by polynomials. The computer program
NOHARM implements this procedure; however, the program is not designed to
handle polytomous data. Given these considerations, the linear factor model was
used as an approximation to nonlinear models to evaluate the dimensionality of
the data, especially when examining the hypothesis that several dimensions
underlie the responses.
A one-factor model was fit to the data for each block of items. If a block
contained items from two fields, a two-factor model was fit, with items from
each field constrained to load on two separate factors. If a block contained items
from all three fields, a constrained three-factor model was fitted. Because of
these constraints, the two- and three-factor models were analyzed using the
confirmatory, rather than exploratory, factor analysis procedure; there is no
distinction between confirmatory and exploratory procedures for the one-factor
model.
The analyses were carried out using the LISREL 8 computer program
(Joreskog and Sorbom, 1993). The correlation matrices analyzed were based on
product-moment as well as tetrachoric and polyserial correlations. When the
correlation matrix was based on tetrachoric or polyserial correlations, the
generalized least squares procedure was used in place of the maximum likelihood
procedure if the correlation matrix was not positive definite.
The fit of each model was evaluated by examining the goodness-of-fit index
(GFI), the adjusted goodness-of-fit index (AGFI), and residuals, rather than the
likelihood ratio statistic. When the data are nonlinear, particularly when they are
nonnormal and when tetrachoric/polyserial correlations are used, the likelihood
ratio statistic is unreliable. The GFI and the AGFI provide adequate assessments
of dimensionality (Tanaka and Huba, 1985). For this study, values of GFI and
AGFI greater than .90 were taken as indications of adequate fit of the model.
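For intuition, the GFI has a simple closed form under an unweighted least squares fit; LISREL's maximum-likelihood-based GFI differs in detail, so the sketch below is an illustrative approximation on hypothetical matrices, not a reproduction of the chapter's analysis:

```python
import numpy as np

def gfi_agfi(S, Sigma_hat, n_free_params):
    """GFI and AGFI in their unweighted least squares form.

    GFI  = 1 - tr((S - Sigma_hat)^2) / tr(S^2)
    AGFI = 1 - [p(p + 1) / (2 df)] * (1 - GFI), df = p(p+1)/2 - n_free_params
    """
    p = S.shape[0]
    resid = S - Sigma_hat
    gfi = 1.0 - np.trace(resid @ resid) / np.trace(S @ S)
    df = p * (p + 1) / 2 - n_free_params
    agfi = 1.0 - (p * (p + 1) / (2 * df)) * (1.0 - gfi)
    return gfi, agfi

# Hypothetical one-factor model: implied matrix L L' with unit diagonal.
L = np.full((6, 1), 0.7)
Sigma_hat = L @ L.T
np.fill_diagonal(Sigma_hat, 1.0)
S = Sigma_hat + 0.01            # observed matrix with small uniform misfit
np.fill_diagonal(S, 1.0)
gfi, agfi = gfi_agfi(S, Sigma_hat, n_free_params=12)  # 6 loadings + 6 uniquenesses
```

With the tiny misfit built in here, both indices land well above the .90 cutoff, which is the pattern the text treats as adequate fit.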
Multidimensional Scaling Analyses
Multidimensional scaling (MDS) was used to evaluate the dimensionality of
all the dichotomously scored items. These analyses followed the unidimensionality
testing procedure developed by Chen and Davison (1996). This procedure
involves computing pseudo paired-comparison (PC) statistics that represent the
similarity between two dichotomously scored items, as determined from
examinees' performance on the items. Given this restriction, the MDS analyses were
conducted using only the multiple-choice items and those short constructed-response
items that were scored dichotomously. Chen and Davison recommend
fitting one- and two-dimensional MDS models to the matrix of item PC
statistics and comparing the results. If the one-dimensional model fits the data
well, the coordinates correlate highly with the item difficulties (p values), and a
C- or U-shaped pattern is observed in two dimensions (suggesting overfitting of
the data), the data can be considered unidimensional. This comparison is
qualitative rather than relying on a statistical index. Two descriptive fit indices were
used to evaluate the fit of the MDS models to the data: STRESS and R2. The
STRESS index represents the square root of the normalized residual variance of
the monotonic regression of the MDS distances on the transformed PC statistics;
thus, lower values of STRESS indicate better fit. The R2 index reflects the
proportion of variance of the transformed data accounted for by the MDS distances;
thus, higher values of R2 indicate better fit. In general, STRESS values near or
below .10 and R2 values of .90 or greater indicate reasonable data-model fit. In
all, 91 dichotomously scored items (73 multiple-choice and 18 short
constructed-response items) were analyzed using MDS.
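The geometry behind the STRESS index can be illustrated with a toy example. The sketch below uses classical (metric) MDS and Kruskal's Stress-1 formula, omitting the monotone-regression step of the nonmetric procedure the authors used; the dissimilarities are hypothetical, not actual pseudo paired-comparison statistics:

```python
import numpy as np

def classical_mds(D, n_dims):
    """Embed a dissimilarity matrix D in n_dims via classical (metric) MDS."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J                    # double-centered matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    top = np.argsort(eigvals)[::-1][:n_dims]       # largest eigenvalues
    return eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0.0))

def stress_1(D, X):
    """Kruskal's Stress-1 between dissimilarities D and configuration X."""
    d_hat = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    off = ~np.eye(D.shape[0], dtype=bool)
    return np.sqrt(((D - d_hat)[off] ** 2).sum() / (d_hat[off] ** 2).sum())

# Hypothetical items lying along a single dimension: the one-dimensional
# solution fits with near-zero STRESS, the signature of unidimensionality.
rng = np.random.default_rng(3)
x = rng.uniform(size=10)
D = np.abs(x[:, None] - x[None, :])
X1 = classical_mds(D, 1)
fit = stress_1(D, X1)
```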
RESULTS
Principal Components Analysis
As mentioned above, data from four test booklets were analyzed using
principal components analysis (PCA). These four booklets (209, 210, 231, and
232) involved 12 of the 15 blocks (152 of the 189 items) and included all four
hands-on tasks. Separate PCAs were conducted on each booklet. The number of
items composing each booklet ranged from 33 to 40. The sample sizes for the
booklets were approximately the same, ranging from 274 to 284.
Booklet 209 comprised 38 items from blocks S3, S11, and S12: 10 earth, 9
life, and 19 physical science items (16 multiple-choice and 22 constructed-response
items). The first principal component (eigenvalue = 12.4) accounted for
33 percent of the variance. However, the second component was also relatively
large (eigenvalue = 5.8), accounting for 15 percent of the variance. Inspection of
the unrotated component (factor) loading matrix revealed 10 items with loadings
below .3 on the first factor. These items came from different blocks and content
areas, but all were constructed-response items. (Five of these items had loadings
larger than .30 on the second factor.) The scree plot for booklet 209 is presented
in Figure 5-1.
Booklet 210 comprised 40 items from blocks S4, S13, and S14: 14 earth, 10
life, and 16 physical science items (18 multiple-choice and 22 constructed-response
items). The first principal component (eigenvalue = 11.0) accounted for
28 percent of the variance, and the second principal component (eigenvalue =
3.3) accounted for 9 percent of the variance. Inspection of the unrotated factor
loadings revealed three items with loadings less than .3 on the first factor. Two of
these items came from block S4; the other was from block S13. All three were
earth science items. One was a constructed-response item from block S13; the
other two were from block S4, one of which was a multiple-choice item. The
scree plot for this booklet is presented in Figure 5-2.
Booklet 231 comprised 40 items from blocks S5, S10, and S21: 17 earth, 12
life, and 11 physical science items (15 multiple-choice and 25 constructed-response
items). The first principal component (eigenvalue = 17.0) accounted for
45 percent of the variance and the second principal component (eigenvalue = 4.0)
for 11 percent. Inspection of the unrotated factor loadings revealed five items
[Figure: scree plot of eigenvalues by component number.]
FIGURE 5-1 Scree plot from PCA for booklet 209.
[Figure: scree plot of eigenvalues by component number.]
FIGURE 5-2 Scree plot from PCA for booklet 210.
with loadings less than .3 on the first factor: four constructed-response earth
science items from block S5 and a constructed-response life science item from
block S10. Only one of the five constructed-response items had a relatively large
loading on the second factor. The scree plot for this booklet is presented in
Figure 5-3.
Booklet 232 comprised 33 items from blocks S6, S7, and S15: 16 earth, 7
life, and 10 physical science items (8 multiple-choice and 25 constructed-response
items). The first principal component (eigenvalue = 9.3) accounted for 28
percent of the variance and the second principal component (eigenvalue = 4.9) for
15 percent. Inspection of the unrotated factor loadings revealed five items with
loadings of less than .3 on the first factor: three constructed-response life science
items (one from block S6 and two from block S15) and two physical science
items from block S15 (one of which was a multiple-choice item). The scree plot
for this booklet is presented in Figure 5-4.
Comparing the percentages of variance accounted for by the first two
components, booklets 210 and 231 appear to be unidimensional: the first component
accounts for three times as much variance as the second component in each of
these two booklets. A case for unidimensionality may also be made for all four
booklets because of the relatively large percentage of variance accounted for by
the first component (a minimum of 28 percent). However, a substantial proportion
of variance is accounted for by the second factor in all four booklets (especially
booklets 209 and 232), and each booklet exhibited some items with higher
loadings on a factor other than the first. Thus, the PCAs indicate a small degree
of multidimensionality in these data. This multidimensionality was not linked to
content area or cognitive level, although it was noted that some of the
constructed-response items had small loadings on the first factor. It should also
be noted that PCA has been widely criticized for producing spurious factors when
applied to test score data.
Raw Score Correlational Analysis
The relationships among the three fields of science were also evaluated at the
booklet level by deriving three "content-area raw scores" for each student. The
correlations among these earth, life, and physical science raw scores were then
calculated. Raw scores derived from booklets containing only a few items
corresponding to a field of science (specifically, those raw scores that produced a
scale less than 10 points in length) were eliminated from this correlational analysis.
In addition, raw scores with internal consistency (coefficient alpha) reliabilities
of less than .50 were eliminated. This process resulted in 21 correlations between
earth and physical science raw scores, 17 correlations between life and physical
science raw scores, and 15 correlations between earth and life science raw scores.
The 21 earth-physical correlations ranged from .61 to .79.
[Figure: scree plot of eigenvalues by component number.]
FIGURE 5-3 Scree plot from PCA for booklet 231.
[Figure: scree plot of eigenvalues by component number.]
FIGURE 5-4 Scree plot from PCA for booklet 232.
The median correlation was .69. After disattenuation (correction for measurement
error),2 these correlations ranged from .83 to 1.0, and the median correlation was
.99. The 17 physical-life correlations ranged from .54 to .73; the median correlation
was .64. After disattenuation, these correlations ranged from .83 to 1.0, with a
median correlation of .97. The 15 earth-life correlations ranged from .53 to .71,
with a median correlation of .62. After disattenuation, these correlations ranged
from .83 to 1.0, with a median correlation of .91. The magnitudes of the median
disattenuated correlations (.99, .97, and .91) suggest that the three fields of science
were essentially measuring the same construct. The results of these correlations
are summarized in Table 5-2.
Results Stemming from IRT Analyses
As described earlier, MULTILOG was used to calibrate each of the 15
science item blocks. Unfortunately, an unidentified problem internal to
MULTILOG prevented calibration of block S7 (a block comprising 12 earth
science items). Successful item calibrations were obtained for the other 14 blocks;
however, we were unable to estimate thetas based on students' responses to block
S14 (a block comprising eight physical, five life, and three earth science items).
The marginal reliabilities for the 14 calibrated blocks ranged from
.39 (block S6, a hands-on block containing four physical and two life
science constructed-response items) to .80 (block S4, a mixed hands-on
block comprising six constructed-response items and three multiple-choice
items). The median marginal reliability across the 14 blocks was .75.
Correlations Among Separate (Block) Theta Estimates
As the description "balanced incomplete block spiraling design" indicates,
not all of the 15 item blocks were paired with one another. Thus, our analyses of
the block-specific thetas estimated for each student included all available
correlations among the blocks that were successfully calibrated and scored using
MULTILOG (except block S6, which exhibited inadequate marginal reliability)
and that were paired together in at least one test booklet. Each student responded
to three blocks of items; thus, three separate "block" thetas were computed for
each student. A total of 56 block theta correlations were computed. There were
no data available for computing correlations between an earth science block and a
2 The disattenuated correlations were computed by dividing the raw score correlation by the square
root of the product of the reliability estimates for the two content-area raw scores. Because the alpha
coefficient is known to be an underestimate of reliability (Novick, 1966), the disattenuated
correlations are overestimates and may at times be greater than one. Nine of the 53 disattenuated
correlations were greater than one: six earth-physical correlations, two earth-life correlations, and
one physical-life correlation. These correlations were truncated to 1.0.
OCR for page 101
114 APPRAISING THE DIMENSIONALITY OF THE NAEP SCIENCE ASSESSMENT DATA
TABLE 52 Summary of Field of Science Raw Score Correlations
Number of Unadjusted Correlations Disattenuated Correlations
Scores Correlated Correlations Range Median Range Median
Earth and physical 21 .61 to .79 .69 .83 to 1.0 .99
Life and physical 17 .54 to .73 .64 .83 to 1.0 .97
Earth and life 15 .53 to .71 .62 .83 to 1.0 .91
physical science block; however, there were two correlations available for both
earthlife science comparisons and lifephysical science comparisons. The remaining
52 correlations involved seven correlations among an earth science block (block
S5) and "mixed"item blocks (i.e., blocks containing items from all three fields of
science), 23 correlations among life science (blocks S8 or S9) and mixeditem
blocks; seven correlations among a physical science block (block S3) and mixed
item blocks, and 15 correlations among mixeditem blocks. All correlations were
disattenuated using the MULTILOG marginal reliability estimates.
A summary of the theta-based correlational analyses is presented in Table 5-3. The unadjusted correlations among thetas derived from mixed-item blocks (15 correlations) ranged from .50 to .79. After correcting for measurement error, these correlations ranged from .76 (S21, S10) to 1.00,³ with a median correlation of .87. The range and relatively large disattenuated correlations suggest that these mixed blocks, containing items from all three fields of science, were probably measuring the same general science proficiency construct. The magnitude of these correlations among the mixed-item blocks was similar to or higher than the observed marginal reliabilities for these blocks.
The unadjusted correlations for the earth-mixed comparisons ranged from .40 to .59 and, after disattenuation, from .58 (S5, S21) to .77 (S5, S11); the median disattenuated correlation was .68. The two disattenuated earth-life correlations were .63 and .72. These correlations are lower than those observed for the mixed-block correlations, leaving open the possibility that the earth science items, at least those included in block S5, measure a slightly different construct than general science proficiency.
The unadjusted correlations for the physical-mixed comparisons ranged from .44 to .60 and, after disattenuation, from .63 (S3, S21) to .82 (S3, S11), with a median disattenuated correlation of .76. The two life-physical disattenuated correlations were .72 and .85. These correlations are also relatively low, suggesting that the physical science items may also be measuring a somewhat unique domain of science proficiency.
³Actually, three of the 56 disattenuated correlations were slightly greater than one; all were from correlations of thetas derived from two mixed blocks.
S.G. SIRECI, H.J. ROGERS, H. SWAMINATHAN, K. MEARA, AND F. ROBIN
TABLE 5-3 Summary of Block-Derived Theta Correlations

                                      Unadjusted Correlations   Disattenuated Correlations
Types of Blocks      Number of
Correlated           Correlations     Range        Median       Range        Median
Mixed and mixed      15               .50 to .79   .66          .76 to 1.0   .87
Life and mixed       14               .50 to .70   .64          .73 to .94   .88
Physical and mixed   7                .44 to .60   .56          .63 to .82   .76
Earth and mixed      7                .40 to .59   .51          .58 to .77   .68

Notes: The earth and mixed correlations are between block S5 and mixed blocks; the physical and mixed are between block S3 and mixed blocks; and the life and mixed are between blocks S8 or S9 and mixed blocks. Mixed blocks contain items from all three fields of science.
The unadjusted correlations for the life-mixed comparisons ranged from .50 to .70. The disattenuated correlations ranged from .73 (S5, S8) to .94 (S9, S20), and the median correlation was .88. The magnitudes of the disattenuated correlations suggest that the life science items (blocks S8 and S9) may be more closely related to general science proficiency than the earth and physical science items.
POLYFIT Analyses
The fit of the IRT models for each block was evaluated using POLYFIT (Rogers, 1996). Distributions of the standardized residuals generated from the POLYFIT program are presented in Table 5-4. Estimates could not be obtained for four of the 15 blocks (S4, S7, S9, and S14). For blocks S10 through S21 the unidimensional IRT models appear to fit the data adequately. The residual analyses show that most residuals are close to zero, with only a small proportion (no more than about 5 percent) falling outside the range (-3, 3).

TABLE 5-4 Distribution of Standardized Residuals for Each Block

        Theta Interval
Block   < -3    -3 to -2   -2 to -1   -1 to 0   0 to 1   1 to 2   2 to 3   > 3     Mean
S3      4.55    0          4.55       42.42     42.42    1.52     1.52     3.03    0.05
S5      3.90    1.30       2.60       40.26     40.26    1.30     2.60     7.79    0.13
S6      1.67    1.67       5.00       33.33     48.33    3.33     1.67     5.00    0.11
S8      11.00   3.00       7.00       28.00     30.00    13.00    6.00     2.00    0.15
S10     2.50    3.13       5.63       36.25     36.25    9.38     5.63     1.25    0.04
S11     3.13    3.75       6.88       35.00     37.50    8.75     3.75     1.25    0.04
S12     2.78    3.47       5.56       32.64     40.97    9.72     2.78     2.08    0.01
S13     2.96    3.70       8.89       23.70     39.26    8.89     8.89     3.70    0.11
S15     2.67    2.67       5.33       39.33     36.00    12.00    2.00     0       0.04
S20     3.13    3.13       10.00      31.88     31.88    16.25    3.13     0.63    0.04
S21     1.88    0.63       7.50       41.88     38.13    6.88     2.50     0.63    0.02

Notes: Table entries are percentages of residuals falling within each theta interval. Estimates could not be obtained for blocks S4, S7, S9, and S14.
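Standardized residuals of this kind can be computed for a dichotomous item by binning examinees on theta and comparing observed and model-implied proportions correct. A simplified sketch (not the actual POLYFIT algorithm; the item parameters and response data below are simulated):

```python
import numpy as np

def standardized_residuals(theta, responses, irf, edges):
    """Within each theta interval, compare the observed proportion correct
    with the model-implied probability at the interval's mean theta,
    standardized by the binomial standard error."""
    theta = np.asarray(theta)
    responses = np.asarray(responses)
    out = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (theta >= lo) & (theta < hi)
        n = mask.sum()
        if n == 0:
            continue
        p_exp = irf(theta[mask].mean())        # model-implied P(correct)
        p_obs = responses[mask].mean()         # observed proportion correct
        se = np.sqrt(p_exp * (1 - p_exp) / n)  # binomial standard error
        out.append((p_obs - p_exp) / se)
    return out

# Hypothetical two-parameter item (a = 1.2, b = 0.0) and simulated responses
irf = lambda t: 1.0 / (1.0 + np.exp(-1.2 * (t - 0.0)))
rng = np.random.default_rng(0)
theta = rng.normal(size=2000)
responses = (rng.random(2000) < irf(theta)).astype(int)
res = standardized_residuals(theta, responses, irf, np.arange(-3.0, 3.5, 1.0))
```

When the data are generated from the fitted model, as here, the residuals should mostly fall within the (-3, 3) range discussed above.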
For blocks S3, S5, S6, and S8, the model did not fit as well. Blocks S3, S5,
and S6 consist of performance tasks. These blocks have six, eight, and six items,
respectively. Blocks S3 and S5 contain items from only one scale, while block
S6 has four physical items and two life items. Block S8 consists of 10 items, all
measuring life science.
For block S3, examination of the residuals reveals that most of the large ones were obtained from item 1. This was the only dichotomously scored item in the block, fitted using the two-parameter model. For this item the residuals tended to be negative at the low end of the proficiency continuum and positive at the high end, suggesting that the a-parameter may have been underestimated. In block S5, items 5 and 7 appeared to fit poorly. Both of these items were dichotomously scored and fitted using the two-parameter model. Item 5 showed no clear pattern in the residuals, while item 7 produced large positive residuals at the upper theta levels. In block S6, item 4 yielded poor fit. This item, again, was a dichotomously scored item, fitted using the two-parameter model. It was a very difficult item. The residuals showed the same pattern as was observed for the other poorly fitting items. For block S8 all of the dichotomously scored items (5 MC, 1 2P) showed some degree of misfit, with the largest (negative) residuals occurring at the low end of the proficiency continuum. A summary of the POLYFIT analyses is presented in Table 5-5.
The summary of the POLYFIT analyses presented in Table 5-5 illustrates that the results were contrary to our expectations. We expected blocks comprising items from only one field of science to be fit well by the unidimensional IRT models, and blocks comprising items from more than one field of science not to be fit well by these models. However, the opposite pattern emerged: blocks comprising items from all three fields of science were fit adequately using IRT, and those blocks comprising items from a single field of science exhibited relatively poor fit.

TABLE 5-5 Summary of POLYFIT Results

Block   Content   Item Types   Expectation   Result         Problem Items
S3      P         6 CR         Good fit      Poor fit       1 2P item
S5      E         8 CR         Good fit      Poor fit       2 2P items
S6      L, P      6 CR         Poor fit      Poor fit       1 2P item
S8      L         5 MC, 5 CR   Good fit      Poor fit       5 MC, 1 2P item
S10     E, L, P   8 MC, 8 CR   Poor fit      Adequate fit
S11     E, L, P   8 MC, 8 CR   Poor fit      Adequate fit
S12     E, L, P   8 MC, 8 CR   Poor fit      Adequate fit
S13     E, L, P   8 MC, 8 CR   Poor fit      Adequate fit
S15     E, L, P   6 MC, 9 CR   Poor fit      Adequate fit
S20     E, L, P   8 MC, 8 CR   Poor fit      Adequate fit
S21     E, L, P   7 MC, 9 CR   Poor fit      Adequate fit

Notes: E = earth science, L = life science, P = physical science; CR = constructed-response item; MC = multiple-choice item; 2P = dichotomously scored constructed-response item.
Factor Analyses
A summary of the results of the factor analyses is presented in Table 5-6. A one-dimensional factor model was fit to each block. Fortunately, the results obtained with the Pearson product-moment correlations and the tetrachoric/polyserial correlations did not differ substantially; hence, only the results based on the product-moment correlations are provided. The goodness-of-fit indices (GFI and AGFI) were used to evaluate the model-data fit, which was considered reasonable when GFI and AGFI were equal to or greater than .90. As shown in Table 5-6, 12 of the 15 blocks were adequately fit using the one-factor model, indicating that the data can be considered unidimensional. Blocks S3 and S14 came close to meeting the fit criterion: the GFI for both blocks exceeded .90, but the AGFI was .89 for both. The only block that did not meet the fit criterion was block S5, a block of hands-on earth science items: its GFI was .84 and its AGFI .72. The other hands-on tasks (blocks S3, S4, and S6) fitted the one-factor model adequately. Because S5 was made up of items from one content area, a multifactor model was not fit to its item responses. Given the high fit index values obtained with the one-factor model for all of the blocks, the acceptable fit values obtained with S3 and S14, and the fact that S5 comprised items from a single content area, only the results obtained from fitting a one-factor model are presented in Table 5-6. Nevertheless, two- and three-factor confirmatory factor analyses were carried out for block S14. Unfortunately, neither solution converged, and hence the improvement in fit that might have resulted from fitting multifactor models could not be examined.

TABLE 5-6 Summary of Confirmatory Factor Analysis Results

Block   Content Areas   Item Types   GFI/AGFI
S3      P               CR           .95/.89
S4      E, L, P         CR, MC       .95/.92
S5      E               CR           .84/.72
S6      L, P            CR           .99/.99
S7      E               CR, MC       .99/.98
S8      L               CR, MC       .99/.99
S9      L               CR, MC       .99/.99
S10     E, L, P         CR, MC       .99/.98
S11     E, L, P         CR, MC       .98/.98
S12     E, L, P         CR, MC       .98/.98
S13     E, L, P         CR, MC       .99/.96
S14     E, L, P         CR, MC       .91/.89
S15     E, L, P         CR, MC       .99/.98
S20     E, L, P         CR, MC       .99/.98
S21     E, L, P         CR, MC       .99/.98

Notes: E = earth science, L = life science, P = physical science; CR = constructed-response; MC = multiple-choice.
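GFI and AGFI can be computed from the sample and model-implied covariance matrices. A sketch using the standard ML-based formulas; the one-factor loadings below are hypothetical, not estimates from the NAEP blocks:

```python
import numpy as np

def gfi_agfi(S, Sigma, n_free):
    """ML-form goodness-of-fit indices for a covariance structure model:
    S is the sample covariance matrix, Sigma the model-implied matrix,
    n_free the number of free parameters."""
    p = S.shape[0]
    W = np.linalg.solve(Sigma, S)  # Sigma^{-1} S
    gfi = 1 - np.trace((W - np.eye(p)) @ (W - np.eye(p))) / np.trace(W @ W)
    df = p * (p + 1) / 2 - n_free  # degrees of freedom
    agfi = 1 - (p * (p + 1) / (2 * df)) * (1 - gfi)
    return gfi, agfi

# One-factor model with hypothetical loadings and uniquenesses
lam = np.array([0.7, 0.6, 0.8, 0.5])
Sigma = np.outer(lam, lam) + np.diag(1 - lam ** 2)  # implied correlation matrix
S = Sigma + 0.01                                    # mildly misfitting "sample" matrix
gfi, agfi = gfi_agfi(S, Sigma, n_free=8)            # 4 loadings + 4 uniquenesses
```

AGFI penalizes GFI for model complexity, which is why it falls below the .90 criterion before GFI does for blocks such as S3 and S14.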
MDS Analysis of All Dichotomous Items
As described earlier, the PC statistic suggested by Chen and Davison (1996)
provides a formal analysis of unidimensionality of dichotomous test data using
MDS. Because of its limitation to analysis of only dichotomously scored items,
we applied the procedure to only the multiple-choice and dichotomously scored short constructed-response items. Almost half (91 of 189) of the items were scored dichotomously: 73 multiple-choice items and 18 short constructed-response items. Although the results of this analysis cannot be generalized to the
dimensionality of the complete dataset, which includes the polytomously scored
items, it does evaluate whether the 91 dichotomous items can be considered
unidimensional.
The one-dimensional MDS solution did not display adequate fit to the data (STRESS = .20, R² = .88). The item p values correlated .88 with the one-dimensional coordinates; however, they correlated .93 with the coordinates of the first dimension from the two-dimensional MDS solution. The two-dimensional solution fit the data well (STRESS = .10, R² = .97). Inspection of the item coordinates on the second dimension indicated that the four easiest items (with p values equal to or greater than .87) and the eight most difficult items (with p values equal to or less than .18) had large negative coordinates on this dimension. The item standard deviations correlated .98 with the item coordinates on dimension 2. These coordinates were unrelated to item type (multiple-choice or short constructed-response), field of science, cognitive area, or other item framework characteristics. Therefore, although the one-dimensional MDS solution did not fit these data, the second dimension appears to be a statistical artifact and not a substantive unique dimension.
The Chen-Davison procedure was also used to appraise the dimensionality of the 73 multiple-choice items. The one-dimensional model displayed adequate fit to the data (STRESS = .13, R² = .95). However, improved fit was obtained using two dimensions (STRESS = .10, R² = .96), and 10 items exhibited large negative coordinates on the second dimension. As expected, the coordinates from the one-dimensional solution and those from the first dimension of the two-dimensional solution were highly correlated with the item p values (both r's were around .99). As in the analysis of the 91 dichotomous items reported above, dimension 2 corresponded to the extremely easy or extremely difficult items.
Thus, the MDS analysis of the PC statistics for the multiple-choice items suggests
that these items are essentially unidimensional.
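The MDS fit values reported above come from scaling item-proximity data and checking how well low-dimensional coordinates reproduce the proximities. A generic sketch of classical MDS plus Kruskal's STRESS-1 (the Chen-Davison PC-based proximities are not reproduced here; the distance matrix below is a toy example built from hypothetical p values):

```python
import numpy as np

def classical_mds(D, k):
    """Classical (Torgerson) MDS: embed items in k dimensions from a
    distance matrix via double centering and eigendecomposition."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:k]           # largest k eigenvalues
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0))

def stress1(D, X):
    """Kruskal's STRESS-1 between target distances D and the distances
    reproduced by configuration X."""
    d_hat = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    iu = np.triu_indices_from(D, k=1)
    return np.sqrt(np.sum((D[iu] - d_hat[iu]) ** 2) / np.sum(D[iu] ** 2))

# Items that truly lie on a line: one dimension should fit almost perfectly
p_vals = np.array([0.2, 0.35, 0.5, 0.65, 0.8])   # hypothetical item difficulties
D = np.abs(p_vals[:, None] - p_vals[None, :])    # distances from p-value gaps
X1 = classical_mds(D, k=1)
```

For this collinear toy configuration, STRESS-1 of the one-dimensional solution is essentially zero; nonzero STRESS, as in the values reported above, quantifies how much of the proximity structure the chosen dimensionality cannot reproduce.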
DISCUSSION
This study involved several different data analytic strategies for evaluating
the dimensionality of the grade 8 NAEP science item response data. Some
consistencies were observed across these analyses. For the most part, unidimensional models displayed adequate fit to the data. When multidimensionality was observed, it was generally linked to a few items in a block. The analyses most supportive of the unidimensionality of the data were the FA, the MDS analyses of the dichotomous items using the PC statistic, and the disattenuated field of science
raw score correlations. The PCA and the POLYFIT analyses identified some
booklets or blocks that were not fit well using a unidimensional model. The
observed multidimensionality was not linked to differences among the fields of
science or other content characteristics of the items. However, the POLYFIT
results indicated poorest fit for the dichotomously scored constructed-response items from the three hands-on task blocks, as well as for all of the dichotomously scored items from block S8.
The results of the theta-based correlations are difficult to interpret. The correlations observed among the "mixed" blocks (blocks comprising items from all three fields of science) were larger than those observed among blocks comprising items from a single field of science. This finding could be taken as evidence of multidimensionality in the data resulting from the field of science designations of the items. However, there were only four blocks of items comprising items from a single field of science, and the residuals from IRT models fit to these blocks were larger than the residuals from IRT models fit to mixed blocks (see Table 5-5). Therefore, it is difficult to conclude that these lower correlations are due to field of science content distinctions. It is worth reiterating that the block-level IRT calibrations we conducted differ from the field-of-science-specific IRT-derived scale scores used in the operational scoring of NAEP. Thus, the nature of the different "proficiencies" (i.e., thetas) resulting from our block-level calibrations is unknown. In general, however, the relatively high correlations observed among the mixed blocks suggest that the fields of science are highly related.
Although not explicitly explored in this study, a potential cause of the small degree of multidimensionality observed is "local item dependence" (Sireci et al., 1991; Chen and Thissen, 1997; Yen, 1993). If students' responses to one item are determined in part by their responses to another item (e.g., as in a multistep problem), this inter-item dependence could show up as multidimensionality. Because local item dependence violates the conditional independence assumption of IRT, it could affect the plausible theta values computed for students and the NAEP scale values computed for groups. Thus, evaluating the fit of items
likely to be locally dependent is an important area for future research. Unfortunately, we did not have access to the text of the actual items and so were unable to determine whether some of the larger IRT model residuals or aberrant factor loadings were due to local item dependence.
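One common way to screen item pairs for local dependence is Yen's Q3 statistic (see Yen, 1993): correlate, across examinees, the two items' residuals from the fitted IRT model. A sketch with simulated data in which the second item depends on the first beyond theta, as in a multistep problem (all parameters hypothetical):

```python
import numpy as np

def q3(u_i, u_j, p_i, p_j):
    """Yen's Q3 index of local item dependence: the correlation, across
    examinees, of the two items' residuals (observed score minus
    IRT-expected score). Values far from zero suggest dependence."""
    return np.corrcoef(np.asarray(u_i) - np.asarray(p_i),
                       np.asarray(u_j) - np.asarray(p_j))[0, 1]

# Simulated responses: hypothetical Rasch-type items
rng = np.random.default_rng(1)
theta = rng.normal(size=5000)
p1 = 1.0 / (1.0 + np.exp(-(theta - 0.2)))
p2 = 1.0 / (1.0 + np.exp(-(theta + 0.2)))
u1 = (rng.random(5000) < p1).astype(float)
boosted = np.where(u1 == 1.0, np.minimum(p2 + 0.2, 1.0), p2)
u2 = (rng.random(5000) < boosted).astype(float)   # locally dependent pair
u2_indep = (rng.random(5000) < p2).astype(float)  # locally independent pair
```

Here `q3(u1, u2, p1, p2)` is clearly positive, while the independent pair's Q3 hovers near zero.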
Does the scale structure specifying three separate fields of science appear
reasonable? Can the entire NAEP grade 8 science assessment be considered
unidimensional? Even after the comprehensive series of analyses performed
here, unequivocal answers to these questions cannot be provided. It appears that
many of the blocks can be considered unidimensional even though they contain a
mix of items from the three separate fields of science and a mix of multiple-choice and constructed-response items. If the three fields of science represented very different proficiencies, we would expect relatively poor fit for the mixed-item blocks in the POLYFIT and unidimensional FA analyses. However, for the most part, the blocks were fit well using a unidimensional IRT or FA model. The large disattenuated correlations observed among the field of science raw scores also argue against three separate scales. It is possible that there were too few items in each content area at the block or booklet level to uncover their uniqueness, but it is clear that these three fields of science are highly related. Therefore, reporting the assessment results on a composite score scale certainly seems appropriate. A more equivocal issue is the necessity of three separate score scales.
The results of this study suggest that it may be possible to represent the three
fields of science using a unidimensional model. If these three fields do not
represent distinct dimensions and can be calibrated onto a common scale, it is
possible that the number of items required to represent all three fields of science
could be reduced (since separate scales would not need to be calibrated). This possibility has implications for reducing the size of the item pool and consequently increasing the proportion of items taken by each student. It should be explored further because, with fewer items needed to represent general science proficiency, a simpler, more "complete" spiraling design is possible, thus reducing the necessity for the complex plausible values scaling methodology. For example, Mislevy et al. (1992) compared plausible values estimation methodology with (unconditional) maximum likelihood estimation. Their results suggest that, with a sufficient number of items (e.g., 20 or 30), the two procedures provide comparable results (Sireci, 1997). On average, each student who took the 1996 NAEP grade 8 science assessment responded to about 36 items. Thus, an area of future research is evaluation of the utility of the separate field of science subscores with respect to information gained beyond the composite score.
Given the strong relationships among the fields exhibited in this study, if the
grade 8 NAEP science results continue to be aggregated and reported only at the
group level, it is unlikely that subscores will provide unique diagnostic
information.
The results from this study are consistent with those of Zhang (1997), who
analyzed two of the grade 8 science blocks (S14 and S21) using "theoretical DETECT" and concluded that these mixed blocks were essentially unidimensional. In the current study, block S21 displayed adequate fit to both a unidimensional IRT model and the one-factor FA model. Block S14 could not be evaluated using POLYFIT but displayed close fit to the one-factor FA model. These two blocks contained a mix of multiple-choice and constructed-response items and items from all three fields of science, making them good candidates for discovering multidimensionality. The fact that both the Zhang study and the present study supported the unidimensionality of these blocks suggests that a unidimensional scale could be used to represent all three fields and that the different item types are measuring the same proficiency. However, the present study looked at all 15 blocks, and two areas of concern were noted: (1) relatively poor fit to an IRT model for three of the four hands-on tasks analyzed using POLYFIT and (2) relatively poorer fit for those constructed-response items that were scored dichotomously. Whether these observations reflect real item-type differences or are specific to a small number of items from the much larger pool should be determined from future research.
It is important to bear in mind that this study only analyzed data from the
1996 grade 8 NAEP science assessment. Thus, the results may not generalize to
the science assessments administered at other grade levels, to other subject tests
in the NAEP battery, or to other NAEP tests administered in different years.
ACKNOWLEDGMENTS
The authors thank Karen Mitchell and Lee Jones for their invaluable assistance with this research; James Carlson, Al Rogers, and Steve Szyszkiewicz for
providing the data; and Nambury Raju and an anonymous reviewer for their
helpful comments on an early version of this paper.
REFERENCES
Brennan, R.L.
1998 Misconceptions at the intersection of measurement theory and practice. Educational Measurement: Issues and Practice 17(1):5-9, 30.
Chen, T., and M.L. Davison
1996 A multidimensional scaling, paired comparisons approach to assessing unidimensionality
in the Rasch model. In Objective Measurement: Theory into Practice, vol. 3, G. Engelhard
and M. Wilson, eds. Norwood, N.J.: Ablex.
Chen, W.H., and D. Thissen
1997 Local dependence indices for item pairs using item response theory. Journal of Educational and Behavioral Statistics 22:265-289.
Joreskog, K.G., and D. Sorbom
1993 LISREL 8 User's Reference Guide. Mooresville, Ind.: Scientific Software.
McDonald, R.P.
1967 Nonlinear factor analysis. Psychometrika Monograph Supplement No. 15.
Mislevy, R.J., A.E. Beaton, B. Kaplan, and K.M. Sheehan
1992 Estimating population characteristics from sparse matrix samples of item responses. Journal of Educational Measurement 29:133-161.
Muraki, E.
1992 A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement 16:159-176.
National Assessment Governing Board (NAGB)
1996 Science Framework for the 1996 National Assessment of Educational Progress. Washington, D.C.: NAGB.
Novick, M.R.
1966 The axioms and principal results of classical test theory. Journal of Mathematical Psychology 3:1-18.
Rogers, H.J.
1996 POLYFIT. Unpublished computer program, Teachers College, Columbia University,
New York, N.Y.
Samejima, F.
1969 Estimation of latent ability using a response pattern of graded scores. Psychometrika
Monograph Supplement 4(Part 2):Whole #17.
Sireci, S.G.
1997 Dimensionality Issues Related to the National Assessment of Educational Progress. Commissioned paper by the National Academy of Sciences/National Research Council's Committee on the Evaluation of National and State Assessments of Educational Progress. Washington, D.C.: National Research Council.
Sireci, S.G., D. Thissen, and H. Wainer
1991 On the reliability of testlet-based tests. Journal of Educational Measurement 28:237-247.
Tanaka, J.S., and G.J. Huba
1985 A fit index for covariance structure models under arbitrary GLS estimation. British Journal of Mathematical and Statistical Psychology 38:197-201.
Thissen, D.
1991 MULTILOG: Multiple Categorical Item Analysis and Test Scoring Using Item Response
Theory, Version 6. Computer program. Mooresville, Ind.: Scientific Software.
Yen, W.M.
1993 Scaling performance assessments: Strategies for managing local item dependence. Journal of Educational Measurement 30:187-214.
Zhang, J.
1997 A New Approach for Assessing the Dimensionality of NAEP Data. Paper presented at
the annual meeting of the American Educational Research Association, Chicago, March.