
Suggested Citation:"5 Appraising the Dimensionality of the 1996 NAEP Science Assessment Data." National Research Council. 2000. Grading the Nation's Report Card: Research from the Evaluation of NAEP. Washington, DC: The National Academies Press. doi: 10.17226/9751.



Appraising the Dimensionality of the 1996 Grade 8 NAEP Science Assessment Data

Stephen G. Sireci, H. Jane Rogers, Hariharan Swaminathan, Kevin Meara, and Frédéric Robin

The science assessment of the 1996 National Assessment of Educational Progress (NAEP) represents significant advances in large-scale assessment. In particular, this assessment featured carefully constructed "hands-on" performance tasks considered to better measure real-world science knowledge and skills. Furthermore, like the other subject tests in the NAEP battery, these assessments used comprehensive and innovative sampling, scoring, and scaling procedures. To document the science knowledge and skills of our nation's students, great care was taken in operationally defining the science domains to be measured on the assessment. For the 1996 grade 8 science assessment, which is the focus of this paper, three separate score scales were derived for three separate fields of science: earth science, life science, and physical science.

The purposes of the research presented here were to evaluate the structure of the item response data gathered in the 1996 NAEP science assessment and compare this structure to the one specified in the framework that governed the test development process. The dimensions composing this framework are described in detail by the National Assessment Governing Board (1996) as well as by Sireci et al. (Chapter 4, this volume). In brief, the framework specified four dimensions: "fields of science" (a content dimension), "ways of knowing and doing science" (a cognitive skill dimension), "themes of science," and "nature of science." The first dimension is particularly important for evaluating the structure of the assessment data because each item in the assessment was linked to one of the three fields of science, and separate score scales were derived for each field. Thus, this first dimension was influential in determining how the test booklets were constructed, how the booklets were spiraled during test administration, and how the scores were derived to report the results.

The word dimension as used in the NAEP frameworks refers to theoretical components that provide a structure for describing what NAEP tests and items measure. However, dimension has several different meanings in the psychometric literature (Brennan, 1998). For example, a dimension could be defined statistically as a latent variable that best accounts for covariation among test items. Sireci (1997) points out that these two different conceptualizations of test dimensionality should be related to one another. In summarizing dimensionality issues related to NAEP, he concludes that "there is an absence of research relating the theoretical dimensions specified in the content frameworks to the empirical dimensions arising from analysis of item response data" (p. i).

The present study represents an assessment of the dimensionality of the 1996 grade 8 NAEP science assessment data. The purposes motivating this research are straightforward and specific. In scaling the data for this assessment, the contractor (Educational Testing Service, ETS) used unidimensional item response theory (IRT) models fit separately to each of the three fields of science. Thus, the intended structure of this assessment comprised three unidimensional scales, one each for the earth, life, and physical sciences. The analyses carried out were aimed at evaluating whether the observed item responses conformed to this intended structure. A further purpose of the analyses was to determine if systematic sources of multidimensionality (that would threaten the validity of the IRT scaling procedure) were present in these data. These analyses were aimed at gathering critical evidence for evaluating the validity of inferences derived from the NAEP scores.
METHOD

Data

A comprehensive set of analyses was performed on the data obtained from the 1996 grade 8 NAEP science assessment. Item response data (test data) were available for 11,273 students. The item pool comprised 189 items partitioned into 15 blocks. Each student responded to three blocks of test items, one of which comprised items associated with one of the four hands-on tasks. These data were provided by ETS and were the same data used for scoring, scaling, and reporting the results. A description of these blocks in terms of the item types, item content and cognitive specifications, and sample sizes is presented in Table 5-1.

The results for the 1996 grade 8 NAEP science assessment were reported on a composite score scale, which was a weighted composite of the three fields of science scales. Thus, there are four score scales of interest in evaluating the assessment: the composite score scale and the earth, physical, and life sciences scales.

TABLE 5-1 Composition of Item Blocks on Grade 8 NAEP Science Assessment

Number of Items Categorized as: Field of Science (Earth, Life, Physical); Ways of Knowing (CU, PR, SI); Item Format (MC, CR); Hands on?; N

S3 6 1 5 6 Yes 2,961
S4 5 1 3 3 1 5 3 6 Yes 2,739
S5 7 4 3 7 Yes 2,711
S6 2 4 4 2 6 Yes 2,861
S7 12 7 2 3 2 10 No 2,401
S8 10 9 1 5 5 No 2,424
S9 13 10 3 3 10 No 2,401
S10 6 6 4 10 3 3 8 8 No 1,784
S11 7 2 7 8 6 2 8 8 No 1,797
S12 3 7 6 11 2 3 8 8 No 1,806
S13 6 4 5 7 4 4 8 7 No 1,947
S14 3 5 8 13 3 7 9 No 2,412
S15 4 5 6 8 3 4 6 9 No 1,836
S20 6 4 6 8 7 1 8 8 No 1,939
S21 3 6 7 10 5 1 7 9 No 1,797
Total: 62 65 62 108 43 36 73 116

Correlational Analyses

Raw Score Correlations

For each test booklet, correlations were computed among the raw scores for the field of science content areas. These raw scores were computed by summing the item scores for those items in a booklet that corresponded to the same content area. These correlations based on the raw score metric are not representative of the scaled scores for each field of science derived using IRT. However, the correlations do provide a preliminary and straightforward indication of the similarities among the three fields of science. High correlations among these subscores (e.g., .9 or higher) would provide evidence that the same proficiencies are being measured by the respective fields of science. On the other hand, moderate correlations would suggest that more unique proficiencies were being measured. Both raw correlations and correlations corrected for unreliability were examined. To obtain disattenuated correlations among the subscores, coefficient alpha reliabilities were used.
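The correction for unreliability described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the `cap` argument, and the numeric values are hypothetical, and the truncation at 1.0 follows the footnote that accompanies the raw score results.

```python
import numpy as np


def coefficient_alpha(item_scores):
    """Coefficient alpha for an examinees-by-items matrix of item scores."""
    x = np.asarray(item_scores, dtype=float)
    n_items = x.shape[1]
    sum_item_variances = x.var(axis=0, ddof=1).sum()
    total_score_variance = x.sum(axis=1).var(ddof=1)
    return (n_items / (n_items - 1)) * (1.0 - sum_item_variances / total_score_variance)


def disattenuate(r_xy, rel_x, rel_y, cap=1.0):
    """Divide an observed correlation by the square root of the product of the
    two reliability estimates, truncating the result at `cap`."""
    return float(min(r_xy / np.sqrt(rel_x * rel_y), cap))


# Hypothetical values chosen only for illustration (not taken from the study).
print(round(disattenuate(0.69, 0.70, 0.70), 3))
```

Because alpha tends to underestimate reliability, the corrected value can exceed one, which is why the sketch caps it.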

IRT-Derived Theta Correlations

While correlations among the raw subscores present useful information regarding the structure of the data, correlations among the IRT-derived ability, or theta, scores may be more appropriate for examining the dimensionality of the scaled scores since NAEP analyses are based on these derived scores. To determine these ability scores, the items comprising each block were calibrated separately using IRT (see description below). Subsequently, these item parameters were used to compute a "block proficiency estimate" (block theta estimate) for each student. Because each student responded to three item blocks, three separate theta estimates were computed. The correlations among these theta estimates were compared with the content composition of each block. The logic motivating this analysis was that if high (disattenuated) correlations were observed among blocks that measured more than one field of science, evidence would be obtained that the three fields were measuring one general dimension of science proficiency. On the other hand, if the correlations based on blocks measuring different fields of science were substantially lower than those measuring the same field of science, evidence of relatively unique dimensions measured by each field would be obtained. Thus, we were interested in both the magnitude and the pattern of these correlations.

The theta correlations were disattenuated (corrected for unreliability) using the marginal reliabilities estimated in the calibration of each block. As described below, the sample sizes per block were sufficient for estimating individual student thetas. However, our block-level scaling treated each block as if it measured a single latent trait, thus ignoring the explicit scaling structure used in the operational NAEP scaling. Another potential limitation of this analysis is that there may be too few items per field within a block to provide unique variance associated with that field. Nevertheless, inspecting these correlations with the expectations described above provided a different lens through which to view the idea of composite and separate science proficiency scales.

Principal Components Analysis

As a preliminary check on dimensionality, data from four test booklets were analyzed using principal components analysis (PCA). PCA could not be used to simultaneously evaluate the dimensionality of the whole set of 189 items because of the balanced incomplete block (BIB) spiral design. The four booklets chosen (numbers 209, 210, 231, and 232) involved 12 of the 15 blocks (152 of the 189 items) and included all four hands-on tasks. Separate PCAs were conducted on each booklet. The eigenvalues associated with the extracted components and the percentages of variance in the item data accounted for by these components were used to evaluate the dimensionality of each booklet.
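The eigenvalue and percentage-of-variance summary described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name `pca_eigen_summary` is hypothetical, the PCA is assumed to be carried out on the inter-item correlation matrix, and the data are simulated continuous scores rather than NAEP item responses.

```python
import numpy as np


def pca_eigen_summary(item_scores):
    """Eigenvalues of the inter-item correlation matrix (largest first) and the
    percentage of total variance each component accounts for."""
    x = np.asarray(item_scores, dtype=float)
    corr = np.corrcoef(x, rowvar=False)
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]  # eigvalsh returns ascending order
    percent_variance = 100.0 * eigenvalues / eigenvalues.sum()
    return eigenvalues, percent_variance


# Simulated stand-in for one booklet: 280 examinees, 10 items driven by one trait.
rng = np.random.default_rng(0)
trait = rng.normal(size=(280, 1))
scores = 0.7 * trait + rng.normal(scale=0.7, size=(280, 10))
eigenvalues, percent_variance = pca_eigen_summary(scores)
print(eigenvalues[:2], percent_variance[:2])
```

For a correlation matrix the eigenvalues sum to the number of items, so each percentage is simply the eigenvalue divided by the item count.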

IRT Residual Analyses

The fit of IRT models to the data was evaluated directly by calibrating each block using a unidimensional IRT model. The decision to calibrate each block separately was motivated by sample size considerations (i.e., the booklet-level sample sizes were too small for IRT scaling) and the presence of large blocks of incomplete data in the student-by-item matrix (11,273 students by 189 items). Because each student responded to only about 36 items on average, the entire pool could not be calibrated concurrently due to the inability to properly estimate the interitem covariance matrix. In the operational scaling of NAEP, this problem is overcome by using the plausible values methodology (i.e., by conditioning calibration on a comprehensive vector of covariates derived from student background variables; see Mislevy et al., 1992, for more complete details of the NAEP scaling methodology). This conditioning was not possible given the time and software limitations of this study. Thus, these block-specific calibrations evaluated model-data fit in a manner independent of the plausible values methodology. If the data comprising a block are essentially unidimensional, these IRT calibrations should exhibit good fit to the data. As can be seen from Table 5-1, the sample sizes were appropriate for calibrating each block using an IRT model. The smallest sample size was 1,784, and the largest number of parameters estimated in any of the calibrations was 49.

All IRT calibrations were conducted using the computer program MULTILOG, version 6.1 (Thissen, 1991). The multiple-choice items were calibrated using a three-parameter IRT model (3P),¹ and short constructed-response items that were scored dichotomously were calibrated using a two-parameter IRT model (2P). These models were identical to those used by ETS in calibrating these same items.
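For readers less familiar with the 3P and 2P models, the standard logistic form of their item response functions can be sketched as below. The function name, the `scale` argument, and the parameter values are illustrative assumptions; the sketch does not reproduce MULTILOG's internal parameterization.

```python
import math


def prob_correct(theta, a, b, c=0.0, scale=1.0):
    """Three-parameter logistic probability of a correct response.

    a is the slope (discrimination), b the location (difficulty), and c the
    lower asymptote. Setting c = 0 gives the two-parameter model. `scale` is an
    optional scaling constant (some programs use 1.7); 1.0 is assumed here.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-scale * a * (theta - b)))


# Hypothetical four-option multiple-choice item, with c set to the reciprocal
# of the number of response options (1/4).
print(prob_correct(theta=0.0, a=1.0, b=0.0, c=0.25))
# The same item under the 2P model (no lower asymptote).
print(prob_correct(theta=0.0, a=1.0, b=0.0))
```

At theta equal to the item location, the 2P probability is .5 and the 3P probability is halfway between c and 1.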
For the constructed-response items that were scored polytomously (i.e., a student could earn a score greater than one), Samejima's (1969) graded response (GR) model was used. The GR model is similar but not equivalent to the Generalized Partial Credit (GPC) model (Muraki, 1992) used by ETS to calibrate these items. In both the GPC and the GR models, a common slope (discrimination) parameter is assumed for the response functions of each item score category while separate threshold (location) parameters are assumed for each score category. However, because of the dependency that exists among the threshold parameters (i.e., the choice of the first k-1 categories determines whether an examinee chooses the last category), the number of location parameters for an item is one less than the number of response categories. For example, a constructed-response item scored from zero to three (i.e., four response categories) is modeled using four parameters: one slope parameter common to all score categories and three location parameters. Although calibrating the polytomously scored items with the GR model using MULTILOG differs from the GPC model fitted using PARSCALE (which was used by ETS), the effects of this difference are considered to be minimal given the purpose of the analyses (i.e., to determine departure of the response data from unidimensionality). MULTILOG was used in this study because the modified version of PARSCALE used by ETS to calibrate the NAEP items was not publicly available.

¹ For the 3P models, priors were used on the c parameters, where the prior was equivalent to the reciprocal of the number of response options for each item. The effect of these priors was evaluated by also calibrating the items without the priors. The results were very similar, which was not surprising given the relatively large sample sizes.

To evaluate IRT model-data fit, a residual analysis was performed using the program POLYFIT (Rogers, 1996). The POLYFIT program uses the item and person parameter estimates obtained from MULTILOG to compute the expected score for examinees at a given proficiency (theta) level. These expected scores are compared with the corresponding average observed scores and residuals are computed. Specifically, the group of examinees is divided into 12 equal theta intervals constructed in the range (mean theta ± 3 standard deviations), with interval width equal to .5 standard deviations. The midpoint of each interval is used to calculate the expected score in that interval, $\sum_k k P(k)$, where k is the category score and P(k) is the probability that an individual with given theta will score in category k. The difference between the average observed score and the expected score in each interval is computed. This residual is then standardized by dividing by the standard error, which is obtained from the standard deviation of the discrete random variable, $\sqrt{\sum_k k^2 P(k) - \left[\sum_k k P(k)\right]^2}$, where k and P(k) are defined above. The standardized residuals computed at the score level are analogous to those routinely computed for dichotomous IRT models by comparing observed and expected proportions correct.
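The residual computation described above can be sketched as follows. This is not POLYFIT itself: the function names and parameter values are hypothetical, the category probabilities use a plain logistic graded response form, and the standard error is assumed to be the category-score standard deviation divided by the square root of the number of examinees in the interval.

```python
import numpy as np


def theta_interval_midpoints(mean_theta, sd_theta):
    """Midpoints of 12 intervals of width .5 SD spanning mean theta +/- 3 SD."""
    edges = mean_theta + sd_theta * np.linspace(-3.0, 3.0, 13)
    return (edges[:-1] + edges[1:]) / 2.0


def graded_response_probs(theta, slope, thresholds):
    """Category probabilities P(k), k = 0..m, under a graded response model
    with one common slope and m increasing threshold (location) parameters."""
    b = np.asarray(thresholds, dtype=float)
    cumulative = 1.0 / (1.0 + np.exp(-slope * (theta - b)))  # P(score >= k), k = 1..m
    upper = np.concatenate(([1.0], cumulative))
    lower = np.concatenate((cumulative, [0.0]))
    return upper - lower


def expected_score_and_sd(probs):
    """Mean and standard deviation of the category score k with probabilities P(k)."""
    k = np.arange(len(probs))
    mean = np.sum(k * probs)
    variance = np.sum(k ** 2 * probs) - mean ** 2
    return mean, np.sqrt(variance)


def standardized_residual(observed_mean, n_examinees, probs):
    """(observed mean - expected score) / standard error; the standard error is
    assumed here to be SD / sqrt(n) for the examinees in the interval."""
    expected, sd = expected_score_and_sd(probs)
    return (observed_mean - expected) / (sd / np.sqrt(n_examinees))


# Hypothetical item scored 0-3: one slope and three location parameters.
midpoints = theta_interval_midpoints(0.0, 1.0)
probs = graded_response_probs(theta=0.0, slope=1.0, thresholds=[-1.0, 0.0, 1.0])
expected, sd = expected_score_and_sd(probs)
print(expected, standardized_residual(observed_mean=1.6, n_examinees=50, probs=probs))
```

With thresholds symmetric about theta, the expected score falls at the midpoint of the 0-3 score range, which is a convenient sanity check on the category probabilities.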
Standardized residuals are reported only for cells with a frequency of 10 or more. The standardized residuals may be examined for each item to assess the fit of individual items. In addition, a frequency distribution of the standardized residuals over items is provided as a summary of the overall fit of the model. When the model fits the data, the distribution of standardized residuals should be symmetric with a mean close to zero. While there is no theory to show that the residuals are normally distributed when the model fits the data, it is reasonable to expect a roughly normal distribution, with few standardized residuals with an absolute value greater than three.

A chi-square statistic is also calculated using observed and expected frequencies of examinees in each score category. Expected frequencies are obtained by calculating the probability that an examinee at the midpoint of each theta interval would score in each response category. Results of the chi-square analysis should be interpreted with caution. The chi-square statistic is at best only approximately distributed as a chi-square; it has the usual failings of IRT chi-square fit statistics in that it is sensitive to sample size, the arbitrary nature of the theta intervals, and heterogeneity in the theta levels of examinees grouped in the same interval. Hence, it should be used only descriptively and the significance level ignored.

It was not hypothesized that all of the 15 blocks could be fit adequately using a single unidimensional scale. In fact, such a hypothesis is contrary to the scaling models used to score the assessment. As seen in Table 5-1, not all blocks comprise items from a single field of science. Those blocks that do comprise items from a single field of science (blocks S3, S5, S7, S8, and S9) should conform to a unidimensional scale (i.e., exhibit relatively small, normally distributed standardized residuals). Conversely, those blocks containing items from more than one field may, unsurprisingly, depart from unidimensionality (i.e., exhibit relatively larger, nonnormally distributed standardized residuals). Thus, the hypotheses motivating our block calibration/residual analysis evaluations involved comparing the results of the residual analyses with the a priori expectations of dimensionality given the content-area designations of the items composing a given block. More specifically, if blocks containing items from only one field of science exhibited small residuals and the blocks containing items from two or three fields exhibited larger residuals, evidence of three separate scales corresponding to the three fields of science specified in the framework would be obtained.

Factor Analyses

Factor analyses (FAs) were also conducted to evaluate the dimensionality of each block.
Evaluation of the dimensionality of each block using FA provides an independent assessment of dimensionality from that obtained by assessing the fit of a unidimensional IRT model to the data. It should be pointed out that FA is appropriate only when the relationship between item responses and the underlying trait is linear. When the relationships between the item responses and the underlying traits are nonlinear, procedures based on nonlinear factor models are necessary. Item response theory is an example of a nonlinear factor analysis procedure and is the procedure of choice for evaluating the dimensionality of nonlinear data. The problem is that currently only unidimensional IRT models (for dichotomous and polytomous responses) have been developed with commercially available software. Multidimensional IRT models have been proposed, but these do not have the necessary software for data analysis. One exception is the nonlinear factor analysis procedure developed by McDonald (1967) in which nonlinear trace lines are approximated by polynomials. The computer program NOHARM implements this procedure; however, the program is not designed to handle polytomous data. Given these considerations, the linear factor model was used as an approximation to nonlinear models to evaluate the dimensionality of the data, especially when the hypothesis that several dimensions underlie the responses is examined.

A one-factor model was fit to the data for each block of items. If the blocks contained items from two fields, a two-factor model was fit to the data with items from each field constrained to load on two separate factors. If the blocks contained items from all three fields, a constrained three-factor model was fitted to the data. Because of constraints, the two- and three-factor models were analyzed using the confirmatory, rather than exploratory, factor analysis procedure; there is no distinction between confirmatory and exploratory procedures with the one-factor model. The analyses were carried out using the LISREL 8 computer program (Jöreskog and Sörbom, 1993). The correlation matrix analyzed was based on product moment as well as tetrachoric and polyserial correlations. When a correlation matrix based on the tetrachoric or polyserial correlations was not positive definite, the generalized least squares procedure was used rather than the maximum likelihood procedure.

The fit of the model was evaluated by examining the goodness of fit index (GFI), adjusted goodness of fit index (AGFI), and residuals rather than the likelihood ratio statistic. When the data are nonlinear, particularly when they are nonnormal, and when tetrachoric/polyserial correlations are used, the likelihood ratio statistic is unreliable. The GFI and the AGFI provide adequate assessments of dimensionality (Tanaka and Huba, 1985). For this study, values of GFI and AGFI greater than .90 were taken as indications of adequate fit of the model.

Multidimensional Scaling Analyses

Multidimensional scaling (MDS) was used to evaluate the dimensionality of all the dichotomously scored items.
These analyses followed the unidimensionality testing procedure developed by Chen and Davison (1996). This procedure involves computing pseudo-paired comparison (PC) statistics that represent the similarity between two dichotomously scored items, as determined from examinees' performance on the items. Given this restriction, the MDS analyses were conducted using only the multiple-choice items and those short constructed-response items that were also scored dichotomously. Chen and Davison recommend fitting one- and two-dimensional MDS models to the matrix of item PC statistics and comparing the results. If the one-dimensional model fits the data well, the coordinates correlate highly with the item difficulties (p values), and a C-shape or U-shape pattern is observed in two dimensions (suggesting overfitting the data), the data can be considered unidimensional. This comparison is qualitative, rather than relying on a statistical index. Two descriptive fit indices were used to evaluate fit of the MDS models to the data: STRESS and R². The STRESS index represents the square root of the normalized residual variance of the monotonic regression of the MDS distances on the transformed PC statistics. Thus, lower values of STRESS indicate better fit. The R² index reflects the proportion of variance of the transformed data accounted for by the MDS distances. Thus, higher values of R² indicate better fit. In general, STRESS values near or below .10 and R² values of .90 or greater are indicative of reasonable data-model fit. There were 91 dichotomously scored items (73 multiple-choice and 18 short constructed-response items) analyzed using MDS.

RESULTS

Principal Components Analysis

As mentioned above, data from four test booklets were analyzed using principal components analysis (PCA). These four booklets (booklets 209, 210, 231, and 232) involved 12 of the 15 blocks (152 of the 189 items) and included all four hands-on tasks. Separate PCAs were conducted on each booklet. The number of items composing each booklet ranged from 33 to 40. The sample sizes for each booklet were approximately the same, ranging from 274 to 284.

Booklet 209 comprised 38 items from blocks S3, S11, and S12: 10 earth, 9 life, and 19 physical science items (16 multiple-choice and 22 constructed-response items). The first principal component (eigenvalue = 12.4) accounted for 33 percent of the variance. However, the second factor was also relatively large (eigenvalue = 5.8) and accounted for 15 percent of the variance. Inspection of the unrotated component (factor) loading matrix revealed 10 items with loadings below .3 on the first factor. These items came from different blocks and content areas, but all were constructed-response items. (Five of these items had loadings larger than .30 on the second factor.) The scree plot for booklet 209 is presented in Figure 5-1.

Booklet 210 comprised 40 items from blocks S4, S13, and S14: 14 earth, 10 life, and 16 physical science items (18 multiple-choice and 22 constructed-response items).
The first principal component (eigenvalue = 11.0) accounted for 28 percent of the variance, and the second principal component (eigenvalue = 3.3) accounted for 9 percent of the variance. Inspection of the unrotated factor loadings revealed three items with loadings less than .3 on the first factor. Two items came from block S4; the other was from block S13. All three items were earth science items. One item was a constructed-response item from block S13; the other two were from block S4, one of which was a multiple-choice item. The scree plot for this booklet is presented in Figure 5-2.

Booklet 231 comprised 40 items from blocks S5, S10, and S21: 17 earth, 12 life, and 11 physical science items (15 multiple-choice and 25 constructed-response items). The first principal component (eigenvalue = 17.0) accounted for 45 percent of the variance and the second principal component (eigenvalue = 4.0) for 11 percent. Inspection of the unrotated factor loadings revealed five items with loadings less than .3 on the first factor: four constructed-response earth science items from block S5 and a constructed-response life science item from block S10. Only one of the five constructed-response items had a relatively large loading on the second factor. The scree plot for this booklet is presented in Figure 5-3.

FIGURE 5-1 Scree plot from PCA for booklet 209 (horizontal axis: component number).

FIGURE 5-2 Scree plot from PCA for booklet 210 (horizontal axis: component number).

Booklet 232 comprised 33 items from blocks S6, S7, and S15: 16 earth, 7 life, and 10 physical science items (8 multiple-choice and 25 constructed-response items). The first principal component (eigenvalue = 9.3) accounted for 28 percent of the variance and the second principal component (eigenvalue = 4.9) for 15 percent. Inspection of the unrotated factor loadings revealed five items with loadings of less than .3 on the first factor: three constructed-response life science items (one from block S6 and two from block S15) and two physical science items from block S15 (one of which was a multiple-choice item). The scree plot for this booklet is presented in Figure 5-4.

Using the ratio of the percentage of variance accounted for by the first two components, booklets 210 and 231 appear to be unidimensional. The first component accounts for at least three times as much variance as the second component in each of these two booklets. A case for unidimensionality may also be made for all four booklets because of the relatively large percentage of variance accounted for by the first component (minimum 28 percent). However, a substantial proportion of variance is accounted for by the second factor underlying all four booklets (especially for booklets 209 and 232), and each booklet exhibited some items with higher loadings on a factor other than the first. Thus, the PCAs indicate a small degree of multidimensionality in these data.
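The ratio comparison in the preceding paragraph can be checked directly from the percentages of variance reported for each booklet. The short sketch below uses only those reported percentages; the variable names and the threshold of 3 are illustrative choices.

```python
# Percentages of variance reported above for the first and second components.
reported_percent = {209: (33, 15), 210: (28, 9), 231: (45, 11), 232: (28, 15)}

# Ratio of first-component to second-component variance for each booklet.
ratios = {booklet: first / second for booklet, (first, second) in reported_percent.items()}

for booklet, ratio in ratios.items():
    print(booklet, round(ratio, 1))

# Booklets in which the first component accounts for at least three times
# as much variance as the second.
dominant_first = sorted(b for b, r in ratios.items() if r >= 3)
print(dominant_first)
```

The ratios come out to roughly 2.2, 3.1, 4.1, and 1.9 for booklets 209, 210, 231, and 232, so only booklets 210 and 231 reach the threefold mark.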
This multidimensionality was not linked to content area or cognitive level, but it was noted that some of the constructed-response items had small loadings on the first factor. It should also be noted that PCA has been widely criticized for producing spurious factors when applied to test score data.

FIGURE 5-3 Scree plot from PCA for booklet 231 (horizontal axis: component number).

FIGURE 5-4 Scree plot from PCA for booklet 232 (horizontal axis: component number).

Raw Score Correlational Analysis

The relationship among the three fields of science was also evaluated at the booklet level by deriving three "content-area raw scores" for each student. The correlations among these earth, life, and physical science raw scores were then calculated. Raw scores derived from booklets containing only a few items corresponding to a field of science (specifically, those raw scores that produced a scale less than 10 points in length) were eliminated from this correlational analysis. In addition, raw scores with internal consistency (coefficient alpha) reliabilities of less than .50 were eliminated. This process resulted in 21 correlations among earth and physical science raw scores, 17 correlations among life and physical science raw scores, and 15 correlations among earth and life science raw scores.

The 21 earth-physical correlations ranged from .61 to .79, and the median correlation was .69. After disattenuation (correction for measurement error²), these correlations ranged from .83 to 1.0, and the median correlation was .99. The 17 physical-life correlations ranged from .54 to .73; the median correlation was .64. After disattenuation, these correlations ranged from .83 to 1.0, with a median correlation of .97. The 15 earth-life correlations ranged from .53 to .71, with a median correlation of .62. After disattenuation these correlations ranged from .83 to 1.0, with a median correlation of .91. The magnitudes of the median disattenuated correlations (.99, .97, and .91) suggest that the three fields of science were essentially measuring the same construct. The results of these correlations are summarized in Table 5-2.

Results Stemming from IRT Analyses

As described earlier, MULTILOG was used to calibrate each of the 15 science item blocks. Unfortunately, an unidentifiable problem, internal to MULTILOG, prevented calibration of block S7 (a block comprising 12 earth science items). Successful item calibrations were obtained for the other 14 blocks; however, we were unable to estimate thetas based on students' responses to block S14 (a block comprising eight physical, five life, and three earth science items). The marginal reliabilities for the 14 blocks that were calibrated ranged from .39 (block S6, which was a hands-on block containing four physical and two life science constructed-response items) to .80 (block S4, which was a mixed hands-on block comprising six constructed-response items and three multiple-choice items). The median marginal reliability across the 14 blocks was .75.

Correlations Among Separate (Block) Theta Estimates

As the description "balanced incomplete block spiraling design" indicates, not all of the 15 item blocks were paired with each other.
Thus, our analyses of the block-specific thetas estimated for each student included all available correlations among the blocks that were successfully calibrated and scored using MULTILOG (except block S6, which exhibited inadequate marginal reliability) and that were paired together in at least one test booklet. Each student responded to three blocks of items; thus, three separate "block" thetas were computed for each student. A total of 56 block theta correlations were computed. There were no data available for computing correlations among an earth science block and a physical science block; however, there were two correlations available for both earth-life science comparisons and life-physical science comparisons. The remaining 52 correlations involved seven correlations among an earth science block (block S5) and "mixed"-item blocks (i.e., blocks containing items from all three fields of science); 23 correlations among life science (blocks S8 or S9) and mixed-item blocks; seven correlations among a physical science block (block S3) and mixed-item blocks; and 15 correlations among mixed-item blocks. All correlations were disattenuated using the MULTILOG marginal reliability estimates.

² The disattenuated correlations were computed by dividing the raw score correlation by the square root of the product of the reliability estimates for each content-area raw score. Because the alpha coefficient is known to be an underestimate of reliability (Novick, 1966), the disattenuated correlations are overestimates and may at times be greater than one. Nine of the 53 disattenuated correlations were greater than one: six earth-physical correlations, two earth-life correlations, and one physical-life correlation. These correlations were truncated to 1.0.

TABLE 5-2 Summary of Field of Science Raw Score Correlations

Scores Correlated    Number of      Unadjusted Correlations    Disattenuated Correlations
                     Correlations   Range         Median       Range         Median
Earth and physical   21             .61 to .79    .69          .83 to 1.0    .99
Life and physical    17             .54 to .73    .64          .83 to 1.0    .97
Earth and life       15             .53 to .71    .62          .83 to 1.0    .91

A summary of the theta-based correlational analyses is presented in Table 5-3. The unadjusted correlations among thetas derived from mixed-item blocks (15 correlations) ranged from .50 to .79. After correcting for measurement error, these correlations ranged from .76 (S21, S10) to 1.00,³ with a median correlation of .87. The range and relatively large disattenuated correlations suggest that these mixed blocks, containing items from all three fields of science, were probably measuring the same general science proficiency construct. The magnitude of these correlations among the mixed-item blocks was similar to or higher than the observed marginal reliabilities for these blocks.

The unadjusted correlations for the earth-mixed comparisons ranged from .40 to .59 and after disattenuation from .58 (S5, S21) to .77 (S5, Sly; the median disattenuated correlation was .68. The two disattenuated earth-life correlations were .63 and .72.
These correlations are lower than those observed for the mixed-block correlations, leaving the door open for the conclusion that the earth science items, at least those included in block S5, do measure a slightly different construct than general science proficiency.

The unadjusted correlations for the physical-mixed comparisons ranged from .44 to .60 and after disattenuation from .63 (S3, S21) to .82 (S3, S11), with a median disattenuated correlation of .76. The two life-physical disattenuated correlations were .72 and .85. These correlations are also relatively low, suggesting that the physical science items may also be measuring a somewhat distinct domain of science proficiency.

³Actually, three of the 56 disattenuated correlations were slightly greater than one; all were from correlations of thetas derived from two mixed blocks.
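The correction for measurement error described in footnote 2 can be illustrated with a minimal Python sketch. The function name `disattenuate` and the numeric inputs are hypothetical and are not taken from the study; only the formula (the observed correlation divided by the square root of the product of the two reliability estimates) and the truncation at 1.0 follow the text.

```python
import math


def disattenuate(r_xy, rel_x, rel_y, cap=1.0):
    """Correct an observed correlation for unreliability in both scores.

    r_xy  : observed (unadjusted) correlation between two scores
    rel_x : reliability estimate for the first score
    rel_y : reliability estimate for the second score
    cap   : upper bound applied to the corrected value (1.0 in the text)
    """
    if rel_x <= 0 or rel_y <= 0:
        raise ValueError("reliability estimates must be positive")
    corrected = r_xy / math.sqrt(rel_x * rel_y)
    # Underestimated reliabilities can push the corrected value above one,
    # so the result is truncated at the cap.
    return min(corrected, cap)


# Hypothetical inputs, chosen only to illustrate the arithmetic.
print(round(disattenuate(0.69, 0.70, 0.72), 2))
# A case where the uncapped value would exceed one and is truncated.
print(disattenuate(0.79, 0.60, 0.60))
```

Because the reliability estimates sit in the denominator, any underestimate of reliability inflates the corrected correlation, which is why some values need truncating.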

TABLE 5-3 Summary of Block-Derived Theta Correlations

Types of              Number of      Unadjusted Correlations   Disattenuated Correlations
Blocks Correlated     Correlations   Range        Median       Range        Median
Mixed and mixed       15             .50 to .79   .66          .76 to 1.0   .87
Life and mixed        14             .50 to .70   .64          .73 to .94   .88
Physical and mixed    7              .44 to .60   .56          .63 to .82   .76
Earth and mixed       7              .40 to .59   .51          .58 to .77   .68

Notes: The earth and mixed correlations are between block S5 and mixed blocks; the physical and mixed are between block S3 and mixed blocks; and the life and mixed are between blocks S8 or S9 and mixed blocks. Mixed blocks contain items from all three fields of science.

The unadjusted correlations for the life-mixed comparisons ranged from .50 to .70. The disattenuated correlations ranged from .73 (S5, S8) to .94 (S9, S20), and the median correlation was .88. The magnitudes of the disattenuated correlations suggest that the life science items (blocks S8 and S9) may be more closely related to general science proficiency than the earth and physical science items.

POLYFIT Analyses

The fit of the IRT models for each block was evaluated using POLYFIT (Rogers, 1996). Distributions of the standardized residuals generated from the POLYFIT program are presented in Table 5-4. Estimates could not be obtained for four of the 15 blocks (S4, S7, S9, and S14). For blocks S10 through S21 the unidimensional IRT models appear to fit the data adequately.
The residual analyses show that most residuals are close to zero, with only a small proportion (no more than about 5 percent) falling outside the range (-3, 3).

TABLE 5-4 Distribution of Standardized Residuals for Each Block

         Theta Interval
Block    < -3     -3 to -2   -2 to -1   -1 to 0   0 to 1   1 to 2   2 to 3   > 3     Mean
S3       4.55     0          4.55       42.42     42.42    1.52     1.52     3.03    -0.05
S5       3.90     1.30       2.60       40.26     40.26    1.30     2.60     7.79    0.13
S6       1.67     1.67       5.00       33.33     48.33    3.33     1.67     5.00    0.11
S8       11.00    3.00       7.00       28.00     30.00    13.00    6.00     2.00    -0.15
S10      2.50     3.13       5.63       36.25     36.25    9.38     5.63     1.25    0.04
S11      3.13     3.75       6.88       35.00     37.50    8.75     3.75     1.25    -0.04
S12      2.78     3.47       5.56       32.64     40.97    9.72     2.78     2.08    -0.01
S13      2.96     3.70       8.89       23.70     39.26    8.89     8.89     3.70    0.11
S15      2.67     2.67       5.33       39.33     36.00    12.00    2.00     0       -0.04
S20      3.13     3.13       10.00      31.88     31.88    16.25    3.13     0.63    -0.04
S21      1.88     0.63       7.50       41.88     38.13    6.88     2.50     0.63    -0.02

Notes: Table entries are percentages of residuals falling within each interval. Estimates could not be obtained for blocks 4, 7, 9, and 14.

For blocks S3, S5, S6, and S8, the model did not fit as well. Blocks S3, S5, and S6 consist of performance tasks. These blocks have six, eight, and six items, respectively. Blocks S3 and S5 contain items from only one scale, while block S6 has four physical items and two life items. Block S8 consists of 10 items, all measuring life science.

For block S3, examination of the residuals reveals that most of the large ones were obtained from item 1. This was the only dichotomously scored item in the block, fitted using the two-parameter model. For this item the residuals tended to be negative at the low end of the proficiency continuum and positive at the high end, suggesting that the a-parameter may have been underestimated. In block S5, items 5 and 7 appeared to fit poorly. Both of these items were dichotomously scored and fitted using the two-parameter model. Item 5 showed no clear pattern in the residuals, while item 7 produced large positive residuals at the upper theta levels. In block S6, item 4 yielded poor fit. This item, again, was a dichotomously scored item, fitted using the two-parameter model. It was a very difficult item. The residuals showed the same pattern as was observed for the other poorly fitting items. For block S8 all of the dichotomously scored items (five MC, one 2P) showed some degree of misfit, with the largest (negative) residuals occurring at the low end of the proficiency continuum.

A summary of the POLYFIT analyses is presented in Table 5-5. This summary illustrates that the results were contrary to our expectations.
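The share of standardized residuals in the tails can be recomputed from the percentages reported in Table 5-4. The Python sketch below is illustrative only. It assumes that the eight percentage columns run from "below -3" to "above 3" in order; that reading is an assumption, supported by each row summing to about 100. The names `TABLE_5_4`, `outside`, and `near_zero` are hypothetical.

```python
# Percentages copied from Table 5-4. Assumed column order (an assumption):
# below -3, -3 to -2, -2 to -1, -1 to 0, 0 to 1, 1 to 2, 2 to 3, above 3.
TABLE_5_4 = {
    "S3":  [4.55, 0.00, 4.55, 42.42, 42.42, 1.52, 1.52, 3.03],
    "S5":  [3.90, 1.30, 2.60, 40.26, 40.26, 1.30, 2.60, 7.79],
    "S6":  [1.67, 1.67, 5.00, 33.33, 48.33, 3.33, 1.67, 5.00],
    "S8":  [11.00, 3.00, 7.00, 28.00, 30.00, 13.00, 6.00, 2.00],
    "S10": [2.50, 3.13, 5.63, 36.25, 36.25, 9.38, 5.63, 1.25],
    "S11": [3.13, 3.75, 6.88, 35.00, 37.50, 8.75, 3.75, 1.25],
    "S12": [2.78, 3.47, 5.56, 32.64, 40.97, 9.72, 2.78, 2.08],
    "S13": [2.96, 3.70, 8.89, 23.70, 39.26, 8.89, 8.89, 3.70],
    "S15": [2.67, 2.67, 5.33, 39.33, 36.00, 12.00, 2.00, 0.00],
    "S20": [3.13, 3.13, 10.00, 31.88, 31.88, 16.25, 3.13, 0.63],
    "S21": [1.88, 0.63, 7.50, 41.88, 38.13, 6.88, 2.50, 0.63],
}

# Percentage of residuals outside (-3, 3): first plus last column.
outside = {block: round(row[0] + row[-1], 2) for block, row in TABLE_5_4.items()}

# Percentage of residuals within (-1, 1): the two middle columns.
near_zero = {block: round(row[3] + row[4], 2) for block, row in TABLE_5_4.items()}

for block in TABLE_5_4:
    print(f"{block:>4}  outside(-3,3)={outside[block]:6.2f}  within(-1,1)={near_zero[block]:6.2f}")
```

Tabulating the two summaries side by side makes it easier to compare the single-field and hands-on blocks with the mixed blocks without rereading the full distribution.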
We expected blocks comprising items from only one field of science to be fit well by the unidimensional IRT models, and blocks comprising items from more than one field of science not to be fit well by these models. However, the opposite pattern emerged. Blocks comprising items from all three fields of science were fit adequately using IRT, and those blocks comprising items from a single field of science exhibited relatively poor fit.

TABLE 5-5 Summary of POLYFIT Results

Block   Content    Item Types     Expectation   Result         Problem Items
S3      P          6 CR           Good fit      Poor fit       1 2P item
S5      E          8 CR           Good fit      Poor fit       2 2P items
S6      L, P       6 CR           Poor fit      Poor fit       1 2P item
S8      L          5 MC, 5 CR     Good fit      Poor fit       5 MC, 1 2P item
S10     E, L, P    8 MC, 8 CR     Poor fit      Adequate fit
S11     E, L, P    8 MC, 8 CR     Poor fit      Adequate fit
S12     E, L, P    8 MC, 8 CR     Poor fit      Adequate fit
S13     E, L, P    8 MC, 8 CR     Poor fit      Adequate fit
S15     E, L, P    6 MC, 9 CR     Poor fit      Adequate fit
S20     E, L, P    8 MC, 8 CR     Poor fit      Adequate fit
S21     E, L, P    7 MC, 9 CR     Poor fit      Adequate fit

Notes: E = earth science, L = life science, P = physical science; CR = constructed-response item, MC = multiple-choice item; 2P = dichotomously scored constructed-response item.

Factor Analyses

A summary of the results of the factor analyses is presented in Table 5-6. A one-dimensional factor model was fit to each block. Fortunately, the results obtained with the Pearson product-moment correlations and the tetrachoric/polyserial correlations did not differ substantially; hence, only the results based on the product-moment correlations are provided. The goodness-of-fit indices (GFI and AGFI) were used to evaluate the model-data fit. The model-data fit was considered to be reasonable when GFI and AGFI were equal to or greater than .90. As shown in Table 5-6, 12 of the 15 blocks were adequately fit using the one-factor model, indicating that the data can be considered unidimensional. Blocks S3 and S14 came close to meeting the fit criterion: the GFI for both blocks exceeded .90, but the AGFI was .89 for both blocks. The only block that did not meet the fit criterion was block S5, which was a block of hands-on earth science items; the GFI was .84, while the AGFI was .72. The other hands-on tasks (blocks S3, S4, and S6) fitted the one-factor model adequately. Since S5 was made up of items from one content area, a multifactor model was not fit to the item responses.
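The fit criterion stated above (GFI and AGFI both equal to or greater than .90) can be applied mechanically to the values reported in Table 5-6. The following Python sketch is illustrative only; the names `FIT` and `meets_criterion` are hypothetical, and the index values are copied from the table rather than recomputed from the item response data.

```python
# (GFI, AGFI) pairs for the one-factor model, copied from Table 5-6.
FIT = {
    "S3": (0.95, 0.89), "S4": (0.95, 0.92), "S5": (0.84, 0.72),
    "S6": (0.99, 0.99), "S7": (0.99, 0.98), "S8": (0.99, 0.99),
    "S9": (0.99, 0.99), "S10": (0.99, 0.98), "S11": (0.98, 0.98),
    "S12": (0.98, 0.98), "S13": (0.99, 0.96), "S14": (0.91, 0.89),
    "S15": (0.99, 0.98), "S20": (0.99, 0.98), "S21": (0.99, 0.98),
}


def meets_criterion(gfi, agfi, cutoff=0.90):
    """Return True when both indices reach the cutoff used in the text."""
    return gfi >= cutoff and agfi >= cutoff


adequate = [block for block, (gfi, agfi) in FIT.items() if meets_criterion(gfi, agfi)]
not_adequate = [block for block in FIT if block not in adequate]

print(len(adequate), "blocks meet the criterion")
print("Blocks below the criterion on at least one index:", not_adequate)
```

Requiring both indices to reach the cutoff is what places S3 and S14 just outside the criterion, since each has a GFI above .90 but an AGFI of .89.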
Given the high fit index values obtained with the one-factor model for all of the blocks, the acceptable fit values obtained with S3 and S14, and the fact that S5 comprised items from a single content area, only the results obtained from fitting a one-factor model are presented in Table 5-6. Nevertheless, two- and three-factor confirmatory factor analyses were carried out for block S14. Unfortunately, the two- and three-factor solutions did not converge for S14, and hence the improvement in fit that may have resulted from fitting multifactor models could not be examined.

TABLE 5-6 Summary of Confirmatory Factor Analysis Results

Block   Content Areas   Item Types   GFI/AGFI
S3      P               CR           .95/.89
S4      E, L, P         CR, MC       .95/.92
S5      E               CR           .84/.72
S6      L, P            CR           .99/.99
S7      E               CR, MC       .99/.98
S8      L               CR, MC       .99/.99
S9      L               CR, MC       .99/.99
S10     E, L, P         CR, MC       .99/.98
S11     E, L, P         CR, MC       .98/.98
S12     E, L, P         CR, MC       .98/.98
S13     E, L, P         CR, MC       .99/.96
S14     E, L, P         CR, MC       .91/.89
S15     E, L, P         CR, MC       .99/.98
S20     E, L, P         CR, MC       .99/.98
S21     E, L, P         CR, MC       .99/.98

Notes: E = earth science, L = life science, P = physical science; CR = constructed-response, MC = multiple-choice.

MDS Analysis of All Dichotomous Items

As described earlier, the PC statistic suggested by Chen and Davison (1996) provides a formal analysis of unidimensionality of dichotomous test data using MDS. Because the procedure is limited to dichotomously scored items, we applied it only to the multiple-choice and dichotomously scored short constructed-response items. Almost half (91 of 189) of the items were scored dichotomously: 73 multiple-choice items and 18 short constructed-response items. Although the results of this analysis cannot be generalized to the dimensionality of the complete dataset, which includes the polytomously scored items, it does evaluate whether the 91 dichotomous items can be considered unidimensional.

The one-dimensional MDS solution did not display adequate fit to the data (STRESS = .20, R² = .88). The item p values correlated .88 with the one-dimensional coordinates; however, they correlated .93 with the coordinates of the first dimension from the two-dimensional MDS solution. The two-dimensional solution fit the data well (STRESS = .10, R² = .97). Inspection of the item coordinates on the second dimension indicated that the four easiest items (with p values equal to or greater than .87) and the eight most difficult items (with p values equal to or less than .18) had large negative coordinates on this dimension. The item standard deviations correlated .98 with the item coordinates on dimension 2.
These coordinates were unrelated to item type (multiple-choice or short constructed-response), field of science, cognitive area, or other item framework characteristics. Therefore, although the one-dimensional MDS solution did not fit these data, the second dimension appears to be a statistical artifact and not a substantive unique dimension.

The Chen-Davison procedure was also used to appraise the dimensionality of the 73 multiple-choice items. The one-dimensional model displayed adequate fit to the data (STRESS = .13, R² = .95). However, improved fit was obtained using two dimensions (STRESS = .10, R² = .96), and 10 items exhibited large negative coordinates on the second dimension. As expected, the coordinates from the one-dimensional solution and those from the first dimension of the two-dimensional solution were highly correlated with the item p values (both r's were around .99). Similar to the analysis of the 91 dichotomous items reported above, dimension 2 corresponded to the extremely easy or extremely difficult items.

Thus, the MDS analysis of the PC statistics for the multiple-choice items suggests that these items are essentially unidimensional.

DISCUSSION

This study involved several different data analytic strategies for evaluating the dimensionality of the grade 8 NAEP science item response data. Some consistencies were observed across these analyses. For the most part, unidimensional models displayed adequate fit to the data. When multidimensionality was observed, it was generally linked to a few items in a block. The analyses most supporting unidimensionality of the data were the FA, the MDS analyses of the dichotomous items using the PC statistic, and the disattenuated field of science raw score correlations. The PCA and the POLYFIT analyses identified some booklets or blocks that were not fit well using a unidimensional model. The observed multidimensionality was not linked to differences among the fields of science or other content characteristics of the items. However, the POLYFIT results indicated poorest fit for the dichotomously scored constructed-response items from the three hands-on task blocks, as well as for all of the dichotomously scored items from block S8.

The results of the theta-based correlations are difficult to interpret. The correlations observed among the "mixed" blocks (blocks comprising items from all three fields of science) were larger than those observed among blocks comprising items from a single field of science. This finding could be taken as evidence of multidimensionality in the data resulting from the field of science designations of the items. However, there were only four blocks comprising items from a single field of science, and the residuals from IRT models fit to these blocks were larger than residuals from IRT models fit to mixed blocks (see Table 5-5).
Therefore, it is difficult to conclude that these lower correlations are due to field of science content distinctions. It is worth reiterating that the block-level IRT calibrations we conducted differ from the field-of-science-specific IRT-derived scale scores used in the operational scoring of NAEP. Thus, the nature of the different "proficiencies" (i.e., thetas) resulting from our block-level calibrations is unknown. In general, however, the relatively high correlations observed among the mixed blocks suggest that the fields of science are highly related.

Although not explicitly explored in this study, a potential cause for the small degree of multidimensionality observed is "local item dependence" (Sireci et al., 1991; Chen and Thissen, 1997; Yen, 1993). If students' responses to one item are determined in part by their responses to another item (e.g., as in a multistep problem), this interitem dependence could show up as multidimensionality. Because local item dependence violates the conditional independence assumption of IRT, it could affect the plausible theta values computed for students and the NAEP scale values computed for groups. Thus, evaluating the fit of items likely to be locally dependent is an important area for future research. Unfortunately, we did not have access to the text for the actual items and so were unable to determine if some of the larger IRT model residuals or aberrant factor loadings were due to local item dependence.

Does the scale structure specifying three separate fields of science appear reasonable? Can the entire NAEP grade 8 science assessment be considered unidimensional? Even after the comprehensive series of analyses performed here, unequivocal answers to these questions cannot be provided. It appears that many of the blocks can be considered unidimensional even though they contain a mix of items from the three separate fields of science and a mix of multiple-choice and constructed-response items. If the three fields of science represented very different proficiencies, we would expect relatively poor fit for the mixed-item blocks in the POLYFIT and unidimensional FA analyses. However, for the most part, the blocks were fit well using a unidimensional IRT or an FA model. The large disattenuated correlations observed among the field of science raw scores also argue against three separate scales. It is possible that there were too few items in each content area at the block or booklet level to uncover their uniqueness, but it is clear that these three fields of science are highly related. Therefore, reporting the assessment results on a composite score scale certainly seems appropriate.

A more equivocal issue is the necessity of three separate score scales. The results of this study suggest that it may be possible to represent the three fields of science using a unidimensional model.
If these three fields do not represent distinct dimensions and can be calibrated onto a common scale, it is possible that the number of items required to represent all three fields of science could be reduced (since separate scales would not need to be calibrated). This possibility has implications for reducing the size of the item pool and consequently increasing the proportion of items taken by each student. This possibility should be explored further because with fewer items needed to represent general science proficiency a simpler, more "complete" spiraling design is possible, thus reducing the necessity for the complex plausible values scaling methodology. For example, Mislevy et al. (1992) compared plausible values estimation methodology with (unconditional) maximum likelihood estimation. Their results suggest that, with a sufficient number of items (e.g., 20 or 30), the two procedures provide comparable results (Sireci, 1997). On average, each student who took the 1996 NAEP grade 8 science assessment responded to about 36 items.

Thus, an area of future research is evaluation of the utility of the separate field of science subscores with respect to information gained beyond the composite score. Given the strong relationships among the fields exhibited in this study, if the grade 8 NAEP science results continue to be aggregated and reported only at the group level, it is unlikely that subscores will provide unique diagnostic information.

The results from this study are consistent with those of Zhang (1997), who analyzed two of the grade 8 science blocks (S14 and S21) using "theoretical DETECT" and concluded that these mixed blocks were essentially unidimensional. In the current study, block S21 displayed adequate fit to both a unidimensional IRT model and the one-factor FA model. Block S14 could not be evaluated using POLYFIT but displayed close fit to the one-factor FA model. These two blocks contained a mix of multiple-choice items and constructed-response items and items from all three fields of science, making them good candidates for discovering multidimensionality. The fact that both the Zhang study and the present study were consistent in supporting the unidimensionality of these blocks suggests that a unidimensional scale could be used to represent all three fields and that the different item types are measuring the same proficiency. However, the present study looked at all 15 blocks, and two areas of concern were noted: (1) relatively poor fit to an IRT model for three of the four hands-on tasks analyzed using POLYFIT and (2) relatively poorer fit for those constructed-response items that were scored dichotomously. Whether these observations reflect real item-type differences or are specific to a small number of items from the much larger pool should be determined by future research.

It is important to bear in mind that this study analyzed data only from the 1996 grade 8 NAEP science assessment. Thus, the results may not generalize to the science assessments administered at other grade levels, to other subject tests in the NAEP battery, or to other NAEP tests administered in different years.
ACKNOWLEDGMENTS

The authors thank Karen Mitchell and Lee Jones for their invaluable assistance with this research; James Carlson, Al Rogers, and Steve Szyszkiewicz for providing the data; and Nambury Raju and an anonymous reviewer for their helpful comments on an early version of this paper.

REFERENCES

Brennan, R.L.
  1998 Misconceptions at the intersection of measurement theory and practice. Educational Measurement: Issues and Practice 17(1):5-9, 30.
Chen, T., and M.L. Davison
  1996 A multidimensional scaling, paired comparisons approach to assessing unidimensionality in the Rasch model. In Objective Measurement: Theory into Practice, vol. 3, G. Engelhard and M. Wilson, eds. Norwood, N.J.: Ablex.
Chen, W-H., and D. Thissen
  1997 Local dependence indices for item pairs using item response theory. Journal of Educational and Behavioral Statistics 22:265-289.
Jöreskog, K.G., and D. Sörbom
  1993 LISREL-8 User's Reference Guide. Mooresville, Ind.: Scientific Software.
McDonald, R.P.
  1967 Non-linear factor analysis. Psychometrika Monograph Supplement. No. 15.
Mislevy, R.J., A.E. Beaton, B. Kaplan, and K.M. Sheehan
  1992 Estimating population characteristics from sparse matrix samples of item responses. Journal of Educational Measurement 29:133-161.
Muraki, E.
  1992 A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement 16:159-176.
National Assessment Governing Board (NAGB)
  1996 Science Framework for the 1996 National Assessment of Educational Progress. Washington, D.C.: NAGB.
Novick, M.R.
  1966 The axioms and principal results of classical test theory. Journal of Mathematical Psychology 3:1-18.
Rogers, H.J.
  1996 POLYFIT. Unpublished computer program, Teachers College, Columbia University, New York, N.Y.
Samejima, F.
  1969 Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement 4(Part 2):Whole #17.
Sireci, S.G.
  1997 Dimensionality Issues Related to the National Assessment of Educational Progress. Commissioned paper by the National Academy of Sciences/National Research Council's Committee on the Evaluation of National and State Assessments of Educational Progress. Washington, D.C.: National Research Council.
Sireci, S.G., D. Thissen, and H. Wainer
  1991 On the reliability of testlet-based tests. Journal of Educational Measurement 28:237-247.
Tanaka, J.S., and G.J. Huba
  1985 A fit index for covariance structure models under arbitrary GLS estimation. British Journal of Mathematical and Statistical Psychology 38:197-201.
Thissen, D.
  1991 MULTILOG: Multiple Categorical Item Analysis and Test Scoring Using Item Response Theory, Version 6. Computer program. Mooresville, Ind.: Scientific Software.
Yen, W.M.
  1993 Scaling performance assessments: Strategies for managing local item dependence. Journal of Educational Measurement 30:187-214.
Zhang, J.
  1997 A New Approach for Assessing the Dimensionality of NAEP Data. Paper presented at the annual meeting of the American Educational Research Association, Chicago, March.


