4 An External Evaluation of the 1996 Grade 8 NAEP Science Framework

Stephen G. Sireci, Frederic Robin, Kevin Meara, H. Jane Rogers, and Hariharan Swaminathan

The National Assessment of Educational Progress (NAEP) is the most comprehensive evaluation of the educational achievement of U.S. students in history. Laudable features of the more recent NAEP tests are their breadth in terms of the content domains measured and the manner in which students are tested. For example, on the 1996 NAEP science assessment, the focus of this paper, three "fields" of science are measured (earth, life, and physical science), and students are required to perform "hands-on" science experiments, report the results of their experiments in written form, and respond to multiple-choice questions. Thus, the structure of the current NAEP science assessment is complex.

This study examined the content validity[1] of the 1996 grade 8 NAEP science assessment to determine how well the items composing the assessment represent the framework that governed the test development process. This appraisal is important for determining whether the inferences derived from NAEP scores can be linked to the science content and skill domains the test is designed to measure. To accomplish the goals of this study, 10 carefully selected science teachers were recruited to review items from the 1996 grade 8 NAEP science assessment and provide judgments regarding the knowledge and skills measured by these items.

[1] Some measurement specialists (e.g., Messick, 1989) argue against use of the term content validity because it does not directly describe score-based inferences. Although this position has theoretical appeal, in practice content validity is a widely endorsed notion of test quality (Sireci, 1998b). Thus, the position taken here is similar to that of Ebel (1977:59), who claimed "content validity is the only basic foundation for any kind of validity.... One should never apologize for having to exercise judgment in validating a test. Data never substitute for good judgment."

These judgments were compared to the knowledge and skill domains the items were intended to measure.

OVERVIEW OF THE GRADE 8 SCIENCE ASSESSMENT FRAMEWORK

The 1996 grade 8 science assessment comprised 189 items. The intended structure of the assessment is characterized in the content frameworks, which specify four dimensions (National Assessment Governing Board, 1996). The first dimension is a content dimension comprising three separate "fields of science": earth science, life science, and physical science. The committees involved in creating the test specifications concluded that these three fields of science are sufficiently unique as to warrant separate scales. Thus, for all 1996 NAEP science assessments, the results were to be reported along four separate scales: one for each of the three fields of science and a composite score scale summarizing science proficiency across the three fields.

The second dimension of the science framework is a cognitive dimension described as "ways of knowing and doing science." There are also three components to this dimension: conceptual understanding, practical reasoning, and scientific investigation. Separate score scales are not derived for these cognitive skills; however, these skill areas were critical in defining the domains measured on the assessment and in governing the item (task) development process. Every item on a NAEP science assessment is targeted to one of the three fields of science and one of the three ways of knowing and doing science.

Only some of the items were linked to the other two dimensions underlying the content frameworks. These two dimensions are described as a "themes of science" dimension and a "nature of science" dimension. The "themes" dimension comprised three areas: patterns of change, models, and systems. The nature of science dimension comprised two areas: nature of science and nature of technology. For the grade 8 assessment, 93 items (49 percent) corresponded to a "theme" dimension and 31 items (16 percent) to a "nature" dimension. The content, cognitive, theme, and nature test specifications are presented in Table 4-1.

Another conspicuous aspect pertinent to the content structure of the assessment is the diversity of item formats used. Students were required to both read assessment material and perform hands-on scientific experiments. The item formats tied to these tasks were multiple-choice items (with two to four response options per item); short constructed-response items (where students were required to write a short answer, usually a single word or a sentence or two); and extended constructed-response items (requiring students to supply a detailed response to the item). There were 73 multiple-choice and 116 constructed-response items on the grade 8 assessment.
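Table 4-1 (below) is a cross-tabulation of these field-of-science and ways-of-knowing designations over the 189 items. As a minimal sketch of how such a specification table can be tallied from item metadata, the following Python fragment is offered; the file name and column names are hypothetical and are not part of the NAEP documentation.

```python
# Hypothetical sketch: tally a field-by-cognitive-area specification table
# (like Table 4-1) from a flat file of item metadata. Column names are assumed.
import pandas as pd

items = pd.read_csv("naep_grade8_science_items.csv")  # hypothetical file
# Assumed columns: item_id, field, cognitive_area, theme, nature

spec_table = pd.crosstab(items["field"], items["cognitive_area"], margins=True)
theme_table = pd.crosstab(items.loc[items["theme"].notna(), "field"],
                          items.loc[items["theme"].notna(), "cognitive_area"])

print(spec_table)   # counts of items in each field-by-cognitive-area cell
print(theme_table)  # counts for the subset of items carrying a theme designation
```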

TABLE 4-1 Cross-Tabulation of Item Specifications for 1996 Grade 8 NAEP Science Assessment

                                Ways of Knowing and Doing Science
Field of             Conceptual       Practical      Scientific
Science              Understanding    Reasoning      Investigation    Total
Earth science        35               13             14               62 (33%)
  (Theme)            (23)             (7)            (7)              (37)
  [Nature]           [2]              [4]            [5]              [11]
Life science         42               14             9                65 (34%)
  (Theme)            (29)             (5)            (4)              (38)
  [Nature]           [2]              [3]            [0]              [5]
Physical science     32               16             14               62 (33%)
  (Theme)            (9)              (6)            (3)              (18)
  [Nature]           [1]              [9]            [5]              [15]
Totals               109 (57.7%)      43 (22.8%)     37 (19.6%)       189
  (Theme)            (61)             (18)           (14)             (93)
  [Nature]           [5]              [16]           [10]             [31]

Note: Entries in the table are the number of items in each cell of the framework.

METHOD

Ten science teachers were recruited to scrutinize a carefully selected sample of items from the 1996 grade 8 science assessment and provide judgments regarding the content characteristics of the items. As described below, these teachers provided both ratings of the content similarities among the items and ratings linking each item to the content, cognitive, nature, and theme dimensions defined in the frameworks.

Participants

The 10 science teachers who served as the subject-matter experts (SMEs) in this study were selected by contacting the state assessment directors in states that are currently active in developing state standards and assessments in science. The teachers were nominated by their state assessment director because of their involvement in science assessment movements in their state. Three of the teachers previously served on a national working group, convened by the National Assessment Governing Board, that helped clarify the achievement-level standards set on the 1996 science assessment. Seven of the 10 SMEs were women. All had extensive experience teaching science. These SMEs represented the following

states: California, Delaware, Florida, Kentucky, Maryland, Ohio, South Carolina, Texas, Virginia, and Washington. The data from these SMEs were gathered during a two-day workshop in Washington, D.C. All SMEs received an honorarium for their participation.

Items Selected for Analysis

As noted above, 189 items comprised the grade 8 science assessment. Sixty items were selected for the purposes of this study to represent the test specifications in terms of the content and cognitive dimensions as well as item format (multiple choice, short constructed response, extended constructed response). In addition, items were selected that represented a theme or nature of science area. These items came from 9 of the 15 blocks comprising the grade 8 item pool. Item-objective congruence ratings (described below) were obtained for all 60 items. However, because of time and subject fatigue limitations, a subset of 45 of these items was chosen for the item similarity ratings (also described below). Table 4-2 presents the test specifications for the 60-item subset, and Table 4-3 presents the test specifications for the 45-item subset. A comparison of Tables 4-1 through 4-3 reveals that the percentages of items from each science field were relatively comparable across the item pool and the item subsets but that the two subsets had slightly more items measuring practical reasoning and scientific investigation.

Procedure

SME Training

Almost half (29) of the 60 NAEP items used in this study were associated with one of the four hands-on science tasks. Twelve of these 29 items were included in the similarity rating task involving the 45-item subset, and all 29 were included in the item-objective congruence rating task. Training of the SMEs began with a description of these hands-on tasks. The material kits for these tasks were presented to the SMEs, and an oral description of the experiments was provided. The descriptions focused on the tasks the students were required to complete in conducting their experiments. Next, the judges were asked to complete a block of 14 test items as if they were students being tested. After completing the items, the judges were given the answer keys and asked to check their answers. Finally, the judges were given the operational test booklet sections for the nine item blocks (i.e., all 60 items). The 45 items that were later used were highlighted. The SMEs were given time to familiarize themselves with the items and the scoring protocols.

TABLE 4-2 Cross-Tabulation of Specifications for 60-Item Subset Used in Item-Objective Congruence Study

                                Ways of Knowing and Doing Science
Field of             Conceptual       Practical      Scientific
Science              Understanding    Reasoning      Investigation    Total
Earth science        10               6              6                22 (37%)
  (Theme)            (8)              (1)            (1)              (10)
  [Nature]           [0]              [1]            [1]              [2]
Life science         10               7              4                21 (35%)
  (Theme)            (8)              (3)            (4)              (15)
  [Nature]           [0]              [1]            [3]              [4]
Physical science     7                4              6                17 (28%)
  (Theme)            (0)              (2)            (1)              (3)
  [Nature]           [1]              [2]            [3]              [6]
Totals               27 (45.0%)       17 (28.3%)     16 (26.7%)       60
  (Theme)            (16)             (6)            (6)              (28)
  [Nature]           [1]              [4]            [7]              [12]

Note: Entries in the table are the number of items in each cell of the framework.

TABLE 4-3 Cross-Tabulation of Specifications for 45-Item Subset Used in Item Similarity Rating Study

                                Ways of Knowing and Doing Science
Field of             Conceptual       Practical      Scientific
Science              Understanding    Reasoning      Investigation    Total
Earth science        9                3              3                15 (33%)
  (Theme)            (8)              (1)            (0)              (9)
  [Nature]           [0]              [1]            [0]              [1]
Life science         7                6              3                16 (36%)
  (Theme)            (0)              (3)            (3)              (6)
  [Nature]           [0]              [1]            [2]              [3]
Physical science     6                3              5                14 (31%)
  (Theme)            (1)              (1)            (1)              (3)
  [Nature]           [1]              [2]            [3]              [6]
Totals               22 (48.9%)       12 (26.7%)     11 (24.4%)       45
  (Theme)            (9)              (5)            (4)              (18)
  [Nature]           [1]              [4]            [5]              [10]

Note: Entries in the table are the number of items in each cell of the framework.

Item Similarity Ratings

Following these item familiarization steps, instructions for completing the item similarity ratings were provided. The SMEs were informed that they would be required to review pairs of NAEP items and provide a judgment regarding the similarity of the items in each pair to one another in terms of the science knowledge and skills tested. These instructions were intentionally general so that the SMEs' ratings were not influenced by anyone else's preconceived notions of what the items were measuring. Therefore, the content specifications for these items, and the content frameworks for the test, were not described to the SMEs.

To facilitate understanding of the item similarity rating task, three "practice" item pairs were distributed to the judges. The first pair involved two multiple-choice items; the second pair involved a short constructed-response item and an extended constructed-response item; and the third pair involved two extended constructed-response items. Each item pair was printed on a single page, and an eight-point similarity rating scale was printed at the bottom of each page. The numeral "1" on the scale was labeled "very similar," and the numeral "8" was labeled "very different." The SMEs rated the similarities among these three item pairs individually and then discussed the ratings as a group. The SMEs with the highest and lowest ratings for each item pair described the characteristics of the items that influenced their ratings. Common factors cited were the cognitive complexity of the item and the science content area the item was measuring.

The SMEs were told that they were on task and were each given an item similarity rating booklet. The pages of these booklets each contained one item pair, with the same eight-point rating scale printed at the bottom of each page. A sample item similarity rating page is presented in Figure 4-1. Consideration of all possible item pairings among the 45 items involved 990 item comparisons ((45 x 44)/2). Given the time constraints of the study, the judges were required to rate only 700 of these 990 possible item pairings. Ten separate booklets were created. Each booklet represented a different ordering of the item similarity pairs to control for a systematic item order effect. The 700 ratings required of each SME were selected such that for each item pair seven independent ratings would be provided. Five of the SMEs finished relatively early and completed some of the "missing" 290 ratings.

In addition to the 700 required ratings, six specific item pairs were repeated in each booklet. These repetitions were included to provide an estimate of the reliability of the SMEs' ratings. The six replicated item pairs were placed near the end of each booklet, when the deleterious effects of fatigue and boredom were most likely to be present. Thus, the error associated with the similarity ratings as measured by these replicated item pairs most likely represents a worst-case scenario. Upon completion of the item similarity ratings, the SMEs responded to a short questionnaire on which they listed the criteria they used in making the item similarity ratings.

FIGURE 4-1 Sample item similarity rating sheet (a multiple-choice item on reading a weather instrument paired with an extended constructed-response item on locating a space station between the Earth and the Moon, followed by the eight-point "very similar" to "very different" rating scale). Items are from National Center for Education Statistics, U.S. Department of Education, 1996 National Assessment of Educational Progress in Science released items; available at http://nces.ed.gov/naep.
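As a small illustration of the pairing design described above, the sketch below enumerates the (45 x 44)/2 = 990 unordered item pairs; the item labels are hypothetical, and the booklet assignment that distributed 700 ratings per SME so that each pair received seven independent ratings is not reproduced here.

```python
# Enumerate the unordered item pairs underlying the similarity rating task.
from itertools import combinations

n_items = 45
item_ids = [f"item_{i:02d}" for i in range(1, n_items + 1)]  # hypothetical labels

pairs = list(combinations(item_ids, 2))
assert len(pairs) == n_items * (n_items - 1) // 2  # 990 pairs for 45 items
```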

The questionnaire asked the SMEs how long they took to complete the similarity ratings and listed seven item characteristics that were anticipated to influence their ratings: science discipline measured by each item, cognitive level measured by each item, item format, item difficulty, item length, item themes, and historical origin of each item. Space on the questionnaire was also provided for the SMEs to add any additional criteria they used that were not included on the list.

Item-Objective Congruence Ratings

The purpose of the item similarity rating task was to obtain the SMEs' "independent" appraisal of the knowledge and skills measured by the items (i.e., independent of knowledge of the content, cognitive, nature, and theme dimensions that governed item development). In this manner it was hoped that the content specifications for these 45 items would be "recovered" rather than confirmed. Thus, the similarity rating task tested the adequacy of the dimensions underlying the framework, given the items that were developed.

For the item-objective[2] congruence ratings, the SMEs were given an oral presentation describing the NAEP science frameworks as well as the public documentation of these frameworks (NAGB, 1996). The SMEs were then presented with a new booklet that listed the item numbers for each block (60 items total) and a series of columns under which they were to provide ratings for each item. The task presented to the SMEs was to indicate their opinion regarding the "field of science," "way of knowing and doing science," "theme of science," and "nature of science" classification of each item. They were informed that each item was classified by the test developers into one of the three "fields" and into one of the three "ways" dimensions but that only some items were classified as a "nature" item or a "theme" item. These data provided a check on whether the SMEs would classify the items in a manner congruent with their test specifications. A sample item-objective congruence rating page is presented in Figure 4-2.

Exit Survey

Upon completion of the item-objective congruence ratings, the SMEs were given a brief survey. This survey asked them about their confidence in the similarity and congruence ratings they provided and asked them to provide suggestions for future research in this area. In addition, the survey asked the SMEs about their experience with science assessment standards at the local, state, and national levels and asked them to describe how well the NAEP science materials matched national, state, and local science standards.

[2] The term objective is used here in a general sense to describe the specific field of science, way of knowing and doing science, theme of science, and nature of science designations for each item.

FIGURE 4-2 Sample item-objective congruence rating sheet.

Data Analysis

The item similarity ratings were analyzed using multidimensional scaling (MDS). The purpose of MDS is to portray the similarities among objects visually, as in a map (Schiffman et al., 1981). This visual portrayal is accomplished by scaling the items along as many continuous dimensions as are necessary to adequately represent the similarity ratings. Each stimulus dimension in an MDS solution corresponds to an attribute or characteristic of the objects being scaled. The purpose of this analysis was to determine whether dimensions, such as those specified in the NAEP frameworks, would be perceived by the SMEs and whether the items would be configured in the MDS space in a manner congruent with the test specifications.

The model used was an "individual differences" or weighted MDS model. Weighted models allow for the scaling of SMEs in the same MDS space in which the items are configured. Thus, by using a weighted MDS model, similarities and differences among the SMEs, as well as among the items, could be observed. The weighted MDS model used was the INDSCAL model (Carroll and Chang, 1970) implemented in the ALSCAL procedure in SPSS, version 7.5 (Young and Harris, 1993). The distances among items and the dimensional weights for the SMEs are computed using the weighted distance formula developed by Carroll and Chang (1970).

In the INDSCAL model the similarity data for each subject are transformed to derive coordinates on dimensions that are used to scale the items in Euclidean space. The perceptual space for each subject is related to a common "group space" by weighting the dimensions of the group space separately for each subject. That is, each subject's coordinate matrix is multiplied by a vector of weights (w) consisting of elements w_{ka} that represent the relative emphasis subject k places on dimension a. The distances between stimuli are computed by incorporating this weighting factor into the Euclidean distance formula used by classical MDS. The INDSCAL model defines the distance between two objects i and j as:

    d_{ijk} = \sqrt{\sum_{a=1}^{r} w_{ka} \, (x_{ia} - x_{ja})^2}

where d_{ijk} is the Euclidean distance between points i and j for subject k, x_{ia} is the coordinate of point i on dimension a, and r is the number of specified dimensions. The INDSCAL analysis provides a multidimensional configuration of the attributes rated (the stimulus, or item space) and a multidimensional configuration of the subjects (the group, or SME space).
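The weighted distance formula is straightforward to evaluate once group-space coordinates and subject weights are in hand. The sketch below implements only this formula, not the ALSCAL estimation procedure; the array shapes and example values are illustrative assumptions (the weight vector echoes the SME #1 row of Table 4-9).

```python
# Weighted (INDSCAL) Euclidean distances: d_ijk = sqrt(sum_a w_ka * (x_ia - x_ja)^2)
import numpy as np

def indscal_distances(X: np.ndarray, w_k: np.ndarray) -> np.ndarray:
    """X: (n_items, r) group-space coordinates; w_k: (r,) weights for subject k.
    Returns an (n_items, n_items) matrix of weighted distances for subject k."""
    diff = X[:, None, :] - X[None, :, :]             # pairwise coordinate differences
    return np.sqrt((w_k * diff ** 2).sum(axis=-1))   # weight each dimension, sum, take root

# Example with arbitrary coordinates (not the NAEP estimates):
X = np.random.default_rng(0).normal(size=(45, 5))    # 45 items scaled in 5 dimensions
w_k = np.array([0.48, 0.28, 0.33, 0.33, 0.36])       # dimension weights for one subject
D_k = indscal_distances(X, w_k)
```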

To facilitate interpretation of the MDS solutions, external information on the items was analyzed together with the MDS item coordinates. These external data included item difficulties, the item-objective congruence ratings, and dichotomous "dummy variables" reflecting the item content specifications (i.e., field, ways, theme, and nature designations for each item). These data were correlated with the coordinates from the MDS solution to determine whether the dimensions were related to these item attributes.

RESULTS

Although the SMEs completed the item similarity ratings before they completed the item-objective congruence ratings, the results of the item-objective congruence ratings are presented first. These results involve all 60 items used in this study and are helpful for subsequent interpretation of the MDS results.

Item-Objective Congruence Ratings

Tables 4-4 through 4-7 summarize the results of the item-objective congruence ratings. An item was considered to be "correctly" matched to its framework designation if at least 7 of the 10 SMEs placed it in the same category that was specified in the test blueprint. In addition to providing the percentages of items correctly classified by the SMEs, these tables present the number of "unanimous" matches (i.e., all 10 SMEs correctly classified the item) and stem-and-leaf plots of the SMEs' ratings.

The ratings pertaining to the field of science dimension of the NAEP framework are presented in Table 4-4. More than half of the items (31, or 52 percent) were unanimously matched to the fields of science specified in the test blueprint. Only nine items failed to be correctly matched to their corresponding fields by at least seven SMEs, yielding an item-objective congruence index of 85 percent for the 60 items. Three of the "misclassified" items were earth science items that were classified as physical science by at least eight SMEs. Four other items were physical science items, three of which were predominantly rated as earth science and one as life science. The two remaining misclassified items were life science items, one of which nine SMEs classified as earth science; the other was classified as life science by only six SMEs. The percentages of correct classifications for the earth, life, and physical science fields were 86, 90, and 76 percent, respectively. These results indicate that in general the SMEs supported the field of science designations of the items. However, they did not "agree" with the operational content classifications for 15 percent of the 60 items.

The results for the cognitive dimension (ways of knowing and doing science) are presented in Table 4-5. The correct classifications were relatively lower for this dimension than for the field of science dimension. Using the same "7 of 10" SME criterion, only 60 percent of the items were matched to the cognitive area specified in the test blueprint. Unanimous ratings were observed for only eight items, all of which were conceptual understanding items.

TABLE 4-4 Summary of Item-Objective Congruence Ratings: Field of Science

[Stem-and-leaf plots of the SMEs' congruence ratings for the earth (22 items), life (21 items), and physical science (17 items) fields]

Summary of Content-Area Classifications

Field of      Number of    Items Classified Correctly    Items Classified Correctly
Science       Items        by All SMEs (%)               by at Least Seven SMEs (%)
Earth         22           45                            86
Life          21           71                            90
Physical      17           41                            76
Average                    53                            85

Note: "Leaves" represent the number of SMEs correctly classifying each item, with 0 indicating all 10 SMEs correctly classified the item.

The percentages of correct classifications for the conceptual understanding, practical reasoning, and scientific investigation cognitive areas were 70, 53, and 50 percent, respectively. These results suggest that the cognitive classifications of these items are more equivocal than their content classifications.

The results for the themes of science dimension are summarized in Table 4-6. The test development committee designated only 28 of the 60 items as corresponding to one of the three themes of science areas. However, the SMEs considered most of these items to be measuring this dimension. At least three SMEs linked each of the nontheme items to a theme of science area. Thus, the most common misclassification "error" made by the SMEs was classifying an item as a theme item when in fact it was not. For those items designated as theme items in the test blueprint, only 50 percent were correctly classified. The "patterns of change" theme exhibited the highest correct classification rate (six of eight items were classified correctly). The models and systems theme areas exhibited correct classification percentages of 22 and 55 percent, respectively.
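The classification rates reported in Tables 4-4 through 4-7 rest on the criterion stated above: an item counts as correctly matched when at least 7 of the 10 SMEs assign it to its blueprint category. A minimal sketch of that tally, using a hypothetical ratings matrix, follows.

```python
# Item-objective congruence: percent of items matched by all SMEs and by >= 7 of 10 SMEs.
import numpy as np

def congruence_summary(sme_labels: np.ndarray, blueprint: np.ndarray) -> dict:
    """sme_labels: (n_items, n_smes) category assigned by each SME;
    blueprint: (n_items,) category specified in the test blueprint."""
    matches = (sme_labels == blueprint[:, None]).sum(axis=1)   # SMEs agreeing per item
    return {
        "pct_unanimous": 100 * np.mean(matches == sme_labels.shape[1]),
        "pct_at_least_seven": 100 * np.mean(matches >= 7),
    }

# Hypothetical example with three items rated by ten SMEs:
blueprint = np.array(["earth", "life", "physical"])
sme_labels = np.array([["earth"] * 9 + ["physical"],
                       ["life"] * 10,
                       ["earth"] * 4 + ["physical"] * 6])
print(congruence_summary(sme_labels, blueprint))
```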

TABLE 4-5 Summary of Item-Objective Congruence Ratings: Ways of Knowing and Doing Science

[Stem-and-leaf plots of the SMEs' congruence ratings for the conceptual understanding, practical reasoning, and scientific investigation items]

Summary of Cognitive-Area Classifications

Ways of Knowing             Number of    Items Classified Correctly    Items Classified Correctly
and Doing                   Items        by All SMEs (%)               by at Least Seven SMEs (%)
Conceptual understanding    27           30                            70
Practical reasoning         17           0                             53
Scientific investigation    16           0                             50
Average                                  13                            60

Notes: "Leaves" represent the number of SMEs correctly classifying each item, with 0 indicating all 10 SMEs correctly classified the item.

Only two items were classified correctly by all 10 SMEs, both of which were "systems" items.

The item-objective congruence ratings for the nature of science dimension of the framework are summarized in Table 4-7. The SMEs were not asked to indicate whether the items were "nature of science" or "nature of technology" but rather only to indicate whether the item corresponded to the nature of science dimension. Only 10 of the 60 items were designated as corresponding to this dimension by the test development committee. Of these 10 items, 9 were correctly identified as nature of science items by at least eight SMEs; the other item was correctly classified by five of nine SMEs (one SME omitted the rating for this item).

TABLE 4-6 Summary of Item-Objective Congruence Ratings: Themes of Science

[Stem-and-leaf plots of the SMEs' ratings for the patterns of change, models, and systems theme areas]

Summary of Theme of Science Classifications

Theme                 Number of    Items Classified Correctly    Items Classified Correctly
                      Items        by All SMEs (%)               by at Least Seven SMEs (%)
No theme              32           0                             3
Patterns of change    8            0                             75
Models                9            0                             22
Systems               11           18                            55
Average                            3                             25

Notes: All 60 items are represented in each theme area. Entries indicate the number of SMEs classifying each item into the theme area, with 0 indicating all 10 SMEs and * indicating one SME. Correct classifications are indicated in boldface.

Although these results appear to support the nature of science classification, the SMEs tended to rate almost all of the items as corresponding to this dimension. For the 50 items not listed as nature of science in the test blueprint, the mean number of SMEs linking them to the nature dimension was 7.3. In fact, 20 of these items (40 percent) were unanimously judged to correspond to this dimension. Only two items were linked to this dimension by three or fewer SMEs.

Analysis of the exit survey data revealed that the SMEs were fairly confident in the validity of their item-objective congruence ratings. When asked how confident they were regarding how well their ratings reflected the way the items "should truly be classified," the median confidence rating on an eight-point scale (where 8 = very confident) was 7. The confidence ratings ranged from 5 to 8.

TABLE 4-7 Summary of Item-Objective Congruence Ratings: Nature of Science

[Stem-and-leaf plot of the number of SMEs linking each item to the nature of science dimension]

Summary of Nature of Science Classifications

Theme                   Number of    Items Classified Correctly    Items Classified Correctly
                        Items        by All SMEs (%)               by at Least Seven SMEs (%)
No theme                50           0                             4
Nature of science       8            75                            88
Nature of technology    2            50                            100
Average                              12                            18

Notes: All 60 items are represented in each theme area. Entries indicate the number of SMEs classifying each item into the theme area, with 0 indicating 10 SMEs. Correct classifications are indicated in boldface.

MDS Results

All SMEs completed the item similarity ratings within six hours. The shortest completion time was three hours, and the median completion time was 5.25 hours. Analysis of the follow-up surveys indicated that all 10 SMEs used the science discipline, cognitive level, and item format characteristics of the items in making their similarity judgments. Nine of the SMEs also reported using the difficulty level of the item, six SMEs reported using item themes, and four reported using the length of the item in making their judgments. Other similarity rating criteria reportedly used by one or more SMEs included consideration of the "learning styles of students," the number of steps required to complete a problem, item vocabulary considerations, perceived grade level of the items, and visual or reading cues. All SMEs appeared to pay particular attention to cognitive attributes of the items in responding to the open-ended question regarding the criteria

used to make their item similarity ratings. When asked how confident they were that their item similarity ratings accurately reflected the "content and cognitive similarities among the item pairs," the median confidence rating obtained (on the same eight-point scale, where 8 = very confident) was 6.5. The confidence ratings ranged from 4 (SME #10) to 8.

For each SME the six item pairings repeated in each booklet were evaluated to provide an index of the reliability of their ratings. Across these 60 ratings (10 SMEs x 6 item pairs) only one differed by as much as four points on the eight-point scale (a pair originally rated by SME #4 as 8 was later rated as 4), and two other pairs differed by three points (original ratings of 6 were later rated as 3 by SMEs #2 and #3). The vast majority of the replicated ratings (80 percent) were within one point of one another, and 38 percent were identical. In looking at the average discrepancy of ratings for each SME, 7 of the 10 SMEs had average discrepancies of less than one point across the replicated pairs. The largest discrepancy was 1.5 points, for the SME who had the four-point discrepancy noted above. The median discrepancy across the 10 SMEs was 0.73. These results suggest that in general the similarity ratings can be considered reliable; however, some specific item pairings for some SMEs are probably unreliable, which is not surprising given the large number of ratings completed. Given that the replicated ratings were made toward the end of the rating task and that the average discrepancies for these pairs were small, it does not appear that the SMEs' similarity ratings are undermined by low reliability.

INDSCAL Model Fit to the Data

Two- through six-dimensional MDS solutions were applied to the data. Model-data fit and interpretability of the solution were used to select the appropriate dimensionality of the data. The fit values of STRESS (departure of the data from the model) and R2 (proportion of variance in the SMEs' similarity data accounted for by the model) are reported in Table 4-8. Using the rules of thumb and heuristics suggested by Kruskal and Wish (1978), MacCallum (1981), and Dong (1985), at least four dimensions appear to be required to adequately fit the data. Very little improvement in fit occurs in adding a sixth dimension.

TABLE 4-8 Summary of Fit Indexes from MDS (INDSCAL) Solution

Number of Dimensions in Solution    STRESS    R2
6                                   .12       .75
5                                   .14       .75
4                                   .16       .71
3                                   .20       .70
2                                   .25       .67
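The fit values in Table 4-8 come from the weighted INDSCAL model as implemented in SPSS ALSCAL. That model is not available in common open-source libraries, but a rough, unweighted analogue of the dimensionality check can be sketched with scikit-learn's nonmetric MDS applied to a pooled dissimilarity matrix; this is an illustrative approximation under assumed placeholder data, not the analysis reported here.

```python
# Unweighted nonmetric MDS at several dimensionalities as a rough fit check.
# (The chapter's analysis used the weighted INDSCAL model in SPSS ALSCAL.)
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
n_items = 45
D = rng.uniform(1, 8, size=(n_items, n_items))   # placeholder dissimilarities
D = (D + D.T) / 2                                 # symmetrize
np.fill_diagonal(D, 0.0)                          # zero self-dissimilarity

for k in range(2, 7):
    mds = MDS(n_components=k, metric=False, dissimilarity="precomputed",
              random_state=0, n_init=4)
    mds.fit(D)
    print(k, round(mds.stress_, 3))  # lower stress indicates better fit
```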

Furthermore, all dimensions from the five-dimensional solution were interpretable (see below), but the sixth dimension in the six-dimensional solution was not readily interpretable. Thus, the five-dimensional solution was selected as the appropriate model for these data. As indicated in Table 4-8, the five-dimensional solution accounted for 75 percent of the variance in the SMEs' (transformed) similarity rating data. The total variance in these data accounted for by each dimension was 22, 18, 14, 11, and 10 percent, respectively, for dimensions one through five.

SME Congruence

The model-data fit values for each SME are presented in Table 4-9. The model fit the data for SMEs 1, 7, and 9 least well (R2 less than .7 and STRESS greater than .15); however, these levels of fit are on par with those found in previous research (e.g., Deville, 1996; Sireci and Geisinger, 1992, 1995). The congruence among the SMEs was evaluated by inspecting the individual subject weights and the subject weirdness indexes.[3] Although differences were observed in the weighting of the dimensions across SMEs, all SMEs appeared to be using all five dimensions in making their similarity ratings. Figure 4-3 presents separate two-dimensional subspaces from the five-dimensional SME weight space. These two subspaces highlight the differences among the SMEs.

TABLE 4-9 Summary of SME Fit Statistics and Dimension Weights

                           Subject Weights by Dimension
SME    Stress    R2       1      2      3      4      5      Weirdness
1      .153      .655     .48    .28    .33    .33    .36    .15
2      .125      .803     .39    .39    .60    .28    .26    .31
3      .137      .749     .54    .54    .16    .29    .24    .28
4      .140      .717     .45    .53    .27    .29    .28    .14
5      .123      .797     .63    .46    .25    .27    .25    .21
6      .121      .812     .70    .34    .28    .22    .29    .27
7      .153      .658     .32    .53    .31    .30    .30    .17
8      .129      .818     .27    .29    .73    .26    .28    .47
9      .163      .615     .27    .31    .13    .49    .43    .40
10     .098      .853     .47    .48    .21    .47    .37    .22

[3] The weirdness index describes the relative weightings of the dimensions for each subject in proportion to the average dimension weights across all subjects. A subject with a large weight on one dimension and small weights on the other dimensions would have a weirdness index near one, which is the maximum value. Subjects with dimension weights proportional to the average weights have weirdness indexes near zero, which is the minimum value (see Young and Harris, 1993, for full details).
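The weirdness index in Table 4-9 is the ALSCAL statistic described in the footnote above (Young and Harris, 1993). As a loose, hypothetical proxy for flagging raters whose weight profiles depart from the group average, one could compare each SME's weight vector with the mean profile; the sketch below is an assumption-laden stand-in, not the ALSCAL formula.

```python
# A rough proxy for rater atypicality: angular deviation of each subject's
# dimension-weight vector from the average weight vector (not the ALSCAL formula).
import numpy as np

def weight_deviation(weights: np.ndarray) -> np.ndarray:
    """weights: (n_subjects, r) dimension weights. Returns one value per subject;
    0 when a subject's weights are proportional to the mean profile."""
    mean_profile = weights.mean(axis=0)
    cos = (weights @ mean_profile) / (
        np.linalg.norm(weights, axis=1) * np.linalg.norm(mean_profile))
    return 1.0 - cos  # larger values indicate a more atypical weight profile

# Such a proxy could be applied to the subject-weight columns of Table 4-9 to
# screen for raters emphasizing one dimension far more than the group does.
```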

FIGURE 4-3 Two-dimensional subject weight subspaces: (a) dimensions 1 and 3; (b) dimensions 4 and 5. [Horizontal axes: Dimension 1 (Conceptual Understanding) in panel (a); Dimension 4 (Life vs. Earth) in panel (b).]

SME #8 exhibited the largest weirdness index, due to his relatively large emphasis on dimension 3 (see Figure 4-3a). SMEs #9 and #10 had relatively larger weights on dimensions four and five (see Figure 4-3b). As described below, these two dimensions corresponded to the field of science characteristics of the items. Thus, these two SMEs emphasized content characteristics in their similarity ratings, whereas the other SMEs tended to emphasize cognitive characteristics of the items. Although these differences are interesting, the subject weights indicate that all five dimensions were used by all SMEs in making their ratings. Thus, we turn now to interpretation of these five dimensions.

Interpreting the Dimensions

The dimensions were interpreted visually and with the assistance of statistical analyses comparing known item characteristics with the item coordinates from the MDS solution. Visual interpretations were made separately by the first author and by a science content expert from the National Academy of Sciences. The statistical analyses involved computing correlations among the MDS item coordinates and content, cognitive, and format item attributes.

Because of the overlap of item characteristics (e.g., most of the practical reasoning items were also extended constructed-response items and most of the nature of science items were scientific investigation items), the visual interpretations were able to clarify some of the multiple interpretations that could be attributed to the dimensions using only the statistical results. Based on the (subjective) visual and (objective) statistical information, the following interpretations were given to the dimensions: dimension 1 is a "conceptual understanding" cognitive dimension that separates the "lower-order" cognitive skill items (e.g., factual recognition items) from those items requiring higher-order skills (e.g., design an experiment, interpret results); dimension 2 is an item format dimension that separates the multiple-choice items from the constructed-response items; dimension 3 is a "practical/applied reasoning" cognitive dimension that separates the practical reasoning items from the scientific investigation items; dimension 4 is a content dimension that separates the life science items from the earth science items; and dimension 5 is a content dimension that separates the physical science items from the life science items. Thus, the first three dimensions are related to cognitive item attributes, and the fourth and fifth dimensions are related to content item attributes.

Figure 4-4 presents the two-dimensional item subspace for dimensions 1 and 2. A conspicuous "chasm" can be seen above the origin of dimension 1 (horizontal). This chasm roughly separates the lower cognitive level "conceptual understanding" (C) items (positive coordinates, or right side of the figure) from the higher-level "scientific investigation" (S) items (negative coordinates).

FIGURE 4-4 Two-dimensional MDS stimulus subspace: items plotted along dimensions 1 and 2 using cognitive classification symbols. C, conceptual understanding; P, practical reasoning; S, scientific investigation.

Three conceptual understanding items have negative coordinates on this dimension; however, these same three items were rated as measuring higher-level cognitive areas by the SMEs in the item-objective ratings, as described earlier. Similarly, the two scientific investigation items with positive coordinates on this dimension tended to be "misclassified" with respect to cognitive area by the SMEs. Dimension 2 (vertical) separates the practical reasoning items from the others; however, all of the practical reasoning items, except one, were also constructed-response items. Figure 4-5 presents the same configuration but labels the items according to item format. As can be seen from this figure, all of the multiple-choice items have negative coordinates on dimension 2.

Figure 4-6 presents the item configuration for the two-dimensional subspace formed by dimensions 1 and 3. All but two of the scientific investigation items have negative or near-zero coordinates on dimension 3. Both of these items exhibited low item-objective congruence for scientific investigation. Similarly, all but two of the practical reasoning items had positive coordinates on dimension 3, both of which also had low item-objective congruence ratings for the practical reasoning cognitive area. Figure 4-7 presents a three-dimensional subspace comprising the first three dimensions, which were related to cognitive area.

FIGURE 4-5 Two-dimensional MDS stimulus subspace: items plotted along dimensions 1 and 2 using item format symbols. E, extended constructed-response; M, multiple-choice; S, short constructed-response.

FIGURE 4-6 Two-dimensional MDS stimulus subspace: items plotted along dimensions 1 and 3 using cognitive classification symbols. C, conceptual understanding; P, practical reasoning; S, scientific investigation.

FIGURE 4-7 Three-dimensional MDS stimulus space illustrating cognitive groupings among grade 8 NAEP science items (dimensions 1, 2, and 3). C, conceptual understanding; P, practical reasoning; S, scientific investigation.

Although some cognitive area overlap is evident, clusters of items from the same cognitive area occupy segregated regions of the subspace. In particular, the conceptual understanding items are primarily arranged in the left side of the figure (a tight cluster of these items appears in the lower left), and the practical reasoning items are configured near the top of the space.

Figure 4-8 illustrates the two-dimensional "content" subspace formed by dimensions 4 and 5. Dimension 4 (horizontal) tended to segregate the earth science (E) items (positive coordinates) and the life science (L) items (negative coordinates). All but one of the life science items had negative coordinates on dimension 4. This item was classified as a life science item by seven of the 10 SMEs. Dimension 5 (vertical) appears to account for the degree to which the items measured physical science. Most physical science items had relatively large negative coordinates on this dimension; only one physical science item had a large positive coordinate. This item was classified as an earth science item by 8 of the 10 SMEs.

FIGURE 4-8 Two-dimensional MDS stimulus subspace: items plotted along dimensions 4 and 5 using content classification symbols. E, earth science; L, life science; P, physical science.

Although some overlap among content areas is evident, in general the items comprising the three different fields of science tend to be segregated in the subspace. In particular, most of the life science items are configured more closely to one another than they are to items from other content areas.

To assist in verifying the visual interpretations given to the dimensions, correlations were computed between the MDS coordinates and external data on the items. These external data included the item-objective congruence ratings; item format information; and the content, cognitive, nature, and theme designations of the items. The content, cognitive, nature, and theme designations were "dummy" coded for this analysis. For example, an earth science dummy variable was created by coding all earth science items "1" and all other items "0." The cognitive, theme, and nature areas were also dummy coded, as was an item format variable (multiple-choice/constructed-response). Two separate correlational analyses were conducted. The first analysis correlated the item-objective congruence ratings with the item coordinates.

To conduct this analysis, the number of SMEs categorizing an item in each content, cognitive, nature, or theme area was calculated. These sums were then correlated with the MDS coordinates. The second analysis correlated the dummy variables with the item coordinates.

The results of the correlation analyses are presented in Table 4-10 (item-objective congruence correlations) and Table 4-11 (dummy variable correlations). Both sets of correlations lead to similar conclusions regarding the item characteristics defining each dimension. However, the correlations based on the item-objective congruence data tended to be larger. The largest correlations for the coordinates on the first dimension were with the conceptual understanding and scientific investigation cognitive areas. The largest correlation for the second dimension was for the item format variable. For the third dimension, large correlations with the practical reasoning and scientific investigation cognitive areas were observed. The nature of science dummy variable also exhibited a large correlation with this dimension, but the nature of science item-objective congruence ratings did not. This finding probably stems from the fact that 5 of the 10 nature of science items were also scientific investigation items. The coordinates from the fourth and fifth dimensions exhibited large correlations with the variables associated with the field of science designations of the items. Thus, in general, the correlation analyses supported the visual interpretations given earlier.

TABLE 4-10 Correlations Among MDS Item Coordinates and Item-Objective Congruence Ratings

                                        Dimension
Item Variable                     1       2       3       4       5
Fields
  Earth science                  -.04    -.15    -.01     .61*    .21
  Life science                    .06     .22    -.04    -.65*    .48*
  Physical science               -.02    -.09     .07    -.07    -.75*
Ways of Knowing and Doing
  Conceptual understanding        .80*   -.51*   -.17     .01     .18
  Practical reasoning            -.27     .56*   -.43*   -.12     .16
  Scientific investigation       -.71*    .05     .66*    .11    -.38
Themes
  Models                          .07    -.03    -.08     .68*    .20
  Patterns                       -.57*    .10     .10    -.14    -.02
  Systems                         .43*   -.02    -.08    -.49*    .22
Nature
  Yes                            -.72*    .55*    .27     .14    -.11
  No                              .71*   -.52*   -.13    -.13     .17

*P < .01.

TABLE 4-11 Correlations Among INDSCAL Item Coordinates and Item Dummy Variables

                                        Dimension
Item Variable                     1       2       3       4       5
Fields
  Earth science                   .02    -.04     .03     .61*    .02
  Life science                   -.02     .14    -.14    -.58*    .52*
  Physical science                .00    -.10     .11    -.02    -.57*
Ways of Knowing and Doing
  Conceptual understanding        .57*   -.43*   -.03     .05     .08
  Practical reasoning            -.16     .57*    .40*   -.18     .17
  Scientific investigation       -.49*   -.12    -.40*    .14    -.28
Themes
  Models                         -.11     .06    -.11     .61*    .17
  Patterns                       -.36     .02     .02    -.10     .40*
  Systems                         .28     .09    -.14    -.23     .25
Nature
  Science                        -.44*    .29    -.46*    .08    -.04
  Technology                      .08     .24     .29    -.08     .01
Multiple choice (yes/no)          .40*   -.76*    .06     .12     .14
Difficulty                        .09    -.52*    .28     .01     .02

*P < .01.

The first three dimensions correspond to cognitive and item format attributes, and the fourth and fifth dimensions correspond to fields of science attributes. In summary, analysis of the item similarities data using MDS uncovered cognitive- and content-related dimensions that were congruent with those specified in the National Assessment Governing Board frameworks. Items that did not group together with other items in their content or cognitive area tended to be the same items identified as problem items in the analysis of the item-objective congruence ratings.
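The dummy-coding and correlation procedure behind Tables 4-10 and 4-11 can be sketched as follows; the data-frame layout, column names, and example values are assumptions for illustration, not the study data.

```python
# Correlate MDS item coordinates with dummy-coded item attributes (cf. Table 4-11).
import numpy as np
import pandas as pd

# Hypothetical inputs: coords holds the (n_items, 5) MDS coordinates; attrs holds the
# blueprint designations and item format for the same items, in the same order.
n_items = 45
rng = np.random.default_rng(0)
coords = pd.DataFrame(rng.normal(size=(n_items, 5)),
                      columns=[f"dim{d}" for d in range(1, 6)])
attrs = pd.DataFrame({
    "field": rng.choice(["earth", "life", "physical"], n_items),
    "multiple_choice": rng.integers(0, 2, n_items),
})

dummies = pd.get_dummies(attrs["field"])          # one 0/1 column per field
dummies["multiple_choice"] = attrs["multiple_choice"]

# Pearson correlation of each dummy variable with each MDS dimension.
corr = pd.concat([coords, dummies.astype(float)], axis=1).corr()
print(corr.loc[dummies.columns, coords.columns].round(2))
```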

DISCUSSION

A fundamental requirement in educational assessment is operationally defining the construct(s) measured. Content validation involves determining whether a test actually represents the intended construct. Thus, it is an important step in evaluating the validity of inferences derived from test scores. As Sireci (1998b:106) has stated, "if the sample of tasks comprising a test is not representative of the content domain tested, the test scores and item response data used in studies of construct validity are meaningless."

Tests used in NAEP are operationally defined using test frameworks. This study sought to evaluate the content validity of a particular test in the NAEP battery: the 1996 grade 8 science assessment. An independent panel of science educators was convened, and these experts provided judgments of the content characteristics of items from this test over a two-day period. Two distinct methods for evaluating content validity were used, and both methods provided similar conclusions regarding how well a carefully selected subset of items represented the framework dimensions.

Does the 1996 grade 8 NAEP science assessment measure what it purports to measure? The results from this study suggest that, in general, the two major dimensions composing the framework were supported by the SMEs' judgments. The majority of the items studied (85 percent) were judged to be measuring the content areas they were designed to measure. Although less congruence was observed for the cognitive classifications of the items, it was clear the SMEs thought that both higher- and lower-order thinking skills were measured across all three fields of science. These two major dimensions ("fields of science" and "ways of knowing and doing science") were also uncovered from the SMEs' item similarity ratings, which were gathered before the SMEs were made aware of these dimensions. Sireci (1998a, 1998b) argues that this type of rating task provides a more rigorous appraisal of content validity. Thus, the results of the item-objective congruence and MDS analyses provide strong evidence that the content and cognitive dimensions of the framework were represented well by the actual items composing the assessment. However, given that 15 percent of the studied items were classified differently by the SMEs with respect to field of science, a concern remains regarding which items to include in which field of science scale when the data are scored, calibrated, and reported. It is also interesting that the SMEs saw cognitive distinctions among the items first and foremost, before distinguishing among the items in terms of the fields of science content areas.

The item-objective congruence ratings, and the dimensions observed in the SME-derived MDS solution, did not strongly support the themes of science or nature of science dimensions of the framework. However, like the ways of knowing and doing science dimension, separate scores are not reported for these dimensions, and including them in the frameworks probably enhanced item development and contributed to the overall quality of the item pool. The lack of congruence between the SMEs and the test developers regarding these two dimensions may be due to problems in the item classifications or to a lack of clarity in the descriptions of these dimensions. Thus, the utility of these two dimensions deserves further study.

Although the results of this study are encouraging, they are limited to the 1996 grade 8 science assessment. Similar studies are recommended for other tests in the NAEP battery.

ACKNOWLEDGMENTS

The authors thank Karen Mitchell, Lee Jones, and Holly Wells for their invaluable assistance with this research and an anonymous reviewer for helpful comments on a draft of this paper.

REFERENCES

Carroll, J.D., and J.J. Chang
1970 An analysis of individual differences in multidimensional scaling via an N-way generalization of "Eckart-Young" decomposition. Psychometrika 35:238-319.

Deville, C.W.
1996 An empirical link of content and construct validity evidence. Applied Psychological Measurement 20:127-139.

Dong, H.
1985 Chance baselines for INDSCAL's goodness of fit index. Applied Psychological Measurement 9:27-30.

Ebel, R.L.
1977 Comments on some problems of employment testing. Personnel Psychology 30:55-63.

Kruskal, J.B., and M. Wish
1978 Multidimensional Scaling. Newbury Park, Calif.: Sage.

MacCallum, R.
1981 Evaluating goodness of fit in nonmetric multidimensional scaling by ALSCAL. Applied Psychological Measurement 5:377-382.

Messick, S.
1989 Validity. Pp. 13-103 in Educational Measurement, 3rd ed., R. Linn, ed. Washington, D.C.: American Council on Education.

National Assessment Governing Board (NAGB)
1996 Science Framework for the 1996 National Assessment of Educational Progress. Washington, D.C.: NAGB.

Schiffman, S.S., M.L. Reynolds, and F.W. Young
1981 Introduction to Multidimensional Scaling. New York: Academic Press.

Sireci, S.G.
1998a Gathering and analyzing content validity data. Educational Assessment 5:299-321.
1998b The construct of content validity. Social Indicators Research 45:83-117.

Sireci, S.G., and K.F. Geisinger
1992 Analyzing test content using cluster analysis and multidimensional scaling. Applied Psychological Measurement 16:17-31.
1995 Using subject matter experts to assess content representation: An MDS analysis. Applied Psychological Measurement 19:241-255.

Young, F.W., and D.F. Harris
1993 Multidimensional scaling. Pp. 155-222 in SPSS for Windows: Professional Statistics, Version 6.0, computer manual, M.J. Norusis, ed. Chicago: SPSS.
