
Grading the Nation's Report Card: Research from the Evaluation of NAEP (2000)

Chapter 8: Issues in Combining State NAEP and Main NAEP

« Previous: 7 Issues in Phasing Out Trend NAEP
Suggested Citation:"8 Issues in Combining State NAEP and Main NAEP." National Research Council. 2000. Grading the Nation's Report Card: Research from the Evaluation of NAEP. Washington, DC: The National Academies Press. doi: 10.17226/9751.
×
Page 152
Suggested Citation:"8 Issues in Combining State NAEP and Main NAEP." National Research Council. 2000. Grading the Nation's Report Card: Research from the Evaluation of NAEP. Washington, DC: The National Academies Press. doi: 10.17226/9751.
×
Page 153
Suggested Citation:"8 Issues in Combining State NAEP and Main NAEP." National Research Council. 2000. Grading the Nation's Report Card: Research from the Evaluation of NAEP. Washington, DC: The National Academies Press. doi: 10.17226/9751.
×
Page 154
Suggested Citation:"8 Issues in Combining State NAEP and Main NAEP." National Research Council. 2000. Grading the Nation's Report Card: Research from the Evaluation of NAEP. Washington, DC: The National Academies Press. doi: 10.17226/9751.
×
Page 155
Suggested Citation:"8 Issues in Combining State NAEP and Main NAEP." National Research Council. 2000. Grading the Nation's Report Card: Research from the Evaluation of NAEP. Washington, DC: The National Academies Press. doi: 10.17226/9751.
×
Page 156
Suggested Citation:"8 Issues in Combining State NAEP and Main NAEP." National Research Council. 2000. Grading the Nation's Report Card: Research from the Evaluation of NAEP. Washington, DC: The National Academies Press. doi: 10.17226/9751.
×
Page 157
Suggested Citation:"8 Issues in Combining State NAEP and Main NAEP." National Research Council. 2000. Grading the Nation's Report Card: Research from the Evaluation of NAEP. Washington, DC: The National Academies Press. doi: 10.17226/9751.
×
Page 158
Suggested Citation:"8 Issues in Combining State NAEP and Main NAEP." National Research Council. 2000. Grading the Nation's Report Card: Research from the Evaluation of NAEP. Washington, DC: The National Academies Press. doi: 10.17226/9751.
×
Page 159
Suggested Citation:"8 Issues in Combining State NAEP and Main NAEP." National Research Council. 2000. Grading the Nation's Report Card: Research from the Evaluation of NAEP. Washington, DC: The National Academies Press. doi: 10.17226/9751.
×
Page 160
Suggested Citation:"8 Issues in Combining State NAEP and Main NAEP." National Research Council. 2000. Grading the Nation's Report Card: Research from the Evaluation of NAEP. Washington, DC: The National Academies Press. doi: 10.17226/9751.
×
Page 161
Suggested Citation:"8 Issues in Combining State NAEP and Main NAEP." National Research Council. 2000. Grading the Nation's Report Card: Research from the Evaluation of NAEP. Washington, DC: The National Academies Press. doi: 10.17226/9751.
×
Page 162
Suggested Citation:"8 Issues in Combining State NAEP and Main NAEP." National Research Council. 2000. Grading the Nation's Report Card: Research from the Evaluation of NAEP. Washington, DC: The National Academies Press. doi: 10.17226/9751.
×
Page 163
Suggested Citation:"8 Issues in Combining State NAEP and Main NAEP." National Research Council. 2000. Grading the Nation's Report Card: Research from the Evaluation of NAEP. Washington, DC: The National Academies Press. doi: 10.17226/9751.
×
Page 164
Suggested Citation:"8 Issues in Combining State NAEP and Main NAEP." National Research Council. 2000. Grading the Nation's Report Card: Research from the Evaluation of NAEP. Washington, DC: The National Academies Press. doi: 10.17226/9751.
×
Page 165
Suggested Citation:"8 Issues in Combining State NAEP and Main NAEP." National Research Council. 2000. Grading the Nation's Report Card: Research from the Evaluation of NAEP. Washington, DC: The National Academies Press. doi: 10.17226/9751.
×
Page 166
Suggested Citation:"8 Issues in Combining State NAEP and Main NAEP." National Research Council. 2000. Grading the Nation's Report Card: Research from the Evaluation of NAEP. Washington, DC: The National Academies Press. doi: 10.17226/9751.
×
Page 167
Suggested Citation:"8 Issues in Combining State NAEP and Main NAEP." National Research Council. 2000. Grading the Nation's Report Card: Research from the Evaluation of NAEP. Washington, DC: The National Academies Press. doi: 10.17226/9751.
×
Page 168
Suggested Citation:"8 Issues in Combining State NAEP and Main NAEP." National Research Council. 2000. Grading the Nation's Report Card: Research from the Evaluation of NAEP. Washington, DC: The National Academies Press. doi: 10.17226/9751.
×
Page 169
Suggested Citation:"8 Issues in Combining State NAEP and Main NAEP." National Research Council. 2000. Grading the Nation's Report Card: Research from the Evaluation of NAEP. Washington, DC: The National Academies Press. doi: 10.17226/9751.
×
Page 170
Suggested Citation:"8 Issues in Combining State NAEP and Main NAEP." National Research Council. 2000. Grading the Nation's Report Card: Research from the Evaluation of NAEP. Washington, DC: The National Academies Press. doi: 10.17226/9751.
×
Page 171


8 Issues in Combining State NAEP and Main NAEP

Michael J. Kolen

Separate data collections are used in the main National Assessment of Educational Progress (NAEP) and the state NAEP. To address concerns that the separate data collections might place too large a burden on the states, this paper examines options for combining main and state NAEP designs. State NAEP is described, and important differences between main NAEP and state NAEP are highlighted. Designs that have been proposed for merging the two data collections are discussed, with a focus on how the sample designs interact with operational and measurement concerns. Conclusions and recommendations are presented.

Significant administration differences exist between main NAEP and state NAEP, and they make combining the two difficult. These differences currently are addressed by adjusting state NAEP scores. It is argued that even with these adjustments, contradictory findings and complications are apparent, especially when making criterion-referenced interpretations of NAEP scores. The administration differences also make implementation of any of the designs for combining main NAEP and state NAEP questionable. Suggestions are made to consider using the same recruitment and administration conditions for main NAEP and state NAEP. The strengths and weaknesses of various designs for combining main and state NAEP are discussed.

INTRODUCTION

NAEP "is mandated by Congress to survey the educational accomplishments of U.S. students and to monitor changes in those accomplishments" (Ballator, 1996:1).

Originally, NAEP surveyed educational accomplishments and long-term trends with a single assessment. Because of continual changes in the assessments, NAEP has evolved into a collection of state and national assessments. Main NAEP is designed to be flexible enough to adapt to changes in assessment approaches. Long-term trend NAEP is intentionally constructed and administered to be stable so that trends in student performance can be examined over time. Whereas main NAEP and long-term trend NAEP focus on assessing achievement for the nation and for various subgroups of students, state NAEP, the most recent addition to NAEP, focuses on achievement of students by state.

The National Assessment Governing Board (NAGB) oversees policy for the NAEP program and has called for NAEP to be redesigned (NAGB, 1996). NAGB has expressed concern about the burden placed on states by having separate state NAEP and main NAEP data collections. To address this concern, NAGB (1996:7) has stated that, "where possible, changes in national and state sampling procedures shall be made that will reduce [the] burden on states, increase efficiency, and save costs." As part of its evaluation of NAEP, the National Research Council commissioned this paper to examine options for combining main and state NAEP designs.

This paper starts by describing state NAEP and highlighting important differences between main NAEP and state NAEP. A discussion follows of designs that have been proposed for merging the two data collections, either by first selecting a national sample and then building state samples or by selecting state samples and then determining which subset of those data could serve as the national sample. The focus of these discussions is on how the sample designs interact with operational and measurement concerns. Finally, conclusions and recommendations are presented.

COMPARISON OF MAIN NAEP AND STATE NAEP

The main NAEP and long-term trend NAEP assessments were not designed to produce state-level data. To explore the possibility of NAEP providing data at the state level, voluntary trial state NAEP assessments were conducted in 1990, 1992, and 1994 that produced state-level data to compare states to one another and to the nation as a whole. These assessments were considered trial assessments because of concerns about their usefulness. Potential benefits of state-level NAEP data are summarized by Phillips (1991) and potential problems by Koretz (1991) and Jones (1996). The National Academy of Education Panel (1993) that evaluated trial state NAEP recommended that it be continued but with ongoing evaluation and congressional oversight. In 1996 the term trial was removed from the title, and the assessments are now referred to as state NAEP. Recently, others have discussed issues in combining state NAEP and main NAEP, including Forsyth et al. (1996), Glaser et al. (1997), Mullis (1997), Rust (1996), Rust and Shaffer (1997), and Spencer (1996).

Content of the Assessments

The state NAEP and main NAEP assessment administrations since 1986 and those planned through 2008 are listed in Table 8-1. The table indicates that, although main NAEP typically is administered in grades 4, 8, and 12, state NAEP typically is administered only in grades 4 and 8. In addition, main NAEP is administered in more subject-matter areas. The subject areas for the early state NAEP assessments were only loosely related to those for main NAEP. However, beginning in 1996 and in future plans, state NAEP mathematics and science assessments are to be given in the same years as the main NAEP mathematics and science assessments. A similar statement can be made about the reading and writing assessments.

TABLE 8-1  Main NAEP and State NAEP Assessments by Year Since 1986 (a)

Year   Main NAEP (grades 4, 8, and 12 except where noted)    State (or Trial State) NAEP
1986   Reading, Mathematics, Science, Computer Competence    -
1988   Reading, Writing, Civics, U.S. History                -
1990   Reading, Mathematics, Science                         Mathematics (grade 8)
1992   Reading, Writing, Mathematics                         Mathematics (grades 4 and 8), Reading (grade 4)
1994   Reading, U.S. History, Geography                      Reading (grade 4)
1996   Mathematics, Science                                  Mathematics (grades 4 and 8), Science (grade 8)
1997   Arts (grade 8)                                        -
1998   Reading, Writing, Civics                              Reading (grades 4 and 8), Writing (grade 8)
1999   -                                                     -
2000   Mathematics, Science                                  Mathematics (grades 4 and 8), Science (grades 4 and 8)
2001   U.S. History, Geography                               -
2002   Reading, Writing                                      Reading (grades 4 and 8), Writing (grades 4 and 8)
2003   Civics, Foreign Language (grade 12 only)              -
2004   Mathematics, Science                                  Mathematics (grades 4 and 8), Science (grades 4 and 8)
2005   World History, Economics                              -
2006   Reading, Writing                                      Reading (grades 4 and 8), Writing (grades 4 and 8)
2007   Arts                                                  -
2008   Mathematics, Science                                  Mathematics (grades 4 and 8), Science (grades 4 and 8)

(a) Assessments administered from 1986 to 1994 are adapted from Allen et al. (1996); small special-interest assessments are not shown. Assessments administered from 1996 to 2008 are from National Assessment Governing Board (1997). Future assessments reflect plans.

In recent state NAEP assessments (Allen and Mazzeo, 1997; Allen et al., 1997) the assessment exercises used in state NAEP have been identical to ones used in main NAEP. In addition, the scores from state NAEP have been reported on the NAEP proficiency scale.

Administration Procedures

State NAEP and main NAEP differ in administration procedures. According to Allen et al. (1997:13):

    The state assessments differed from the national assessment in one important regard: Westat [NAEP contractor] staff collected the data for the national assessment while, in accordance with the NAEP legislation, data collection activities for the state assessment were the responsibility of each participating jurisdiction. These activities included ensuring the participation of selected schools and students, assessing students according to standardized procedures, and observing procedures for test security.

Linking State NAEP to Main NAEP

Recognizing that these differences in administration procedures might cause differences in assessment results, linking studies have been conducted by the National Center for Education Statistics (NCES) and its contractors to estimate the effects of administration differences and to adjust scale scores for any effects that exist. The rationale for these studies has been described by Yamamoto and Mazzeo (1992:168) and is summarized here:

    Because the assessment instruments for [trial state NAEP and main NAEP] were identical, one of the common-item approaches to linking the scales might have been considered. However, the rationale for such an approach is based on an assumption that the item response functions for the . . . items were the same under the [trial state NAEP and main NAEP] . . . test administration conditions. The aforementioned considerations [differences in administration conditions], as well as data from the assessment itself, suggest otherwise.

Thus, although the same items are used in state NAEP and main NAEP, concerns about the effects of differences in administration procedures led to the decision to scale the two assessments independently. The linking studies that have been conducted use a common-person design, in which a sample of examinees from main NAEP is matched to the state NAEP sample. These linking studies have not only estimated the size of the effects of differences in administration conditions but also attempted to adjust for them. Allen et al. (1997:16) described the linking study for the 1996 state assessment in mathematics, which is typical of these linking studies, as follows:

    The results from the state assessment program were linked to those from the national assessment through linking functions determined by comparing the results for the aggregate of all fourth- and eighth-grade public-school students assessed in the state assessment with the results for public-school students of the matching grade within a subsample (the National Linking sample) of the national NAEP sample. The National Linking sample for a given grade is a representative sample of the population of all grade-eligible public-school students within the aggregate of the 45 participating states and the District of Columbia (excluding Guam and the two DoDEA jurisdictions). Specifically, the grade 4 National Linking sample consists of all fourth-grade students in public schools in the states and the District of Columbia who were assessed in the national mathematics assessment. The grade 8 National Linking sample is equivalently defined for eighth-grade students who participated in the national assessment. . . . Each mathematics content strand scale was linked by matching the mean and standard deviation of the scale score averages across all fourth- or eighth-grade students in the matching grade National Linking sample.

Thus, the linking sample for main NAEP is a subset of main NAEP that is matched as closely as possible with the state samples. Such linking studies appear to have been successful in adjusting for administration differences, to the extent that the distribution of scale scores for the matched sample for state NAEP was found to be acceptably close to the distribution of scale scores for the main NAEP matched sample (see, e.g., Allen and Mazzeo, 1997, and Allen et al., 1997).

Magnitude of the Effects of Administration Differences

Because the main and state NAEP assessments used exactly the same exercises, the effects of the different administration procedures can be investigated directly by comparing the proportion correct on items from the two matched samples. If there were no administration differences between the two assessments, the proportion correct, apart from sampling error, would be the same for the two assessments for each item and on average over all items. However, when these linking studies have been conducted, it has been found repeatedly that the average proportion correct on state NAEP tends to be higher than the average proportion correct on main NAEP for the matched samples. This finding suggests that, on average, students can be expected to correctly answer more items when an assessment is administered under state NAEP administration procedures than when an assessment with identical questions is administered under main NAEP administration conditions. Yamamoto and Mazzeo (1992) reported that for the matched samples in the 1990 trial state assessment in mathematics the average proportion correct was .02 higher on the trial state NAEP than on the main NAEP assessment. In another example, based on linking studies for the NAEP reading assessment, Spencer (1996) reported nearly a .01 difference in average proportion correct in 1992 and a difference of .03 in 1994.
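The linking described in the quotation above is a linear adjustment that matches the mean and standard deviation of the state results to those of the matched national subsample. The sketch below illustrates that general idea with simulated scores; it is not the operational NAEP procedure (which works with plausible values and sampling weights), and the sample sizes, means, and standard deviations are invented.

```python
import numpy as np

def linear_linking(state_scores, national_scores):
    """Return a function that maps state-metric scores onto the national metric
    by matching the mean and standard deviation of the two matched samples
    (a sketch of mean/sigma linking, not the operational NAEP procedure)."""
    m_s, s_s = np.mean(state_scores), np.std(state_scores, ddof=1)
    m_n, s_n = np.mean(national_scores), np.std(national_scores, ddof=1)
    slope = s_n / s_s
    intercept = m_n - slope * m_s
    return lambda x: intercept + slope * np.asarray(x)

# Hypothetical matched-sample score distributions.
rng = np.random.default_rng(0)
state_sample = rng.normal(273, 34, size=2000)     # state NAEP metric
national_sample = rng.normal(270, 36, size=2000)  # main NAEP metric

link = linear_linking(state_sample, national_sample)
print(link(273))  # a state-metric score of 273 maps near the national mean of 270
```

Because the matched state sample scores higher on the same items, the fitted linking function pulls state results downward, which is the adjustment discussed in the remainder of this section.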

Administration Differences Responsible for Differences in Assessment Results

Results of the linking studies indicate that some aspects of the differences in administration of the two assessments are producing systematic differences in the average proportion correct on the two assessments. Hartka and McLaughlin (1993) identified motivational differences as one possible explanation and speculated that:

    One condition that might lead to higher scores on the TSA [trial state NAEP] is higher motivation among students. In the TSA, quality control monitors recorded instances of local school personnel giving students incentives to participate. . . . Another possibility is that different personnel administering the assessments (Westat staff for national [main] NAEP and local school personnel for the TSA) created different climates in the schools and that this contributed to the difference in performance between the national and TSA samples.

Spencer (1996) reported that there may be differences in participation rates for the two assessments. He presented data for the 1994 trial state assessment indicating that the overall percentage of sampled schools that participated was lower than for main NAEP in 1994, whereas the percentage of students participating within schools was higher for the trial state NAEP than for main NAEP. Hartka and McLaughlin (1993) found differences in some of the background characteristics of students participating in state NAEP and main NAEP. Although many possible reasons for the differences might exist, Spencer pointed out that it can be difficult to assess the importance of each aspect of these administration differences.

Implications of Differences for Score Interpretation

Apparently, the linking studies that adjust for differences in administration conditions have the following goal: the scale scores reported for a particular state should reflect the scale scores that state would have received had the state assessment been administered under the conditions used to administer the main NAEP assessment. Various assumptions are implicit in conducting these linking studies, and a single set of linking constants is applied for all jurisdictions. This procedure seems sensible insofar as administration differences between main NAEP and state NAEP are the same from state to state. However, it seems likely that administration conditions differ across states. If so, the assessments would be more accurate for some states than for others. The overall adjustment would be unable to correct for these differences in accuracy.

Consider the following hypothetical illustration. States 1 and 2 have the same mean scale scores as the nation if the assessment is administered under main NAEP administration conditions. This common average scale score is 270, and the average percentage of the exercises correct is 60 percent for the two states and the nation. When state NAEP is actually administered, state 1 carefully follows the prescribed administration conditions, and the average percentage of the exercises correct for state 1 is 60 percent. State 2 is not so careful in following the administration procedures, and its average percentage of the exercises correct is 64 percent. Also, over all states the average percentage of items correct in state NAEP is 62 percent. State NAEP is then linked to main NAEP. Based on this study, a state with an average percentage of items correct in state NAEP of 62 percent will have an average scale score of 270. Following this linking study, state 1 earns an average scale score below 270, which is below the average for the nation and below the average for state 2. State 2 earns an average scale score above 270, which is above the average for the nation and above the average for state 1. In effect, state 1 has been penalized for carefully following the administration procedures, and state 2 has been rewarded for not taking as much care. This sort of situation, while presented in a hypothetical example, is bound to occur if there is variation across states in the effects of administration procedures on NAEP performance. An overall adjustment, like the one currently applied, is unable to remove these sorts of inequities that result from administration differences from one state to another.
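The arithmetic behind this illustration can be made explicit. The sketch below assumes, purely for illustration, a linking line that passes through the point (62 percent correct, scale score 270) with a slope of 2.5 scale-score points per percentage point; the slope is invented, and only the direction of the effect matters.

```python
# Hypothetical illustration of how a single nationwide linking adjustment can
# penalize a state that follows the administration procedures carefully.
# The slope of 2.5 scale points per percentage point is an assumption made
# only so the example yields concrete numbers.
SLOPE = 2.5
ANCHOR_PCT, ANCHOR_SCALE = 62.0, 270.0  # overall state NAEP percent correct maps to 270

def linked_scale_score(pct_correct):
    """Map a state's average percent correct to its reported scale score."""
    return ANCHOR_SCALE + SLOPE * (pct_correct - ANCHOR_PCT)

# Both states would score 270 (60 percent correct) under main NAEP conditions.
states = {"State 1 (careful administration)": 60.0,
          "State 2 (lax administration)": 64.0}
for name, pct in states.items():
    print(f"{name}: reported scale score {linked_scale_score(pct):.0f}")
# State 1 is reported below 270 and State 2 above 270, even though both are
# truly at the national average of 270.
```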

The conditions that require a study for linking state NAEP to main NAEP can also lead to apparent contradictions in statistics that are reported with state NAEP. These contradictions are apparent when comparing the states to the nation on statistics that are based on percentages of items correct. Table 8-2 presents scale scores and percentages correct for the nation and for the states of New York, Delaware, and Arizona for the 1992 NAEP trial state assessment in mathematics for eighth grade.

TABLE 8-2  Main NAEP and State NAEP Mean Scale Scores and Average Percentage Correct for the Nation and Three States in the 1992 State and Main NAEP Mathematics Assessments

Index                                          Nation   New York   Arizona   Delaware
Scale Score
  Overall                                      266      266        265       262
  Numbers and operations                       270      270        269       267
  Measurement                                  264      262        264       258
  Geometry                                     262      261        260       257
  Data analysis, statistics, and probability   267      268        265       262
  Algebra and functions                        266      265        264       263
Percentage Correct (Multiple-Choice and Constructed-Response)
  Overall                                      54       56         55        54
  Numbers and operations                       62       64         63        62
  Measurement                                  51       52         52        50
  Geometry                                     52       54         53        52
  Data analysis, statistics, and probability   48       51         48        48
  Algebra and functions                        51       53         51        50

Source: National Center for Education Statistics (1993:43, 126, 341).

Scale scores are presented in the top portion of the table. New York has the same average overall scale score as the nation; the average overall scale score for Arizona is slightly below that for the nation; and the average overall scale score for Delaware is four points below that for the nation. Comparisons of the five subscales lead to similar conclusions about how the states compare to the nation. These scale score averages incorporate the adjustments from the study that linked state NAEP to the main NAEP scale.

Average percentages correct over multiple-choice and constructed-response items are given in the bottom portion of the table. On average, New York correctly answered 2 percent more of the items than were answered correctly in the nation. Thus, based on the bottom portion of the table, New York appears to be higher performing than the nation. Although Delaware performed more poorly than the nation based on scale scores, the state performed similarly based on average percentage correct. Some contradictory conclusions result from inspection of this table. Arizona is below the national average on scale scores but is, on average, able to answer more items correctly than the nation. New York is at the national average based on scale scores but above the national average based on percentage correct. Delaware is below the national average on scale scores but at the national average based on percentage correct. In National Center for Education Statistics (1993:46), of the 44 jurisdictions shown, 50 percent are above the national average in scale score; however, over 61 percent of these 44 jurisdictions are above the national average based on percentage correct. These contradictions arise because the scale score statistics reported for states are adjusted for administration differences, whereas percentage-correct scores are not adjusted.

Such contradictions and other related issues that result from the need to conduct linking studies are particularly troublesome in the more criterion-referenced uses of NAEP. One of the related issues is that IRT (item response theory) parameter estimates for a given item could differ considerably from the main NAEP to the state NAEP assessment.

Implications of Differences for Item Maps and Achievement Levels

Item maps and achievement levels are two of the procedures used to help policy makers and the public better understand NAEP results. In item maps, various scale score levels are chosen and items are found that discriminate between pairs of adjacent levels. The following example, based on the 1996 NAEP mathematics assessment, is taken from Reese et al. (1997:9):

    To better illustrate the NAEP mathematics scale, questions from the assessment are mapped onto the 0-to-500 scale at each grade level. These item maps are visual representations that compare questions with ability, and they indicate which questions a student can likely solve at a given performance level as measured on the NAEP scale. . . . As an example of how to interpret the item maps, consider a multiple-choice question that requires students to identify cylindrical shapes and maps at a scale score of 208 for grade 4. . . . Mapping a question at a score of 208 implies that students performing at or above this level on the NAEP mathematics scale have a 74 percent or greater chance of correctly answering this particular question. Students performing at a level lower than 208 would have less than a 74 percent chance of correctly answering the question. . . . As another example, consider a constructed-response question that requires students to partition the area of a rectangle and maps at a score of 272 for grade 8. . . . Scoring of this response allows for partial credit by using a four-point scoring guide. Mapping a question at a score of 272 implies that students performing at or above this level have a 65 percent or greater chance of receiving a score of 3 (Satisfactory) or 4 (Complete) on the question. Students performing at a level lower than 272 would have less than a 65 percent chance of receiving such a score.

Reese et al. (1997:9, fn. 6) go on to say that:

    For constructed-response questions a criterion of 65 percent was used. For multiple-choice questions with four or five alternatives, the criteria were 74 and 72 percent, respectively. The use of higher criteria for multiple-choice questions reflected students' ability to "guess" the correct answer from among the alternatives.

Main NAEP data are used to construct the item maps. Recall that students tend to score higher under state NAEP administration conditions than under main NAEP administration conditions. So on state NAEP, students at a particular ability would tend to have a greater chance of correctly answering particular multiple-choice items and a greater chance of receiving higher scores on constructed-response items than the item maps would imply. Alternatively, if the item maps had been constructed using state NAEP data, the items would have tended to be mapped at a higher score level than they were mapped using main NAEP data. Also, the parameter estimates for individual items on state NAEP differ from those on main NAEP. Therefore, if the item maps had been constructed using state NAEP item parameter estimates instead of main NAEP parameter estimates, the item mapping for particular items could differ considerably, possibly in either direction.
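To make the mapping rule concrete, the sketch below finds the scale point at which a multiple-choice item first reaches the 74 percent response-probability criterion under a three-parameter logistic item response function. The item parameters and the conversion from the latent trait to a 0-to-500 reporting metric are invented for illustration, and the operational NAEP item-mapping procedure involves additional steps not shown here.

```python
import math

def p_correct_3pl(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

def map_point(a, b, c, criterion=0.74, lo=-4.0, hi=4.0):
    """Scale point (theta) at which the response probability first reaches
    the mapping criterion, found by bisection on the increasing curve."""
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if p_correct_3pl(mid, a, b, c) < criterion:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical 3PL parameters for a four-option multiple-choice item.
a, b, c = 1.1, -0.3, 0.20
theta_map = map_point(a, b, c, criterion=0.74)
# Convert theta to a hypothetical 0-500 reporting metric (mean 250, SD 50).
print(round(250 + 50 * theta_map))
```

If the item parameters were estimated from state NAEP data instead of main NAEP data, both the curve and the resulting mapped score could shift, which is the point made in the paragraph above.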

Achievement levels are another means used to enhance the interpretability of NAEP results. As stated in Reese et al. (1997:42), a judgmental process is used to set achievement levels:

    The result of the achievement level-setting process is a set of achievement level descriptions and a set of achievement level cutpoints on the 500-point NAEP scale. The cutpoints are minimum scores that define Basic, Proficient, and Advanced performance at grades 4, 8, and 12. . . . The results are based on the judgments of panels, approved by NAGB, of what Basic, Proficient, and Advanced students should know and be able to do in mathematics, as well as on their judgments regarding what percent of students at the borderline for each level should answer each question correctly. The latter information is used in translating the achievement level descriptions into cutpoints on the NAEP scale.

As with the item maps, achievement levels are set using main NAEP data. It is likely that somewhat different achievement descriptions and cutpoints would emerge from the achievement-level-setting process if state NAEP data were used instead of main NAEP data. For score-reporting purposes, the percentage of examinees in a state who are reported to score at or above a particular achievement level is based on score distributions that have been adjusted in the study that linked state NAEP to main NAEP. To the extent that students earn higher scores on state NAEP than on main NAEP, the effect of this adjustment is to lower the percentages at or above each cutpoint for state NAEP. That is, on state NAEP there is a tendency for a greater proportion of students to score at or above each achievement level than the proportions reported in the state NAEP program.

To handle the effects on reported scores of the administration differences between state NAEP and main NAEP, a decision was made to adjust the state NAEP scores. While understandable and possibly the best decision given the circumstances, this decision can lead to potential misinterpretations and inaccuracies in interpreting scores from state NAEP. These problems seem most serious when attempting to make criterion-referenced interpretations of scores, such as those made with item maps and achievement levels.
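The direction of the cutpoint effect described above is easy to see numerically. In the sketch below, the score distribution, the cutpoints, and the size of the downward linking adjustment are all invented; only the qualitative point, that subtracting a few points lowers the percentage at or above each cutpoint, reflects the discussion in the text.

```python
import numpy as np

# Hypothetical achievement-level cutpoints on a 0-500 reporting scale.
CUTPOINTS = {"Basic": 262, "Proficient": 299, "Advanced": 333}

def pct_at_or_above(scores, cutpoints):
    """Percentage of scores at or above each cutpoint."""
    return {level: round(100 * float(np.mean(scores >= cut)), 1)
            for level, cut in cutpoints.items()}

rng = np.random.default_rng(1)
raw_state_scores = rng.normal(277, 36, size=5000)  # scores under state NAEP conditions
adjusted_scores = raw_state_scores - 4.0           # assumed downward linking adjustment

print("Unadjusted:", pct_at_or_above(raw_state_scores, CUTPOINTS))
print("Adjusted:  ", pct_at_or_above(adjusted_scores, CUTPOINTS))
# The adjusted percentages at or above every level are lower, which is the
# effect of the linking adjustment described in the text.
```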

DESIGNS FOR COMBINING STATE AND MAIN NAEP SAMPLES

In this section, issues in developing designs for combining state and main NAEP samples are discussed. Currently, sampling, administration, and analysis (other than the study used to adjust for administration differences) are done separately for state and main NAEP. Rust (1996) suggested three general approaches to combining state and main NAEP. In one approach the sampling and administration continue to be separate, but the analyses are based on pooled data. In another approach a national sample is drawn and supplemented as necessary to obtain an adequate state sample. Finally, samples are drawn from each state and supplemented as necessary to obtain an adequate national sample. Specific proposals presented by Rust and Shaffer (1997) and Spencer (1996) for implementing these general approaches are discussed here.

This discussion of sampling procedures relies heavily on work by sampling statisticians, including Rust (1996), Rust and Johnson (1992), Rust and Shaffer (1997), and Spencer (1996). The designs suggested in these papers are reviewed here; the designs are summarized, and how they interact with various administrative and measurement issues is evaluated. The focus is on practical design issues; there is no intent to provide a sampling statistician's perspective on these issues.

Independent samples of schools are used in main and state NAEP, and different designs currently are used for selecting samples in the two programs. Efforts are made to ensure that no one school is included in both samples. In addition, as is discussed, the sampling designs used in the two programs have important differences.

In the schedule for future assessments, as shown in Table 8-1, more subject areas and more grades will be included in main NAEP than in state NAEP. However, in the future main and state NAEP will assess grade 4 and grade 8 mathematics and science in the same years (e.g., 2000, 2004, and 2008) and grade 4 and grade 8 reading and writing in the same years (e.g., 2002 and 2006). The following discussion of combining the state and main NAEP samples pertains only to these combinations of grade, test, and year.

As stated by Rust and Johnson (1992:127), "the NAEP sampling and weighting procedures are designed to obtain sample data that permit estimates of subpopulation characteristics of reasonably high precision." The precision targets are stated ahead of time, and samples are designed to meet these targets.

Current Design for Main NAEP

The goal of the main NAEP sample design is to adequately represent the population of students in the United States in a particular grade, as well as certain subpopulations. According to Rust and Johnson (1992), the main NAEP samples are drawn using a multistage probability sampling design with three stages of selection. The three stages are summarized as follows:

Stage 1. The United States is divided into approximately 1,000 geographical areas. A sample of these geographical areas is selected.

Stage 2. A sample of schools is selected from within the selected geographical areas.

Stage 3. A sample of students is selected from within the selected schools.

According to Rust and Johnson (1992:112), Stage 1 is used "to make feasible the task of recruiting and training staff to administer the tests in a cost effective manner" because the assessments will be given in only a small number of geographical areas (e.g., Rust and Johnson, 1992, reported that in main NAEP in 1990 only 94 of the geographical areas were selected). Stratification and weighting procedures are used to ensure that the sample is representative and that the desired levels of precision are attained. In addition, procedures are used to deal with schools that are selected but decline to participate. Recruiting of schools and test administration are done centrally by a single NAEP contractor. Data analysis for main NAEP is conducted using the national data only.
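The sketch below illustrates the general shape of such a multistage selection, using simple random sampling at each stage for clarity. The operational design uses stratification, probability-proportional-to-size selection, and weighting, none of which is shown, and the frame, sample sizes, and identifiers are hypothetical.

```python
import random

def three_stage_sample(areas, n_areas, n_schools_per_area, n_students_per_school,
                       seed=0):
    """Toy three-stage selection: geographic areas, then schools within the
    selected areas, then students within the selected schools.  Each
    areas[area] is assumed to map a school id to its list of students."""
    rng = random.Random(seed)
    sampled = []
    for area in rng.sample(list(areas), n_areas):                     # stage 1
        schools = areas[area]
        for school in rng.sample(list(schools),
                                 min(n_schools_per_area, len(schools))):  # stage 2
            students = schools[school]
            chosen = rng.sample(students,
                                min(n_students_per_school, len(students)))  # stage 3
            sampled.extend((area, school, s) for s in chosen)
    return sampled

# Hypothetical frame: 10 areas, each with 20 schools of 60 students.
frame = {f"area{a}": {f"school{a}_{s}": [f"stu{a}_{s}_{i}" for i in range(60)]
                      for s in range(20)}
         for a in range(10)}
sample = three_stage_sample(frame, n_areas=3, n_schools_per_area=4,
                            n_students_per_school=25)
print(len(sample))  # 3 areas x 4 schools x 25 students = 300 sampled students
```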

Current Design for State NAEP

The goal of the state NAEP sample design is to adequately represent the population of students in a given state in a particular grade, as well as certain subpopulations. To reduce the burden on schools, efforts are made to ensure that schools chosen for state NAEP are not in main NAEP. The two-stage probability sample used in each state that participates in state NAEP is summarized as follows:

Stage 1. A sample of schools is selected from within the state.

Stage 2. A sample of students is selected from within the selected schools.

Stratification and weighting procedures are used to ensure that the sample is representative and that the desired levels of precision are attained. In addition, procedures are used to deal with schools that are selected but decline to participate. See Rust and Johnson (1992) for more detail. Recruiting of schools and test administration are conducted by personnel in the state. As indicated earlier, a linking study is used to adjust state NAEP results for differences in administration conditions between state NAEP and main NAEP. Recall that a single set of linking functions is developed and used to adjust the results for all states. Apart from using main NAEP data to estimate linking functions, data analysis for state NAEP is conducted using the state data only. Some possibilities for combining the main and state NAEP sample designs and/or data analyses follow.

Spencer's (1996) Approaches

One way to combine the two assessments, referred to here as Spencer's Approach 1, uses the current designs and administration procedures for both assessments and then pools the data during analysis. The potential benefit of this procedure is that sampling error could be reduced for national and regional statistics by including the state data along with the main NAEP data. In addition, the sampling error for the state statistics could be reduced by using main NAEP data from a state along with the state NAEP data from that state. However, combining the main and state NAEP data relies heavily on the linking study used to adjust state NAEP scores for differences in administration conditions between main and state NAEP. As Spencer (1996) pointed out, the linking adjustment introduces error, and it would be necessary to ensure that the random error and bias due to linking are negligible; otherwise, this approach could increase error. Spencer also pointed out that there would be some additional costs associated with conducting the analyses, creating new weights, and estimating standard errors. He recommended further study of this possibility.
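One simple way to see the potential precision gain from pooling is an inverse-variance weighted combination of the state and national estimates of the same quantity. This is a generic composite estimator, not the procedure Spencer analyzed, and it treats the two estimates as independent and the linking adjustment as error-free, which is exactly the assumption questioned above; all numbers below are hypothetical.

```python
def pooled_estimate(est_a, se_a, est_b, se_b):
    """Inverse-variance weighted combination of two independent estimates
    of the same quantity, with the standard error of the combination."""
    w_a, w_b = 1.0 / se_a**2, 1.0 / se_b**2
    combined = (w_a * est_a + w_b * est_b) / (w_a + w_b)
    combined_se = (w_a + w_b) ** -0.5
    return combined, combined_se

# Hypothetical state mean scale scores: one from the state NAEP sample (after
# the linking adjustment) and one from the main NAEP schools in that state.
state_mean, state_se = 266.0, 1.1
main_mean, main_se = 268.0, 2.4
est, se = pooled_estimate(state_mean, state_se, main_mean, main_se)
print(f"pooled mean {est:.1f}, standard error {se:.2f}")
# The pooled standard error is smaller than either input, but only if the
# linking adjustment contributes negligible additional error and bias.
```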

Spencer considered a second possibility, referred to here as Spencer's Approach 2, intended to save money and increase precision by combining the sampling designs for main and state NAEP into one integrated design. He presented the following possibility: "Select the national sample and see how many schools fall in each state. Then draw an additional sample of schools in each state in state NAEP to meet the target precision for that state" (Spencer, 1996:54). In this design, therefore, the current main NAEP sampling plan is used, but the state plan is modified. For main NAEP, recruiting of schools and test administration are still done centrally by a single NAEP contractor. For the additional schools selected in each state, recruiting of schools and test administration are still done by state personnel. Spencer also suggested that, as with Spencer's Approach 1, sampling error for main NAEP might be reduced if data from main NAEP and state NAEP were pooled for main NAEP analyses.

Preliminary analyses conducted by Spencer suggested that Spencer's Approach 2 leads to approximately a 6 percent reduction in the sample size for state NAEP, which results in significant cost savings in test materials, booklet processing, test scoring, and other administration costs. As with Spencer's Approach 1, there are some (relatively small) additional costs associated with conducting the analyses. Note that under this design, to meet target precision for the states, it is necessary to pool data from the state and main NAEP samples. These precision targets could be met only if the random error and bias due to linking are negligible. Spencer recommended that this design be studied further.

Spencer also considered a third possibility, referred to here as Spencer's Approach 3, that saves even more money and reduces the sample size for main NAEP. He suggested the following possibility: "Select the state NAEP sample first and then draw a supplemental sample to yield a national sample meeting the target levels of precision overall and for subgroups. These target levels of precision would be met both for the subjects and grades covered in state NAEP and also for those not covered" (Spencer, 1996:55). Spencer demonstrated that this design leads to substantial savings beyond those for Spencer's Approach 2. However, he pointed out that implementing this possibility requires that "decisions about what states will participate in state NAEP and what subjects will be covered must be made before combined NAEP can be designed. . . . Success would seem unlikely" (Spencer, 1996:55). The concerns regarding linking error in this design are even more severe than they are for Spencer's Approach 2, because for Spencer's Approach 3 it is necessary to pool data from the state and main NAEP samples to meet target precision for main NAEP.

Rust and Shaffer's (1997) Sampling Possibilities

Rust and Shaffer (1997) compared three sample designs.

The first design, referred to here as Rust and Shaffer's Approach 1, involves combining the separate samples that are currently used in main and state NAEP. This design is essentially the same as Spencer's Approach 1.

In their second proposed design, referred to here as Rust and Shaffer's Approach 2, they moved away from use of the 1,000 geographical areas that are currently used for main NAEP.[1] They proposed using the following procedures:

Stage 1. A sample of schools is selected from within each state that results in precision comparable to current state NAEP.

Stage 2a. Among the selected schools in each state, designate a subset as national schools (with a minimum of two schools per state). Over all states, the results from just these schools would result in precision comparable to current national NAEP. The number of national schools selected in this way is comparable to the current number of national schools.

Stage 2b. Among the selected schools, those not designated as national schools are designated as state schools.

Stage 3a. A sample of students is selected from the selected national schools.

Stage 3b. Only if a state agrees to participate, a sample of students is selected from the selected state schools.

Stratification and weighting procedures are used to ensure that the sample is representative and that the desired levels of precision are attained. In addition, procedures are used to deal with schools that are selected but decline to participate. As is currently done, administration in national schools is conducted by an NCES contractor, and state administration is conducted by state staff. In a departure from current procedures, recruitment is done by staff in states participating in state NAEP.[2] Rust and Shaffer (1997) suggested that the analyses for main NAEP be based on all participating schools (both national and state), although the designed precision could be obtained from national data.

[1] Recall that Rust and Johnson (1992:112) indicated that these geographical areas were used as a first stage of sampling to "make feasible the task of recruiting and training staff to administer the tests in a cost effective manner." Rust and Shaffer (1997) did not indicate why it is now possible to move away from the use of geographical areas as a first-stage sampling unit. Note that the use of these geographical areas as a first-stage sampling unit results in more sampling error than if schools were sampled at the first stage (ACT, 1997).

[2] Rust and Shaffer (1997) suggested that this change would enhance participation in main NAEP. However, they did not discuss how this enhanced participation, if it did exist, might affect the comparability of main NAEP scores between current main NAEP and main NAEP after the change in recruitment procedures was made. It seems, however, that who recruits schools is not really an integral part of their design, in that the design could be followed with the NCES contractor continuing to recruit schools. Clearly, this issue would require further study before a change in recruitment procedures is made.

State NAEP analyses are based on all participating schools in the state (both national and state) to meet state precision targets.

This design has some potentially significant benefits. Preliminary analyses conducted by Rust and Shaffer suggested that these procedures lead to an approximate 10 percent reduction in the sample size for state NAEP, compared to current procedures, which leads to significant cost savings. The precision of national statistics is comparable to current precision if the national data are used alone. The national statistics are more precise if the state and national data are pooled for main NAEP. Rust and Shaffer (1997:6-11) also discussed the benefits to recruitment from the "synergism in the recruitment process for state and national components" if states do all of the recruitment. As with the other designs that involve an integration of main and state NAEP data, a major issue concerning this design is that it requires a linking study to adjust state results for differences in state and national administration conditions. The gain in precision for main NAEP and the state precision targets likely could be achieved only if the random error and bias due to linking are negligible. In addition, this design requires considerable coordination of state and national NAEP.

The final proposed design, referred to here as Rust and Shaffer's Approach 3, dropped the requirement of Rust and Shaffer's Approach 2 that the target precision for the national statistics be attainable using only the national data. A major effect of dropping this requirement is to reduce the number of test administrations that are done by the NCES contractor. The stages provided earlier for Rust and Shaffer's Approach 2 would still be followed, except that Stage 2a would be replaced by the following:

Stage 2a. Among the selected schools in each state, designate a subset as national schools (with a minimum of two schools per state). Over all states, the results from just these schools do not result in precision comparable to current main NAEP. The number of national schools selected in this way is around one-half of the current number of national schools.

As in Rust and Shaffer's Approach 2, stratification and weighting procedures are used to ensure that the sample is representative and that the desired levels of precision are attained; procedures are used to deal with schools that are selected but decline to participate; administration in national schools is conducted by an NCES contractor, whereas state administration is conducted by state staff; and all recruitment is conducted by state staff. Unlike Rust and Shaffer's Approach 2, the analyses for main NAEP must be based on all participating schools (both national and state) to achieve target precision. Like Rust and Shaffer's Approach 2, state NAEP analyses are based on all participating schools in the state (both national and state) to meet state precision targets.
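A sketch of the school-designation step that Rust and Shaffer's Approaches 2 and 3 share appears below: within each state's school sample, a subset is flagged as national schools, with at least two per state and the remainder allocated roughly in proportion to each state's sample size. The allocation rule, sample sizes, and identifiers are invented; the actual designs rely on stratified probability sampling and explicit precision targets rather than this simple rule.

```python
import random

def designate_national_schools(state_school_samples, total_national, seed=2):
    """Within each state's sampled schools, flag a 'national' subset: a minimum
    of two schools per state, with the remainder allocated roughly in
    proportion to each state's sample size (a toy allocation rule)."""
    rng = random.Random(seed)
    total_sampled = sum(len(s) for s in state_school_samples.values())
    designation = {}
    for state, schools in state_school_samples.items():
        share = round(total_national * len(schools) / total_sampled)
        n_national = max(2, min(share, len(schools)))
        national = set(rng.sample(schools, n_national))
        designation[state] = {"national": sorted(national),
                              "state": [s for s in schools if s not in national]}
    return designation

# Hypothetical state samples of 100 schools each.
samples = {st: [f"{st}_school{i}" for i in range(100)] for st in ("IA", "NY", "DE")}
flags = designate_national_schools(samples, total_national=90)
print({st: len(v["national"]) for st, v in flags.items()})
```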

Preliminary analyses by Rust and Shaffer (1997:6-10) indicated that Rust and Shaffer's Approach 3 has all the potential benefits of Rust and Shaffer's Approach 2, with the addition that the sample size that requires administration by the NCES administration contractor is reduced and the overall sample size is reduced even further. However, these analyses also indicated that the benefits depend heavily on the degree of participation in state NAEP. In addition, this design requires use of the results of the linking study to achieve the desired precision for main NAEP. For these reasons Rust and Shaffer recommended further consideration of Rust and Shaffer's Approach 2 but not Rust and Shaffer's Approach 3, because the former design is "considerably more robust to the vagaries of the outcome of the state participation process." Rust and Shaffer (1997:6-25) concluded that Rust and Shaffer's Approach 2 should be considered because "this approach will lead to much more useful data at the national and regional levels. It will enhance participation in centrally administered schools. It will have little impact on cost. The approach is robust to the level of state participation in NAEP."

Discussion and Comparison of the Approaches

Spencer's Approach 1 and Rust and Shaffer's Approach 1 involve no changes in the sample designs. These approaches have the potential to increase precision. The additional costs associated with these approaches involve further analyses, which likely are small compared to the administrative costs. The major potential drawback of either approach is that it relies on there being little random error or bias when adjusting state NAEP results for operational differences between state NAEP and main NAEP. The sources of these operational differences and their degree of stability should be thoroughly understood before these approaches are used.

Spencer's Approach 2 continues to use geographical area as the first stage in a multistage sampling procedure, whereas Rust and Shaffer's Approach 2 eliminates this first stage. This elimination might cause some operational difficulties in that administration of main NAEP would occur in more diverse geographical areas. However, if this first stage is eliminated, fewer schools would need to be sampled for main NAEP, which is true whether or not the samples are combined (ACT, 1997). Thus, if the first stage can be eliminated, at least in this aspect, Rust and Shaffer's Approach 2 seems preferable to Spencer's Approach 2. However, it is unclear why the first stage can now be eliminated when it was deemed necessary in the past. This issue needs to be addressed before further consideration of Rust and Shaffer's Approach 2.

A major issue with both Spencer's Approach 2 and Rust and Shaffer's Approach 2 is that both rely heavily on there being little random error or bias in adjusting state NAEP results for operational differences between state NAEP and main NAEP. The sources of these operational differences and their degree of stability should be thoroughly understood before these approaches are used.

Forsyth et al. (1996) also indicated that it will be important to design the approaches so that last-minute withdrawals of states do not affect the main NAEP samples.

Given the problems that accrue from the need for the linking study, Forsyth et al. suggested it might be possible to design NAEP so that the same administration conditions are used for the main and state versions. In particular, they suggested using local administrators for main NAEP (as well as for state NAEP), with an increase in the monitoring and degree of training of the administrators. If this approach is considered, however, they suggest monitoring the effects of such a change on participation rates among schools selected for main NAEP in states not participating in state NAEP. In addition, such a significant change in main NAEP could affect the comparability of national statistics before and after the change is made.

Combining main and state NAEP sampling has the potential for a modest reduction in the number of schools involved in NAEP. However, much more work is needed to detail and evaluate the approaches before they are implemented. A significant problem in each approach arises from the operational differences between main and state NAEP, which cause complications that are potentially difficult to overcome. Unless the operational procedures for main NAEP and state NAEP can be made much more similar to one another, the potential complications caused by these approaches might lead to severe problems in combining NAEP samples.

CONCLUSIONS

Future plans are for state NAEP to be administered at approximately the same time as main NAEP and for the content of state NAEP to be a subset of the content of main NAEP. These plans suggest that there might now be a greater chance of combining main and state NAEP samples than in the past. However, current plans still result in significant administration differences between main and state NAEP. These differences currently are addressed by adjusting state NAEP scores. Even so, contradictory findings and complications are apparent, especially when making the criterion-referenced interpretations of NAEP scores that seem to be gaining prominence through the use of item maps, achievement levels, and now market-basket reporting (Forsyth et al., 1996; National Center for Education Statistics, 1996). The conditions that make the linking studies necessary create confusion when attempting to make criterion-referenced interpretations with state NAEP.

The administration differences also make implementation of any of the designs for combining main and state NAEP questionable. Much more needs to be known about the effects of the administration differences. A starting point for further investigation would be to address the following questions:

Question 1: To what extent are the linking constants equal across states? Differences among states in ability, participation rates, and recruitment procedures should be investigated as variables that might influence linking constants.

Question 2: How large is the random error component in estimating the linking constants?

Question 3: To what extent does bias or systematic error influence the linking constants?

Question 4: To what extent would results from state NAEP be affected if the administration and recruitment conditions for state NAEP were changed to be consistent with those for main NAEP?

Question 5: Do the differences in administration and recruitment conditions affect the constructs that are being measured by the NAEP assessments?

These questions should be thoroughly addressed before any design for combining the state and main NAEP samples is implemented under current recruitment and administration conditions. Note that even after conducting the extensive research that addressing these questions entails, the analyses presented in Spencer (1996) and Rust and Shaffer (1997) suggest that combining the samples for state and main NAEP would result in only a modest decrease in sample size.

Another approach is to use administration and recruitment procedures that are the same for main and state NAEP, such as those suggested by Forsyth et al. (1996). One possibility is to use the centralized administration and recruitment procedures currently used with main NAEP. Using these procedures for both main and state NAEP is optimal from the perspectives of combining samples, of having comparable results for the two assessments, of combining reporting and analyses, and of being able to compare main NAEP results from before and after changes were made in recruitment and administration procedures. Although these procedures might be prohibitive from a cost perspective, they should be thoroughly investigated.

Another possibility suggested by Forsyth et al. is to use the current state administration procedures for main NAEP, but possibly with more central oversight and standardization than is currently used with state NAEP. This type of change in recruitment and administration procedures would require a study to link main NAEP under the new administration conditions to main NAEP under the previous administration conditions. Conducting this study could be costly and difficult to implement.

If the issues regarding linking and administration conditions are addressed sufficiently, Spencer's Approach 2 and Rust and Shaffer's Approach 2 would be good places to start in developing a combined sampling plan. Spencer's Approach 2 might be preferable if the first-stage sampling is by geographical area. Rust and Shaffer's Approach 2 might be preferable if, from an operational perspective, this first stage is unnecessary.
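As one concrete starting point on Question 1, a state-specific mean adjustment could be estimated for each state from its own matched national subsample and the spread of those constants compared with their sampling error. The sketch below does this with simulated data; the data, the simple mean-difference form of the constant, and the approximate standard error are all simplifications of what an operational study would require.

```python
import numpy as np

def state_linking_constants(matched_pairs):
    """For each state, the mean difference between state NAEP scores and the
    matched main NAEP subsample (a toy stand-in for a state-specific linking
    constant), with an approximate standard error."""
    out = {}
    for state, (state_scores, main_scores) in matched_pairs.items():
        diff = np.mean(state_scores) - np.mean(main_scores)
        se = np.sqrt(np.var(state_scores, ddof=1) / len(state_scores)
                     + np.var(main_scores, ddof=1) / len(main_scores))
        out[state] = (diff, se)
    return out

# Simulated data: state "B" is given a larger administration effect (+6 points)
# than the others (+3) so the comparison has something to detect.
rng = np.random.default_rng(3)
pairs = {s: (rng.normal(270 + eff, 35, 1500), rng.normal(270, 35, 800))
         for s, eff in {"A": 3, "B": 6, "C": 3}.items()}
for state, (diff, se) in state_linking_constants(pairs).items():
    print(f"{state}: constant {diff:+.1f} (SE {se:.1f})")
```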

ACKNOWLEDGMENTS

The author thanks Karen Mitchell and two anonymous reviewers for comments on a draft of this paper.

REFERENCES

ACT
1997  ACT's NAEP Redesign Project: Assessment Design Is the Key to Useful and Stable Assessment Results. Final Report. Iowa City, Iowa: ACT.

Allen, N.L., and J. Mazzeo
1997  Technical Report of the NAEP 1996 State Assessment Program in Science. Washington, D.C.: National Center for Education Statistics.

Allen, N.L., D.L. Kline, and C.A. Zelenak
1996  The NAEP 1994 Technical Report. Washington, D.C.: National Center for Education Statistics.

Allen, N.L., F. Jenkins, E. Kulick, and C.A. Zelenak
1997  Technical Report of the NAEP 1996 State Assessment Program in Mathematics. Washington, D.C.: National Center for Education Statistics.

Ballator, N.
1996  The NAEP Guide, Revised Edition. Washington, D.C.: National Center for Education Statistics.

Forsyth, R., R. Hambleton, R. Linn, R. Mislevy, and W. Yen
1996  Design Feasibility Team Report to the National Assessment Governing Board. Washington, D.C.: National Assessment Governing Board.

Glaser, R., R. Linn, and G. Bohrnstedt
1997  Assessment in Transition: Monitoring the Nation's Educational Progress. Stanford, Calif.: National Academy of Education.

Hartka, E., and D.H. McLaughlin
1993  A Study of the Administration of the 1992 National Assessment of Educational Progress Trial State Assessment Program. Palo Alto, Calif.: American Institutes for Research.

Jones, L.V.
1996  A history of the National Assessment of Educational Progress and some questions about its future. Educational Researcher 25(7):15-22.

Koretz, D.M.
1991  State comparisons using NAEP: Large costs, disappointing benefits. Educational Researcher 20(3):19-21.

Mullis, I.V.S.
1997  Optimizing State NAEP: Issues and Possible Improvements. Paper commissioned by the NAEP Validity Studies Panel.

National Academy of Education
1993  The Trial State Assessment: Prospects and Realities. Stanford, Calif.: National Academy of Education.

National Assessment Governing Board (NAGB)
1996  Policy Statement on Redesigning the National Assessment of Educational Progress. Washington, D.C.: NAGB.
1997  Schedule for the National Assessment of Educational Progress. Washington, D.C.: NAGB.

National Center for Education Statistics
1993  Data Compendium for the NAEP 1992 Mathematics Assessment of the Nation and the States. Washington, D.C.: National Center for Education Statistics.

1996  An Operational Vision for NAEP-Year 2000 and Beyond. Washington, D.C.: National Center for Education Statistics.

Phillips, G.W.
1991  Benefits of state-by-state comparisons. Educational Researcher 20(3):17-19.

Reese, C.M., K.E. Miller, J. Mazzeo, and J.A. Dossey
1997  NAEP 1996 Mathematics Report Card for the Nation and the States. Washington, D.C.: National Center for Education Statistics.

Rust, K.F.
1996  Sampling Issues for Redesign. Memorandum to Mary Lyn Bourque, NAGB, May 9.

Rust, K.F., and E.G. Johnson
1992  Sampling and weighting in the national assessment. Journal of Educational Statistics 17(2):111-129.

Rust, K.F., and J.P. Shaffer
1997  Sampling. In NAEP Reconfigured: An Integrated Redesign of the National Assessment of Educational Progress, E.G. Johnson, S. Lazer, and C.Y. O'Sullivan, eds. Working Paper No. 97-31. Washington, D.C.: National Center for Education Statistics.

Spencer, B.
1996  Combining State and National NAEP. Paper prepared for the evaluation of state NAEP conducted by the National Academy of Education.

Yamamoto, K., and J. Mazzeo
1992  Item response theory linking in NAEP. Journal of Educational Statistics 17:155-173.

Next: 9 Difficulties Associated with Secondary Analysis of NAEP Data »

