3
Challenges of Linking to NAEP

In recent years there has been increasing interest in linkage to the National Assessment of Educational Progress (NAEP). A survey of NAEP's constituents asked respondents to assess their state's willingness to pay for three different services: a state-level assessment using NAEP's current approach; linking NAEP results with the state's regular assessment; and the provision of extra "NAEP-like" assessments for states to use as they wish. Although states were unwilling to pay for most services, two-thirds said they would pay to develop linkages between their state assessments and NAEP (Levine et al., 1998). Several states have already begun partial linkages: three states compare one component of their assessment program to NAEP; two link at least one component of their state results to the NAEP results; and one links at least one component of its assessment to the NAEP scale (Bond et al., 1998). In addition, the Voluntary National Tests are being designed to be linked to NAEP.

The appeal of being able to link tests to NAEP is not hard to understand. Doing so would substantially enhance the utility of NAEP information. Currently, NAEP reports results at the national and state levels only and thus provides limited information about the quality of schooling within a state. By linking other results to NAEP, educators and policy makers would be able to compare student, school, and district results to statewide, regional, and national results and so would have a better understanding of how their students and schools are performing and what
the national results mean for local decision makers. Moreover, by using the NAEP achievement levels, parents, educators, and policy makers would have more information about how students perform in relation to national benchmarks for achievement. Linking state assessments to NAEP would also enable states to evaluate their own assessments against a national criterion and to compare assessments with those of other states, using NAEP as the common denominator. And a NAEP link would permit associations with other national databases, such as the Common Core of Data and the Schools and Staffing Survey, which would enhance the quality of information available about school factors associated with achievement (Wu et al., 1997).

While the prospect of linking state tests to NAEP has substantial appeal, some have raised the concern that doing so might undermine the quality of NAEP (e.g., Hill, 1998). The closer NAEP comes to local tests, the less it looks like an independent barometer of student achievement with low stakes for students or schools. Moreover, the possibility of linking state and commercial tests to NAEP poses serious challenges, even greater than those present in linking other tests. To understand these unique challenges, it is first necessary to understand the distinct character of NAEP.

Distinct Character of NAEP

NAEP is a periodically administered, federally sponsored survey of a nationally representative sample of U.S. students that assesses student achievement in key subjects. It combines the data from all test takers and uses the resulting aggregate information to monitor and report on the academic performance of U.S. children as a group, as well as by specific subgroups of the student population. NAEP was not designed to provide achievement information about individual students. Rather, NAEP reports the aggregate, or collective, performance of students, and it does so in two ways: scale scores and achievement levels. Scale scores provide information about the distribution of student achievement for groups and subgroups in terms of a continuous scale; achievement levels characterize student achievement as basic, proficient, or advanced, using ranges of performance established for each grade. The National Assessment Governing Board (NAGB), the body that governs the NAEP program, provides definitions for the three achievement levels. Student achievement that falls below the basic range is categorized as below basic (U.S. Department of Education, 1997).

NAEP uses matrix sampling to achieve two goals. First, each student answers a relatively small number of test questions, so that the testing task takes a relatively short time. Second, because different students are asked different sets of questions, the assessment covers a much larger array of questions than is given to any one student. By carefully balancing the sets of questions, called blocks, so that each block is taken by the same number of students, an equal number of students is presented with each item, making it possible to estimate the distribution of student scores by pooling data across test takers (Mislevy et al., 1992; Beaton and Gonzalez, 1995). The price paid for this flexibility is that these assessments cannot collect enough data from any single student to provide valid individual scores.

NAEP's structure is unique: each student in the NAEP national sample takes only one booklet that contains a few short blocks of NAEP items in a single subject area (generally three 15-minute or two 25-minute blocks), and no student's test booklet is fully representative of the entire NAEP assessment in that subject area. The scores for the item blocks a student takes are used to predict his or her performance on the entire assessment. Thus, the portion of NAEP any one student takes is unlikely to be comparable in content to the full knowledge domain covered by an individual test taker in a state or commercial test (see, e.g., U.S. Department of Education, 1997; National Research Council, 1996; Beaton and Gonzalez, 1995; U.S. Congress, Office of Technology Assessment, 1992).

These characteristics of NAEP greatly increase the difficulty of establishing valid and reliable links between commercial and state tests and NAEP. However, a matrix sampling design does not pose a permanently insuperable barrier to linking. One could design a linking experiment in which students in a nationally representative sample take a long form of NAEP containing, say, six blocks of test items rather than the typical two or three, as well as the test that is to be linked to NAEP. If a student is assessed with six blocks of NAEP items, his or her location on the NAEP scale and assignment to a NAEP achievement-level category can likely be estimated with reasonable precision. If the test to be equated to NAEP, and the conditions of its administration and use, are sufficiently similar to those of NAEP, a student's score on the linked test might then be used to estimate his or her NAEP achievement-level placement with acceptable precision.
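The precision argument can be illustrated with a toy simulation. This is only a sketch under simplified assumptions (normally distributed proficiency, independent additive item noise, no IRT machinery); the block size, noise level, and scale numbers are invented, not NAEP's.

```python
# Illustrative simulation: why matrix sampling supports precise group
# estimates but imprecise individual ones, and why a six-block "long
# form" helps.  All numbers are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n_students = 10_000
items_per_block = 12                           # hypothetical block size
true_theta = rng.normal(250, 35, n_students)   # NAEP-like proficiency scale

def observed_score(theta, n_blocks):
    """Noisy score estimate; item noise shrinks as blocks are added."""
    n_items = n_blocks * items_per_block
    noise_sd = 60 / np.sqrt(n_items)           # hypothetical per-item noise
    return theta + rng.normal(0, noise_sd, theta.shape)

short = observed_score(true_theta, 2)   # typical NAEP booklet
long_ = observed_score(true_theta, 6)   # "long form" linking design

# The group-level mean is precise even from the short form ...
print(round(short.mean(), 1), round(true_theta.mean(), 1))
# ... but the error in individual scores is much larger with 2 blocks
# than with 6:
print(round(np.std(short - true_theta), 1), round(np.std(long_ - true_theta), 1))
```

Pooling across students averages away the individual item noise, which is why the aggregate estimate is trustworthy while any one student's estimate from a short booklet is not.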

Linking to NAEP

A number of studies have attempted to link tests or test batteries to NAEP (see Table 2-1). To do so, they have administered several blocks of NAEP items to individual students as a substitute for a NAEP test and scored them using NAEP methods. In effect, the scores are weighted averages of the item scores, but the item weights depend on the characteristics of the item and on the pattern of item responses given by the individual. Recent proposals to create a NAEP test for use in linking (see, e.g., Yen, 1998) have not yet been realized.

Test results can be linked to NAEP scores at different levels of aggregation, and several studies have done so. However, when an assessment is only modestly related to the NAEP scale, links built to compare aggregate results, such as averages for states or school districts, differ from links designed for reporting individual results. The inevitable consequence of this difference is that the proportions of students with scores predicted to fall in each proficiency category depend on which link is used. Individual scores are generally linked to NAEP with the projection method, which is based on regression. Because regression pulls predictions toward the mean, a projection-based linkage for individuals could result in many fewer students with projected scores in the "advanced" range and also fewer in the "below basic" range. Although the regression approach makes statistical sense, it raises problems if the results are intended to be used for policy purposes.

Linking Large-Scale Assessments

Many studies have compared aggregate NAEP results with similar results from state-level assessments, from international assessments, or from other testing programs (see, e.g., National Center for Education Statistics, 1998; Pashley and Phillips, 1993; Linn and Kiplinger, 1995). These studies were designed to compare populations of students. For example, the Armed Services Vocational Aptitude Battery (ASVAB), the set of tests used for entrance into the U.S. Armed Services, has been compared with NAEP results in 12th-grade mathematics. A link between the ASVAB and NAEP proficiencies was obtained by projecting ASVAB scores onto the NAEP scale. The Armed Services could thus compare the achievement of their recruits with that of all U.S. students.

Studies comparing populations on NAEP and another assessment must be done with great care because the tests being linked seldom have the same format or content. The usual method of checking the validity of a link has been to compare the content and format closely and to compare the results of linking for different subgroups. The ASVAB mathematics tests do not cover some aspects of mathematics that are in the NAEP assessment, especially geometry. However, the ASVAB link was constructed by projecting to the NAEP scale from a combination of all 10 subtests of the ASVAB, and some of the mechanical comprehension items were judged to represent geometry concepts to some degree. Statistical checks were made by constructing separate links for low-scoring and high-scoring examinees; the two links produced very comparable results, with very small differences.1 However, doubt was cast on the validity of the link because it suggested a much lower standing of the military recruits on the NAEP scale than was indicated by norms for the ASVAB based on the full population of youth between 17 and 23 years of age. This large difference suggested to the researchers that the motivation of the recruits in the study was not high, probably because they knew that the test scores would not have any consequences for them (Bloxom et al., 1995).

The second International Assessment of Educational Progress (IAEP) in science and mathematics was linked to NAEP in order to compare international achievement in terms of NAEP proficiency standards. In fact, two different links were developed. First, the IAEP distribution of achievement for the United States was compared with the NAEP 1992 distribution and aligned using statistical moderation (Beaton and Gonzalez, 1993). Second, projection was used with a sample of students participating in the 1992 NAEP who also took the IAEP (Pashley and Phillips, 1993). The resulting projection of the IAEP scale to the NAEP scale could be checked by comparing actual NAEP results for the United States as a whole with predictions of NAEP results from the IAEP results of the U.S. sample that had taken the IAEP. The differences in the proportions of students achieving at basic or above, proficient or above, and advanced ranged from 0.01 to 0.03, corresponding to differences of 0.01 to 0.10 standard deviations on the NAEP scale. Given the differences in the assessments, the link was very close.

1. In terms of effect sizes, which statisticians use to describe the meaning of observed numerical differences (see, e.g., Mosteller, 1995), the ASVAB-NAEP link was very close: a difference of 0.02 standard deviations.
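The effect-size calculation used in the footnote is simple enough to reproduce. The sketch below uses made-up numbers, not the study's actual data.

```python
# Effect size: a difference expressed in standard-deviation units of the
# reporting scale (see Mosteller, 1995).  Values below are illustrative.
def effect_size(estimate_a, estimate_b, scale_sd):
    """Absolute difference between two estimates, in SD units."""
    return abs(estimate_a - estimate_b) / scale_sd

# A 0.6-point gap on a scale whose SD is 30 is an effect size of 0.02,
# the order of magnitude reported for the ASVAB-NAEP comparison.
print(round(effect_size(250.0, 250.6, 30.0), 3))   # prints 0.02
```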

The Third International Mathematics and Science Study (TIMSS) was linked to NAEP by statistical moderation, because there was insufficient funding for a same-group study (National Center for Education Statistics, 1998). The results could be checked using data from the state of Minnesota, which had participated as a unit in TIMSS, for 8th-grade mathematics and science. The NAEP performance of Minnesota students in mathematics, when linked with TIMSS, predicted that 6 percent of Minnesota students would place among the top 10 percent internationally; 7 percent actually scored at that level on TIMSS. Also, 62 percent were predicted by the link to score in the top half of the international sample; in fact, 57 percent did so. The NAEP-TIMSS link for the 8th-grade science assessment indicated that 16 percent of the students would be classified in the top 10 percent internationally, based on NAEP results; on the actual TIMSS, 20 percent of the students scored at that level. In science, NAEP results predicted that 69 percent of Minnesota students would be among the top half of the international distribution; 67 percent actually scored in the top half. These close matches are encouraging. Data from other states and other grades have not yet been reported.

Linking Commercial and State Tests

The earliest studies linking commercial and state assessments to the NAEP scale examined the efficacy of simple distribution-matching procedures. Ercikan (1997) reported statistical links of the 8th-grade mathematics scale of the California Achievement Test (CAT) in four states, obtained by comparing the test results with each state's student performance on the NAEP trial state assessment. The four states used slightly different versions of the CAT, all of which were calibrated to a common scale by the publisher. The four links to NAEP showed substantial differences. One way to express the differences is to note that the same score of 56 on the CAT scale, a score at about the average for 8th graders, would have translated to NAEP scale scores of 254, 260, 268, or 274, depending on the state.2 Another way to express the differences is to note that a score that appeared to be as good as that of 50 percent of the national student population, judging from the link observed in one state, would seem to be as good as that of 57 percent to 72 percent of the national student population, using the linkages from the other states. Links with this kind of discrepancy pose serious problems of interpretability.

2. The NAEP score scale has a standard deviation, within grade, of about 30 to 35, so the smallest difference is about 0.18 standard deviation, and the largest difference is about 0.6 standard deviation.

A distribution-matching method was also used by Linn and Kiplinger (1995) to link the NAEP results for 8th-grade mathematics with results on the Stanford Achievement Test in two states. In each state, the researchers compared results based on separate links for males and females. The results showed virtually no difference in the linked scales at the upper end of the score range, but at the lower end the difference was about 10 points on the NAEP scale, or about 0.3 standard deviation: the same low score (of 30) on the commercial test would translate into a considerably lower NAEP equivalent for a girl than for a boy. These authors also examined the consistency of state-NAEP links in four states across two years. They found differences of 0.0 to 0.15 standard deviation between 1990 and 1992. Such differences might be considered small in some contexts, but they are disturbing when assessing national student accomplishments.

These two studies show that distribution-matching methods are not well suited to this task. When a distribution-matching method is used to equate two tests, the tests should have the same format and content; the groups taking the two tests should be formed by random assignment from a single population; and the tests should be given at the same time under the same conditions. In neither study was the content of the state assessment a close match with the NAEP content, and in neither study did the distributions represent the same population.
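The distribution-matching (equipercentile) idea used in these studies can be sketched in a few lines. The score distributions below are simulated and the numbers are illustrative only.

```python
# Equipercentile linking sketch: a state-test score maps to the NAEP
# score holding the same percentile rank in two separately observed
# samples.  All data here are simulated.
import numpy as np

rng = np.random.default_rng(1)
state_scores = rng.normal(50, 10, 5000)    # hypothetical state-test sample
naep_scores = rng.normal(260, 33, 5000)    # hypothetical NAEP sample

def equipercentile_link(x, ref_from, ref_to):
    """Map score x from one scale to another via matched percentile ranks."""
    pct = (ref_from < x).mean() * 100      # percentile rank of x
    return float(np.percentile(ref_to, pct))

linked = equipercentile_link(56.0, state_scores, naep_scores)
# 'linked' holds roughly the same rank among the NAEP scores as 56 does
# among the state scores.  Note that the method says nothing about
# whether the two samples are comparable -- which is exactly where the
# studies above ran into trouble.
```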
In both cases, although the state assessment was intended to be given to every student, exceptions were sometimes made for students with physical or learning disabilities, and the testing was done at different times and in different ways.

A much more specialized kind of equipercentile equating was developed in the early 1990s between the Kentucky Instructional Results Information System (KIRIS) and the NAEP achievement-level categories (Kentucky Department of Education, 1995). The results showed a modest relationship between achievement on the two tests. Because the results from KIRIS are reported only on a four-category scale—students are labeled novice, apprentice, proficient, or distinguished—this linkage was unusual in that it did not translate numerical scores on the statewide scale into numerical scores on the NAEP scale. Instead, it sought cut-points
on the NAEP scale that gave results matching those obtained with KIRIS.3

The early, relatively disappointing results with distribution-matching methods led subsequent researchers to use projection when linking statewide assessments to the NAEP scale. Projection can include terms to account for possible variation in the relationship between the state assessment and the NAEP scale across subpopulations. Including such group-level terms implies a corresponding change in the goals of the linkage: a projection that uses different relationships between two tests for different subgroups or under different "conditions" (an aspect of the procedure often referred to as "conditioning") is designed to produce accurate predictions of aggregated score distributions; it is generally not used to produce individual scores.

The North Carolina Department of Public Instruction linked the mathematics portion of its comprehensive academic testing program with the NAEP scale (Williams et al., 1995). The state's end-of-grade tests were closely aligned with the state curriculum, but not with NAEP, so content alignment was not close. Nevertheless, a study was done linking the North Carolina 8th-grade end-of-grade test in mathematics with the NAEP 8th-grade mathematics assessment. A representative sample of students took a short form of the end-of-grade test and a short form of the NAEP mathematics assessment, consisting of two blocks of related items. There was no requirement that individual scores be linked; the main objective was inference about the score distribution for the state in years when NAEP was not administered. This linkage used projection, involving a statistical regression of NAEP scores on the state assessment scores. For the analysis, scores on the NAEP blocks were determined from the student responses, using the NAEP item parameters in the framework of item response theory, but without the elaborate conditioning analysis used by NAEP.
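In outline, projection is simply a regression used as a prediction rule. The sketch below is a generic illustration with simulated data, not the North Carolina model or its values; it also shows the tail-shrinkage side effect of regression noted earlier.

```python
# Projection sketch: regress NAEP scores on state scores in a linking
# sample, then use the fitted line to predict NAEP statistics from
# state scores.  Data are simulated.
import numpy as np

rng = np.random.default_rng(2)
n = 2000
state = rng.normal(150, 20, n)               # hypothetical state scale
naep = 100 + state + rng.normal(0, 15, n)    # correlated NAEP scores

slope, intercept = np.polyfit(state, naep, 1)   # least-squares fit
projected = intercept + slope * state           # projected NAEP scores

# Aggregates (e.g., the statewide mean) are recovered essentially
# exactly, but the projected scores are less dispersed than the actual
# NAEP scores, so a projection link places fewer students in the extreme
# ("advanced", "below basic") categories.
print(round(abs(projected.mean() - naep.mean()), 3))   # near 0
print(projected.std() < naep.std())                    # True
```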
The regular scores of these students on the North Carolina assessment were used. A satisfactory linking was obtained, which permitted statewide statistics on the NAEP scale (such as the mean or the quartiles usually reported) to be predicted with only modest statistical standard error bands. However, separate links for two ethnic groups showed differences of about 0.28 standard deviation. Some of the effect may be an artifact of the use of regression in developing the links. The practical difference would be between the link for the separate groups and the link for the combined group; taking those factors into consideration leaves a difference of at least 0.1 to 0.2 standard deviation on average, which could result in misleading interpretations of results.

3. If the linkage had been at the level of the test scores, the relatively modest relationship between NAEP and KIRIS achievement estimates would have given pause. It was possible to estimate correlations between the average scores for schools on the NAEP and KIRIS scales; for mathematics, those correlations were .74, .78, and .79 for grades 4, 8, and 12, respectively. Those correlations are considerably lower than those usually obtained between two equated tests.

Links for Individual Proficiencies

A recent study linked individual scores on assessments in four separate states to the 1996 NAEP mathematics assessment for both 4th and 8th grades (McLaughlin, 1998). Alignment of the content of the state assessments with the NAEP mathematics framework was better in some states than in others. In each case, scores of individual students in the NAEP sample in a state were compared with their scores on the state mathematics assessment. The objectives of each state assessment were somewhat narrower than the full NAEP framework. The results differ by state in detail but are similar in many respects. The link was slightly better for grade 8 than for grade 4. McLaughlin (1998:43) summarizes the results as follows:

    NAEP measurement error, which has a standard deviation of 9 or 10, except in State #1 (where it is 12.5), is not affected by the accuracy of the linkage. However, the prediction error, which is attributable to the linkage, has a standard deviation that ranges from 16 to 20. The sum of the two sources of error variance yield an estimate of the expected error in individually predicted NAEP plausible values, a standard deviation of 18 to 22 points on the NAEP scale. Thus, a 95 percent confidence interval would range across more than 70 points on the NAEP scale. Clearly the linkage does not support reporting individual NAEP scores based on state assessment information, no matter how reliable the state assessment.

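The error arithmetic in the quoted passage can be checked directly: independent error variances add, and a 95 percent interval spans about plus or minus 1.96 standard deviations. The midpoint values below are taken from the quote.

```python
# Combining NAEP measurement error with linkage prediction error,
# using roughly the SD values quoted from McLaughlin (1998).
import math

measurement_sd = 10   # NAEP measurement error SD (9 to 12.5 in the study)
prediction_sd = 18    # linkage prediction error SD (16 to 20 in the study)

total_sd = math.sqrt(measurement_sd**2 + prediction_sd**2)
ci_width = 2 * 1.96 * total_sd   # width of a 95 percent confidence interval

print(round(total_sd, 1))   # 20.6, inside the quoted 18-to-22 range
print(round(ci_width))      # 81: indeed "more than 70 points"
```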
The negative result is not unexpected. Using actual NAEP results for individuals is a slim reed because each person takes such a short test. When the unreliability of NAEP results for individuals is coupled with differences between NAEP and state assessments and differences in administration conditions, the resulting linkages cannot be expected to be precise. These threats to linkage cannot be overcome simply by using the same test takers on both assessments.4

4. It is important to keep the magnitude of this problem in the proper perspective, i.e., by comparing the degree of spread of the confidence interval after the linkage with the size of the confidence interval before the linkage. In McLaughlin's study this comparison was not possible. The Pashley and Phillips (1993) study comes closer to permitting this kind of comparison.

Challenges to the Validity of Linkage to NAEP

As discussed in Chapter 2, a number of features of tests, as well as the purposes and conditions of their use, influence the likelihood that valid links can be established. These features apply with equal force to the linking of any test to NAEP and to the interpretation of a student's scores on any test, including the proposed Voluntary National Tests, in terms of the NAEP scale or the NAEP achievement levels. The unique character of NAEP, which makes it unlike most state and commercial tests in design and implementation, poses significant challenges to linkage.

Unique Characteristics of NAEP

Content Coverage

NAEP's distinctive characteristics present special challenges of content comparability with other tests (see, e.g., Kenney and Silver, 1997). NAEP content is determined through a rigorous and lengthy consensus process that culminates in "frameworks" deemed relevant to NAEP's principal goal of monitoring aggregate student performance for the nation as a whole. NAEP content is not supposed to reflect particular state or local curricular goals, but rather a broad national consensus on what is or should be taught; by design, its content differs from that of many state assessments (Campbell et al., 1994).

Item Format Distribution

Like its content, the format of the NAEP assessment is derived through a national consensus process and is unlikely to match precisely the format of particular state or commercial tests. The proposed format distribution of the Voluntary National Tests, for example, is 80 percent multiple-choice items and 20 percent constructed-response items, compared with the approximately 50-50 distribution of NAEP items across these format categories. The mix of item formats on a test makes an enormous difference in
relative performance, particularly with the use of achievement levels. As Linn et al. (1992) have shown, the distribution of students assigned to the NAEP achievement levels varies dramatically when those who establish the levels consider selected-response (e.g., multiple-choice) versus extended-response test items. In the most extreme case, for one subject in the 1990 NAEP Trial State Assessment, 78 percent of examinees would have been placed in the "basic" achievement level or higher on the basis of the selected-response NAEP items, while only 3 percent would have been so placed on the basis of the extended-response NAEP items. Although corresponding differences were less extreme for other NAEP assessments, they were substantial. These findings suggest that the problem of congruence between the abilities and knowledge suggested by students' performance on the items of a linked test and the abilities and knowledge described by the NAEP achievement levels will be exacerbated to the extent that the item format distribution of a linked test differs markedly from that of NAEP.

Test Administration

As a national survey designed to monitor overall educational performance, NAEP is administered differently than many state and commercial tests. As a result, the conditions of test administration—from the time of year in which the test is given to who administers it (the classroom teacher or an external administrator)—are likely to differ between state tests and NAEP. Such differences in test administration can affect test results and so could affect any links between NAEP and other tests.

Test Use

Because it does not currently produce individual student scores, NAEP is the prime example of a low-stakes test—one on which few consequences are associated with performance. As a result, teachers have little incentive to prepare students to perform well on NAEP, and students have little motivation to perform at their best (see, e.g., O'Neil et al., 1992; Kiplinger and Linn, 1996). State tests, in contrast, often have serious consequences associated with the results, and teachers and students place great emphasis on improving performance. This difference
could significantly threaten the quality of a link between a state test and NAEP. Moreover, the difference between the stakes attached to state tests and to NAEP threatens the stability and robustness of linkages over time. Since state curriculum frameworks, adopted state curricula, and accountability pressures are likely to encourage teachers to teach to the state accountability test (or insist that they do so), and no such pressures exist for NAEP, a linkage between a high-stakes test and NAEP is unlikely to be robust over time. Students' improvements over time on a high-stakes statewide test are unlikely to be mirrored by commensurate gains on NAEP unless the statewide accountability test is congruent with its counterpart NAEP assessment in content, format, and skill demands.

Linking to NAEP Achievement Levels

In addition to the challenges for linking posed by the unique features of NAEP's design, the use of NAEP achievement levels as a way of reporting results from a linked test poses further challenges. The NAEP achievement levels have been the subject of substantial discussion and controversy (Stufflebeam et al., 1991; Linn et al., 1991; Shepard et al., 1993; U.S. General Accounting Office, 1993; Cizek, 1993; Kane, 1993; National Research Council, 1999b). We do not engage that issue here, but we note that linking tests to NAEP for the purpose of categorizing individual students with respect to the NAEP achievement levels is a use that has not yet been considered: it gives rise to novel opportunities as well as novel problems.

The NAEP achievement levels carry labels—"basic," "proficient," and "advanced"—as well as paragraphs that describe the knowledge, skills, and abilities of students whose NAEP performance warrants assignment to those levels. These paragraphs, called achievement-level descriptors, represent subsets of the achievement domain that a NAEP assessment measures, which have presumably been mastered by students classified into the named level. Using a linkage to place individual students into NAEP achievement levels would provide an opportunity to examine the achievement-level descriptors and could yield evidence of their validity or invalidity. Because the classification of individual students into NAEP achievement levels has not been done, there has been no public opportunity to compare the performances of individual students
on NAEP items with the descriptions of their knowledge, skills, and abilities provided by the NAEP achievement levels. A public report of the detailed results of a linking study would provide such an opportunity. In addition, the validity of placing individual students into NAEP achievement-level categories on the basis of their performance on a linked test could also be assessed. Again, a public report of the detailed results of a linking study would permit comparison of individual students' performance on the linked test with the descriptions of their knowledge, skills, and abilities provided by the NAEP achievement levels. Such comparisons would provide direct evidence of the validity of important inferences that the linkage would claim to support. We do not know of any plans for such a study at this time.