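Keeping Score for All: The Effects of Inclusion and Accommodation Policies on Large-Scale Educational Assessments

5
Available Research on the Effects of Accommodations on Validity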

From the perspective of score interpretation, the purpose of testing accommodations is to reduce the dependence of test scores on factors that are irrelevant to the construct that is being assessed. As Haertel (2003, p. 11) noted:

Ideally, the accommodation would eliminate some particular impediment faced by a given examinee, so that the accommodated administration for that examinee was equivalent to a standard accommodation for a typical examinee in all other respects. The score earned by the accommodated examinee would then be interpreted as conveying the same information with respect to the intended construct as a score obtained under standard administration conditions.

Thus, in investigating the validity of inferences based on accommodated testing, there are two paramount questions: Do the accommodated and unaccommodated versions of the test measure the same construct? If so, are they equivalent in difficulty and precision? Evidence that the answers to both questions are yes constitutes support for considering the two versions equivalent.

In the past several years, there have been numerous calls for research into accommodations for students with disabilities and English language learners. The National Research Council (NRC) (1999a), for example, called for a research agenda that includes studies of “the need for particular types of accommodations and the adequacy and appropriateness of accommodations applied to various categories of students with disabilities and English-language learners” and “the validity of different types of accommodations” (National Research Council, 1999a, pp. 110-111). Participants at an NRC workshop on reporting test results for students with disabilities and English language learners
outlined a full agenda of research into the effects of accommodations (National Research Council, 2002a). Two different NRC committees that addressed the educational needs and concerns of students with disabilities (National Research Council, 1997a) and English language learners (National Research Council, 2000b) both recommended programs of research aimed at investigating the performance of these groups in large-scale standardized assessments.

A substantial amount of useful and interesting research is already available on the effects of accommodations on test performance, and several extensive reviews of this literature have been conducted. The effects of accommodations on test performance have been reviewed by Chiu and Pearson (1999); Tindal and Fuchs (2000); and Thompson et al. (2002). However, much of the existing research focuses on whether or not the accommodation had an effect on performance and, in some cases, on whether the effect was different for students with and without disabilities. Little of the available research directly addresses the validity of inferences made from the results of accommodated assessments, yet it is this second kind of research that could really assist policy makers and others in making decisions about accommodations.

In this chapter, we review the available research regarding accommodations and outline the current methods of conducting validity research. In Chapter 6 we present the committee’s view of the way the validity of inferences based on accommodated assessments can best be evaluated.

EFFECTS OF ACCOMMODATIONS AND THE INTERACTION HYPOTHESIS

The committee commissioned a review and critique of the available research on the effects of test accommodations on the performance of students with disabilities and English language learners in order to gauge both any discernible trends in this research and the thoroughness with which the issues have been studied. This review was conducted by Sireci et al. (2003). The authors were asked to review and critically evaluate the literature on test accommodations, focusing on empirical studies that examined the effects of accommodations on individuals’ test performance.

The authors began their review with the articles summarized in the NRC’s workshop report (National Research Council, 2002a), additional articles provided by NRC staff, and questions raised about the studies during the workshop. They supplemented the lists provided by searching two electronic databases, ERIC and PsycINFO, and the web sites of the Center for Research on Evaluation, Standards, and Student Testing (CRESST) and the National Center on Educational Outcomes (NCEO). They also contacted researchers whose work was frequently cited, sent them the list of citations already in hand, and asked them to forward any additional studies. The review covered studies conducted between 1990 and December 2002; the end date was set to ensure that the literature review would be ready in time for the committee’s first meeting.

The committee asked the authors to consider the research findings in light of the criterion commonly used for judging the validity of accommodations, often referred to as the “interaction hypothesis”: the assumption that effective test accommodations will improve test scores for the students who need the accommodation but not for the students who do not need the accommodation. As Shepard et al. (1998) explained it, if accommodations are working as intended, there should be an interaction between educational status (students with disabilities and students without disabilities) and accommodation conditions (accommodated and unaccommodated). The accommodation should improve the average score for the students for whom it was designed (students with disabilities or English language learners) but should have little or no effect on the average score for the others (students without disabilities or native English speakers). If an accommodation improves the performance of both groups, then offering it only to certain students (students with disabilities or English language learners) is unfair.

Figure 5-1 is a visual depiction of the 2 × 2 experimental design used to test for this interaction effect. An interaction effect would be said to exist if the mean score for examinees in group C were higher than the mean score for group A, and the mean scores for groups B and D were similar.

FIGURE 5-1 Depiction of the interaction hypothesis.

The use of this interaction hypothesis for judging the validity of scores from accommodated administrations has, however, been called into question. In particular, questions have been raised about whether the finding of score improvements for the students who ostensibly did not need accommodations (from cell B to cell D) should invalidate the accommodation (National Research Council, 2002a, pp. 74-75). For example, if both native English speakers and English language learners benefit from a plain-language accommodation, does that mean that the scores are not valid for English language learners who received this accommodation? There are also questions about whether the finding of score improvements associated with the use of an accommodation is sufficient to conclude that an accommodation results in valid scores for the experimental group. At the heart of this latter question is the issue of the comparability of inferences made about scores obtained under different conditions.

Sireci and his colleagues were given criteria for including a study in their review, specifically that the study should examine the effects of test accommodations on test performance and should involve empirical analyses. The authors found that while the literature on test accommodations is “vast and passionate,” with some authors arguing against accommodations on the grounds that they are unfair and others arguing in favor of them, only a subset of the literature explicitly addressed the effects of accommodations on performance using empirical analyses. The authors initially identified more than 150 studies that pertained to test accommodations; of these, however, only 46 actually focused on test accommodations and only 38 involved empirical analyses.

They classified the studies as experimental, quasi-experimental, or nonexperimental. A study was classified as using an experimental design if the test administration condition (accommodated or standard) was manipulated and examinees were randomly assigned to the condition. Studies were classified as quasi-experimental if the test administration condition was manipulated but examinees were not randomly assigned to conditions. Nonexperimental studies included ex post facto studies that compared the results of students who took a test with an accommodation with those of students who took a standard version of the test and studies that looked at differences across standard and accommodated administrations for the same (self-selected) group of students.
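
The 2 × 2 comparison behind the interaction hypothesis (Figure 5-1) can be illustrated with a short Python sketch. It is only an illustration, not the procedure used in any of the reviewed studies. The cell labels follow the description in the text (A and C are the targeted group under standard and accommodated conditions; B and D are the comparison group under the same two conditions); that mapping is inferred from the text rather than taken from the figure itself, and the function name and the scores are made up. A real study would also test whether the contrast is statistically distinguishable from zero.

```python
from statistics import mean


def interaction_contrast(cell_scores):
    """Return the gain for each group and the difference between the gains.

    cell_scores maps the cell labels described for Figure 5-1 to score lists
    (the mapping below is an assumption inferred from the chapter text):
      "A": targeted group, standard administration
      "B": comparison group, standard administration
      "C": targeted group, accommodated administration
      "D": comparison group, accommodated administration
    """
    means = {cell: mean(scores) for cell, scores in cell_scores.items()}
    target_gain = means["C"] - means["A"]        # gain for students who need the accommodation
    comparison_gain = means["D"] - means["B"]    # gain for students who do not
    return {
        "target_gain": target_gain,
        "comparison_gain": comparison_gain,
        "interaction": target_gain - comparison_gain,
    }


# Hypothetical scores, for illustration only.
example = {
    "A": [40, 45, 50],
    "B": [60, 65, 70],
    "C": [50, 55, 60],
    "D": [61, 66, 71],
}
result = interaction_contrast(example)
print(result)
```

In this made-up example the targeted group gains 10 points and the comparison group gains 1 point, so the contrast is 9. As the chapter goes on to argue, a positive contrast of this kind speaks to the effectiveness of an accommodation, not to whether accommodated and standard scores measure the same construct.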

Research on the Effects of Accommodations on the Test Performance of Students with Disabilities

With regard to students with disabilities, Sireci et al. (2003) found 26 studies that met their criteria for inclusion in the review. The disability most frequently studied was learning disabilities, while the two accommodations most frequently studied were extended time (12 studies) and oral presentation (22 studies). Table 5-1 lists the studies that used an experimental design and provides a brief description of the findings; Table 5-2 provides similar information for studies that used quasi-experimental or nonexperimental designs. The authors summarized the findings from these studies this way (p. 48):

One thing that is clear from our review is that there are no unequivocal conclusions that can be drawn regarding the effects, in general, of accommodations on students’ test performance. The literature is clear that accommodations and students are both heterogeneous. It is also clear that the interaction hypothesis, as it is typically described, is on shaky ground. Students without disabilities typically benefit from accommodations, particularly the accommodation of extended time.

Research on the Effects of Accommodations on the Test Performance of English Language Learners

With regard to research on the effects of accommodations on test performance for English language learners, Sireci et al. (2003) found only 12 studies that met their criteria for inclusion in the review. Table 5-3 provides a list of the studies included in their review; those that are listed as using either a between-group design or a within-group design were considered to be experimental studies. The most common accommodations studied were linguistic modification, provision of a dictionary or bilingual dictionary, provision of dual-language booklets, extended time, and oral administration. Most studies examined the effects of multiple accommodations.

Sireci et al. reported that research on the effects of linguistic modification has produced mixed results. For example, they cite a study by Abedi, Hofstetter et al. (2001) in which the authors claimed that this accommodation was the most effective in reducing the score gap between English language learners and native English speakers. However, Sireci et al. (2003, p. 65) point out that in this study, “the gap was narrowed because native English speakers scored worse on the linguistically modified test, not because the English language learners performed substantially better.” In addition, in a study by Abedi (2001a), significant, but small, gains were noted for eighth grade students but not for fourth grade students. Sireci et al. point out that Abedi explained this finding by hypothesizing that “With an increase in grade level, more complex language may interfere with content-based assessment” (p. 13) and “in earlier grades, language may not be as great a hurdle as it is in the later grades” (p. 14).

With regard to research on other accommodations provided to English language learners, Sireci et al. noted that providing English language learners with customized dictionaries or glossaries seemed to improve their performance (e.g., Abedi, Lord, Boscardin, and Miyoshi, 2000). The one study available on dual-language test booklets revealed no gains.

Overall Findings from the Literature Review

From their review of 38 studies that involved empirical analysis, Sireci et al. concluded that, in general, all student groups (students with disabilities, English language learners, and general education students) had score gains under accommodated conditions. While the literature review did not provide unequivocal support for interpreting accommodated scores as both valid and equivalent to unaccommodated scores, it did find that many accommodations had “positive, construct-valid effects for certain groups of students” (p. 68).

The reviewed studies focused on the issue of whether accommodations led to score increases, and whether the increases were greater for the targeted groups than for other test-takers. Evaluation of this interaction hypothesis has been central to much research on testing accommodations. Sireci et al., however, suggest a less stringent form of the hypothesis that stipulates that scores for targeted groups should improve more than scores of other test-takers. Although the results of investigating the interaction hypothesis (in either of its forms) are clearly useful in assessing the effectiveness of an accommodation, they cannot confirm that it yields valid score interpretations because they do not permit any determination of whether the accommodated and standard versions of the test are tapping the same constructs and whether they are equal in difficulty. Evidence that satisfies the interaction hypothesis criterion therefore does not constitute a sufficient justification for the use of an accommodation.

TABLE 5-1 Summary of Experimental Studies on Students with Disabilities

| Study | Characteristics of Sample | Accommodations | Design | Results | Interaction Detected (a) |
|---|---|---|---|---|---|
| McKevitt, Marquart, Mroch, Schulte, Elliott, and Kratochwill (2000) | Students with disabilities | Extra time, oral, encouragement, “packages” | Single-subject alternating treatment design | Greater gains for students with disabilities | Yes |
| Elliot, Kratochwill, and McKevitt (2001) | Students with disabilities | Encouragement, extra time, individual administration, various oral, spelling assistance, mark to maintain place, manipulatives | Single-subject alternating treatment design | Moderate to large improvement for students with disabilities | Yes |
| Runyan (1991) | Students with learning disabilities | Extra time | Between-groups design | Greater gains for students with disabilities | Yes |
| Zuriff (2000) | Students with learning disabilities | Extra time | Five different studies | Gains for both students with and without disabilities | No |
| Fuchs, Fuchs, Eaton, Hamlett, Binkley, and Crouch (2000) | Students with learning disabilities | Extra time, large print, student reads aloud | Between-groups design | Read aloud benefited students with learning disabilities but not others | Yes |
| Weston (2002) | Students with disabilities | Oral | Within- and between-groups design | Greater gains for students with disabilities | Yes |
| Tindal, Heath, Hollenbeck, Almond, and Harniss (1998) | Students with disabilities | Oral | Within- and between-groups design | Significant gains for students with disabilities only | Yes |
| Johnson (2000) | Students with disabilities | Oral | Between-groups design | Greater gains for students with disabilities | Partial |
| Kosciolek and Ysseldyke (2000) | Students with disabilities | Oral | Within- and between-groups design | No gains | No |
| Meloy, Deville, and Frisbie (2000) | Students with disabilities | Oral | Within- and between-groups design | Similar gains for students with and without disabilities | No |
| Brown and Augustine (2001) | Students with disabilities | Screen reading | Within- and between-groups design | No gains | No |
| Tindal, Anderson, Helwig, Miller, and Glasgow (1998) | Students with disabilities | Simplified English | Unclear | No gains | No |
| Fuchs, Fuchs, Eaton, Hamlett, and Karns (2000) | Students with learning disabilities | Calculators, extra time, reading aloud, transcription, teacher selected | Between-groups design | Differential benefit on constructed response items | Partial |
| Walz, Albus, Thompson, and Thurlow (2000) | Students with disabilities | Multiday/session | Within- and between-groups design | No gains for students with disabilities | No |

(a) As portrayed in Figure 5-1.

TABLE 5-2 Summary of Quasi-experimental and Nonexperimental Studies on Students with Disabilities

| Study | Characteristics of Sample | Accommodations | Design | Selected Findings |
|---|---|---|---|---|
| Cahalan, Mandinach, and Camara (2002) | Students with learning disabilities | Extended time | Ex post facto | Predictive validity was lower for learning disabled students, especially for males |
| Camara, Copeland, and Rothchild (1998) | Students with learning disabilities | Extended time | Ex post facto | Score gains for learning disabled retesters with extended time were three times greater than for standard retesters |
| Huesman and Frisbie (2000) | Students with disabilities | Extra time | Quasi-experimental | Score gains for students with learning disabilities but not for those without |
| Ziomeck and Andrews (1998) | Students with disabilities | Extra time | Ex post facto | Score gains for learning disabled retesters with extended time were three times greater than for standard retesters |
| Schulte, Elliot, and Kratochwill (2001) | Students with disabilities | Extra time, oral | Ex post facto | Students with disabilities improved more between unaccommodated and accommodated conditions (medium effect size; 0.40 to 0.80) than those without disabilities (small effect size; less than 0.40). No differences on constructed response items |
| Braun, Ragosta, and Kaplan (1986) | Students with disabilities | Various | Ex post facto | Predictive validity was similar across accommodated and unaccommodated tests; slightly lower for learning disabled |
| Koretz and Hamilton (2000) | Students with disabilities | Various | Ex post facto | Students with disabilities performed lower than those without and differences increased with grade level. No consistent relations found between test item formats for students with disabilities |
| Koretz and Hamilton (2001) | Students with disabilities | Various | Ex post facto | Accommodations narrowed gap more on constructed response items |
| Helwig and Tindal (2003) | Students with disabilities | Oral | Ex post facto | Teachers were not accurate in determining who would benefit from accommodation |
| McKevitt and Elliot (in press) | Students with disabilities | Oral | Ex post facto | No significant effect size differences between accommodated and unaccommodated conditions for either group |
| Johnson, Kimball, Brown, and Anderson (2001) | Students with disabilities | English, visual, and native language dictionaries, scribes, large print, Braille, oral | Ex post facto | Students with disabilities scored lower than those without. Accommodations did not result in an unfair advantage to special education students |
| Zurcher and Bryant (2001) | Students with disabilities | Not specific | Quasi-experimental | No significant gains |

TABLE 5-3 Summary of Quasi-experimental and Nonexperimental Studies on English Language Learners

| Study | Accommodations | Design | Results | Interaction Detected (a) |
|---|---|---|---|---|
| Abedi (2001b) | Simplified English, bilingual glossary, customized dictionary | Between-groups design | No effects at fourth grade. Small gain for simplified English in eighth grade | Only for eighth grade sample |
| Abedi, Hofstetter, Baker, and Lord (2001) (b) | Simplified English, glossary, extra time, extra time + glossary | Between-groups design | Extra time w/ and w/out glossary helped all students; simplified English narrowed the score gap between groups | No |
| Abedi and Lord (2001) | Simplified English | Between-groups design | Small, but insignificant gains | No |
| Abedi, Lord, Boscardin, and Miyoshi (2000) | English dictionary, English glosses, Spanish translation | Between-groups design | English language learner gains assoc. with dictionary; no gains for others | Yes |
| Abedi, Lord, and Hofstetter (1998) | Linguistic modification, Spanish translation | Between-groups design | Language modification helped all students improve scores; performance on translated version depended on language of instruction | No |
| Rivera and Stansfield (2001) | Linguistic modification of test | Between-groups design | No differences for non-English language learners | No |
| Albus, Bielinski, Thurlow, and Liu (2001) | Dictionary | Within- and between-groups design | No effect on validity; no significant overall gain for English language learners | No |
| Abedi, Courtney, Mirocha, Leon, and Goldberg (2001) | Dictionary, bilingual dictionary, linguistic modification of the test, extended time | Between-groups design | Gains for English language learners under dictionary conditions | Yes |
| Shepard, Taylor, and Betebenner (1998) | Various | Ex post facto | Gains for both English language learners and others | Partial |
| Anderson, Liu, Swierzbin, Thurlow, and Bielinski (2000) | Dual-language booklet | Within- and between-groups design | No gains | No |
| Garcia et al. (2000) | Dual-language booklet | Quasi-experimental | N/A | N/A |
| Castellon-Wellington (1999) | Extended time, oral | Quasi-experimental | No gains | No |
| Hafner (2001) | Extended time, oral directions | Quasi-experimental | Score gains for both English language learners and others | Unable to determine |

(a) As portrayed in Figure 5-1.
(b) Results from this study also appeared in another publication: Abedi, Lord, Hofstetter and Baker (2000). The 2000 publication included a confirmatory factor analysis to evaluate the structural equivalence of reading and math tests for English language learners and native English speakers. The correlation between reading and math was higher for English language learners.

As an illustration of the fact that the detection of an interaction is not evidence that the accommodated score is a more valid measure of the construct in question, consider the following example. Suppose that all students in a class take a spelling test in which they must write down words after hearing them read aloud. A week later, they take a second spelling test of equivalent difficulty. This time, test-takers are told that they can request a dictionary¹ to use during the test. Suppose that this accommodation is found to improve the scores of English language learners but not those of students who are native English speakers. Proponents of the interaction hypothesis would say that this finding justifies the use of the accommodation. In reality, however, nothing in these results supports the claim that the accommodated scores are more valid measures of spelling ability. In fact, logic suggests in this case that the accommodated version of the test measures a skill that is quite different from the intended one.
The fact that the accommodation affects English language learners and native English speakers differently may have any number of explanations. Native English speakers may have felt more reluctant to request a dictionary or been less likely to take the trouble to use one. Alternatively, they may have been close to their maximum performance on the first test and were not able to demonstrate substantial gains on the second test. Without some external evidence (such as an independent measure of spelling ability or, at least, of some type of verbal skill), no conclusion can be drawn about the validity of inferences from the accommodated scores relative to inferences from the scores obtained under standard administration. CURRENT VALIDITY RESEARCH How, then, can it be determined whether scores earned through accommodated and standard administrations are equivalent in meaning? Simply comparing 1   We recognize that use of a dictionary as an accommodation for a spelling test would typically not be allowed since it would interfere with measurement of the intended construct; however, we use this example here to demonstrate our point about the lack of logic associated with the interaction hypothesis as a criterion for validity.

OCR for page 85
Keeping Score for All: The Effects of Inclusion and Accommodation Policies on Large-Scale Educational Assessments score distributions for students who took the test under accommodated and standard conditions is clearly insufficient, since these two groups are drawn from different student populations. To compare the meanings of the two types of scores, it is necessary to obtain other information on the construct of interest by means of measures external to the test and to examine the relationship between the accommodated and unaccommodated scores and these external variables. In order for the two scores to be considered equivalent, the relationship between the test score and these criterion variables should be the same for both types of scores. In the area of admissions testing, in which there is some agreement on the appropriate criterion variable, some research of this kind has been conducted. At the recommendation of a National Research Council study panel, a four-year research program was undertaken during the 1980s under the sponsorship of Educational Testing Service, the College Board, and the Graduate Record Examinations Board (Willingham et al., 1988; see Zwick, 2002, pp. 99-100). The research program focused on issues involving candidates with disabilities who took the SAT or the (paper-and-pencil) GRE. The accuracy with which test scores could predict subsequent grades for students who tested with and without accommodations was investigated. The researchers concluded that in most cases the scores of test-takers who received accommodations were roughly comparable to scores obtained by nondisabled test-takers under standard conditions. The one major exception to this conclusion involved test-takers who were granted extended time. These students had typically been given up to 12 hours to complete the SAT and up to 6 hours for the GRE, compared with about 3 hours for the standard versions of these tests. 
In general, the students who received extended time were more likely to finish the test than were candidates at standard test administrations, but this finding in itself did not lead to the conclusion that time limits for students with disabilities were too liberal in general. For SAT-takers claiming to have learning disabilities, however, the researchers found that “the data most clearly suggested that providing longer amounts of time may raise scores beyond the level appropriate to compensate for the disability” (Willingham et al., 1988, p. 156). In particular, these students’ subsequent college grades were lower than their test scores predicted, and the greater the extended time, the greater the discrepancy. By contrast, the college performance of these students was consistent with their high school grades, suggesting that their SAT scores were inflated by excessively liberal time limits. Similar conclusions have been obtained in more recent SAT analyses (Cahalan et al., 2002), as well as studies of ACT and LSAT results for candidates with learning disabilities (Wightman, 1993; Ziomek and Andrews, 1996). Another study that made use of external data was an investigation by Weston (2002) of the validity of scores from an “oral accommodation” on a fourth grade math test based on the National Assessment of Educational Progress (NAEP). The accommodation consisted of having the test read aloud to students. The

OCR for page 85
Keeping Score for All: The Effects of Inclusion and Accommodation Policies on Large-Scale Educational Assessments sample included test-takers with and without disabilities. Each student took two matched forms of the test, one with the accommodation and one without. Weston collected external data in the form of teachers’ ratings of the students on 33 mathematical operations and teachers’ rankings of the students on overall math and reading ability. He hypothesized that “accommodated test scores will be more consonant with teachers’ ratings of student ability than non-accommodated tests” (p. 4). Weston concluded that there was some slight support for his hypothesis. While research that investigates relationships between assessment results and external criterion variables is valuable, it is important to note that in the context of K-12 assessment, there are few clearly useful criterion variables like the ones that can be compared with results from college entrances and certification tests. While teacher assessments and ratings, classroom grades, and other concurrent measures may be available, they are relatively unreliable. When weak relationships are found, it is difficult to know whether they indicate low levels of criterion validity or reflect the poor quality of the external criterion measures. Moreover, because prediction of future performance is not the purpose of NAEP or of state assessments, the evidence of validity of interpretations from these assessment results must be different in nature from that used for the SAT and similar tests. These issues are addressed in greater detail in Chapter 6. VALIDITY RESEARCH PLANNED FOR NAEP Staff of the National Center for Education Statistics provided the committee with an overview of studies currently in the planning stage that seek to answer questions about the validity of interpretations of results for accommodated administrations of NAEP. 
One of the planned studies involves examining the cognitive processes required for NAEP items through the use of a “think aloud” approach. NAEP-like items would be administered to small groups of students with disabilities and English language learners in a cognitive lab setting. Testing conditions would be systematically manipulated to study the effects of specific accommodations. This study is expected to provide information about the nature of the construct when students are assessed under accommodated and unaccommodated conditions.

A second study will examine the effects of providing extended time for responding to NAEP-like assessment items in reading and mathematics. Students with and without disabilities will be asked to respond to both multiple-choice and constructed-response items under standard and extended timing conditions. An alternate performance measure will also be administered to allow for investigation of criterion validity (this is an example of a criterion variable that is relatively reliable for the K-12 context).

A third study will focus on the effects of providing calculators as an accommodation. Currently NAEP does not permit the use of calculators in the portions of the mathematics assessments that evaluate computational skills. As part of this study, students with and without disabilities will take the mathematics assessment with and without the accommodation. Performance on both kinds of items (those assessing computation and those not assessing computation) will be compared for both accommodation conditions (with and without calculators) and both disability conditions. Data on an external criterion of mathematics skills (e.g., grades in mathematics courses) will also be collected so that criterion validity can be investigated.

SUMMARY AND RECOMMENDATIONS FOR FUTURE VALIDITY RESEARCH

Determining whether accommodated scores are more valid representations of students’ capabilities than unaccommodated ones requires that external data on the students’ capabilities be obtained. Some possible external measures or criteria are teacher ratings, self-ratings, grade-point averages, course grades, and scores on alternative tests such as state tests. Analyses can then be conducted to determine whether the association between the test score of interest and the criterion variables is the same for accommodated and unaccommodated versions of the test. A conclusion that the association is indeed the same supports the validity of inferences made from accommodated scores.

Like all validity research, this type of analysis is more complicated in practice than in principle. First, the identification of an appropriate criterion measure may not be straightforward. Because college admissions tests are intended to predict college grades, the criterion variable for the Cahalan et al. (2002) study was relatively clear-cut, but this will not be true in the majority of cases.
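As a rough illustration of the comparison of test-criterion associations described above, the following minimal sketch computes a Pearson correlation and then compares two correlations from independent groups using the Fisher r-to-z transformation. This is one common way to make such a comparison, not a method attributed to any of the studies cited here. The function names and the numbers in the usage example are hypothetical, and the sketch assumes two independent samples; it would not suit a design like Weston's, in which the same students take both forms, and it does nothing to address the confounding of disability status with accommodation status.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between test scores (xs) and a criterion (ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def compare_correlations(r1, n1, r2, n2):
    """Compare two correlations from independent samples.

    Uses the Fisher r-to-z transformation. Returns the z statistic and a
    two-sided p-value based on the standard normal distribution.
    Each sample must have more than 3 cases.
    """
    z1 = math.atanh(r1)  # Fisher transform of the first correlation
    z2 = math.atanh(r2)  # Fisher transform of the second correlation
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail probability
    return z, p


if __name__ == "__main__":
    # Hypothetical values for illustration only (not taken from any study):
    # test-criterion correlation of 0.60 in an unaccommodated group of 103
    # and 0.30 in an accommodated group of 103.
    z, p = compare_correlations(0.60, 103, 0.30, 103)
    print(f"z = {z:.2f}, two-sided p = {p:.4f}")
```

A small p-value here would suggest that the test-criterion association differs between the two groups. As the surrounding discussion makes clear, such a difference could also reflect an unreliable criterion or group differences unrelated to the accommodation itself.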
Moreover, as has been noted, suitable criterion variables are much less readily available in the K-12 context than in college admissions testing and other contexts, and those that are readily available are not very reliable.

Second, it may be difficult to obtain data on the criterion once it is identified. External data that might be useful, such as teacher ratings or grades, are hard to collect. Moreover, asking tested students to take a second assessment in order to provide a criterion measure is difficult in an environment in which most children are already spending significant amounts of their time being tested for various purposes. Obtaining external data is especially difficult for NAEP, in which individual participants are not ordinarily identified.

Third, except in experimental settings like those in Weston (2002) and the criterion validity studies proposed for NAEP, the determination of whether the test-criterion relationships are the same for accommodated and unaccommodated administrations is complicated by the confounding of disability or English language learner status and accommodation status. That is, in a natural setting, those who use accommodations are likely to be students with disabilities or English language learners, and they are likely to differ from other students on many dimensions. Sireci et al. (2003, p. 25) allude to one aspect of this point in their remarks on the Cahalan et al. (2002) study: they point out that students with and without disabilities are likely to differ in course-taking patterns, and that this disparity should be taken into account when comparing the degree to which accommodated and unaccommodated SAT scores predict college grade-point average. Moreover, the differences among students with disabilities and English language learners in course-taking, general courses of study, teacher assignments, the instructional methods they are likely to experience, and the like all compound the difficulty of obtaining usable criterion variables.

A final limitation of this type of validity assessment is that the accuracy of the criterion measure may differ across student groups, making it difficult to determine whether the test-criterion relationships are the same. For example, Willingham et al. (1988) and Cahalan et al. (2002) found that the correlations between admissions test scores and subsequent grade-point averages were smaller for candidates with disabilities. Willingham et al. found that the correlations between previous grades and subsequent grades were also smaller for students with disabilities. They speculated that one reason that grades were predicted more poorly for students with disabilities may be the exceptionally wide range in the quality of educational programs and grading standards for these students. These individuals may also be more likely than other students to experience difficulties in college or graduate school that affect their academic performance, such as inadequate support services or insufficient funds to support their studies.

A considerable amount of research into the effects of accommodations on test performance for students with disabilities and English language learners has been conducted to date.
However, this research fails to provide a systematic, comprehensive investigation into the central issue of the validity of interpretations of scores from accommodated versions of assessments. Numerous reviews of the research into the effects of accommodations on test performance (e.g., Chiu and Pearson, 1999; Tindal and Fuchs, 2000; Thompson et al., 2002; Sireci et al., 2003) make clear that the findings from existing research are inconclusive and insufficient for test developers and users of test data to make informed decisions about either the appropriateness of different accommodations or the validity of inferences based on scores from accommodated administrations of assessments.

The problems are twofold. First, taken as a whole, the body of research yields contradictory findings, and solid conclusions cannot be drawn from it. For example, Thompson et al. (2002, p. 11) reviewed seven studies in which extended time was the accommodation. In four of these, extended time had a “positive effect on scores”; in three, extended time had “no significant effect on scores.” Similarly, in nine studies they reviewed on the effects of allowing computer administration, four showed “positive effects on scores,” three showed “no significant effects,” and two showed that it “altered item comparability.” Diverse results such as these illustrate the difficulties facing policy makers who want to rely on the research in developing policy.

Second, in our view, research that examines the effects of accommodations in terms of gains or losses associated with taking the test with or without accommodations is not a means of evaluating the validity of inferences based on accommodated scores. Such research does not provide evidence that scores for students who take a test under standard conditions are comparable to scores for students who take a test under accommodated conditions or that similar interpretations can be based on results obtained under different conditions. Thus the committee concludes that:

CONCLUSION 5-1: Most of the existing research demonstrates that accommodations do affect test scores but that the nature of the effects varies by individual.

CONCLUSION 5-2: For the most part, existing research has investigated the effects of accommodations on test performance but is not informative about the validity of inferences based on scores from accommodated administrations. That is, existing research does not provide information about the extent to which inferences based on scores obtained from accommodated administrations are comparable to inferences based on scores obtained from unaccommodated administrations. Furthermore, the research does not provide definitive evidence about which accommodations would produce the most valid estimates of performance.

Based on these findings, the committee believes that a program of research is needed that would systematically investigate the extent to which scores obtained from accommodated and unaccommodated test administrations are comparable and support similar kinds of inferences about the performance of students with disabilities and English language learners on NAEP and other large-scale assessments.
RECOMMENDATION 5-1: Research should be conducted that focuses on the validation of inferences based on accommodated assessments of students with disabilities and English language learners. Further research should be guided by a conceptual argument about the way accommodations are intended to function and the inferences the test results are intended to support. This research should include a variety of approaches and types of evidence, such as analyses of test content, test-takers’ cognitive processes, and criterion-related evidence, as well as other studies deemed appropriate.

CONCLUSION

Available research has not adequately investigated the extent to which different accommodations for students with disabilities and English language learners may affect the validity of inferences based on scores from NAEP and other large-scale assessments. While research has shed some light on the ways accommodations function and on some aspects of their effects on test performance, in the committee’s view, a central component of validity has been missing from much of this research. Without a well-articulated validation argument that explicitly specifies the claims and intended inferences that underlie NAEP and other assessments, and that also explicitly specifies possible counterclaims and competing inferences, research into the effects of accommodations on assessments of students with disabilities and English language learners is likely to consist largely of a series of loosely connected studies that investigate various accommodations, more or less at random. An approach and a framework for articulating a logical validation argument for an assessment are discussed in the next chapter.