Reporting Test Results for Students with Disabilities and English-Language Learners: Summary of a Workshop

7
Summing Up: Synthesis of Issues and Directions for Future Study

The daylong workshop concluded with a panel of discussants. This panel summarized and synthesized the ideas presented by previous speakers and highlighted concerns and directions for future study. Their remarks are summarized in this chapter.

TESTING IS A BENEFIT

The discussants underscored one issue that permeated the day’s discussions—if students with special needs are not included in assessments, states are, in effect, excused from being accountable for their performance. Further, if scores for accommodated examinees are not reported or included in aggregate reports, there is no incentive to care about those students’ test performance. Eugene Johnson, chief psychometrician with the American Institutes for Research, reiterated Arthur Coleman’s point that, in the eyes of the law, testing is a benefit for the tested children. Thus, states and other testing programs are obligated to ensure that all students have access to the test or an equivalent alternative, particularly in high-stakes situations.

ADAPTING TEST DESIGN TO TEST PURPOSE

Discussants also returned to another key point made by Coleman—the importance of clearly articulating both the purpose of any given assessment and the constructs being measured. Testing programs have a responsibility to ensure that accommodations provide access to the targeted constructs while also preserving them—this requires a clear understanding of what the assessment is measuring. The quandary for testing programs is how to change the way the construct is assessed without changing the meaning of the scores. This task could be simplified somewhat if test developers were clearer about what tests are designed to measure. Stephen Elliott introduced the notion of access skills and target skills1 and encouraged test publishers to be clearer about the target skills their tests are meant to assess.

Several of the discussants and presenters called for better test design. Johnson urged consideration of ways to construct tests from the outset to minimize the effects of and the need for accommodations. For instance, much of Jamal Abedi’s work has demonstrated that language simplification and use of tailored glossaries help English-language learners as well as general education students. Perhaps test developers could use simplified language from the outset in writing items and could provide glossaries for words whose definitions do not reveal answers to test questions.

Richard Durán, professor at the University of California, Santa Barbara, advised that when writing test items, test developers should keep in mind the underlying purpose of the test. If understanding text written in the passive voice is not one of the targeted skills a test is designed to measure, items should be written in the more familiar active voice. Test developers should be sensitive to vocabulary usage and avoid unfamiliar words that are not related to the construct being measured. Several discussants urged exploration of the ways technology can be used to eliminate barriers to the measurement of a target skill.
1 Target skills are measured by the assessment. Access skills are the skills needed to demonstrate performance on the target skills.

VARIABILITY IN STATES’ POLICIES

Another of the discussants’ observations was that while every state is including students with special needs and allowing some type of accommodations, there are wide disparities in states’ policies. State policies vary with respect to what accommodations are acceptable, who should receive them, how they should be implemented, whose scores should be included in score reports, and how scores should be reported. Some states also apply different accommodation and reporting policies to different state tests, and some allow accommodations that exceed those permitted by NAEP. Furthermore, the decision about what accommodations are acceptable seems to be based largely on intuition, in part because of a slim research base. The implications of this variability are discussed below.

Variability in Policies Complicates Comparisons of Aggregated Results

Margaret Goertz, co-director of the Consortium for Policy Research in Education, stressed that standardization in policies is particularly important if policy makers want to compare student assessment results across states or between states and NAEP. Because states use different assessments and often test students at different grade levels, the only way to compare student performance across states is through the state NAEP program. However, the inferences that can be based on such comparisons are limited when states have different accommodation and inclusion policies.

At present, such comparisons carry relatively low stakes for states. However, ranking in the bottom of the group may put public pressure on policy makers and educators to change instructional practice. For example, in California, low rankings led to public pressure to replace “whole language” with phonics-based reading instruction. But states do not receive rewards or suffer sanctions if they perform above or below one another. Goertz speculated that different types of comparisons will be required under the recently passed legislation in which NAEP is expected to be used as a benchmark for comparisons with the outcomes of state assessments. For states, such comparisons are likely to be associated with higher-stakes decisions.
It is possible that two types of comparisons could be made: (1) the percentage of students scoring the equivalent of “basic” or “proficient” under state standards compared to those students scoring “basic” or “proficient” on NAEP; and (2) changes over time in the percentage of students scoring in those categories on state assessments and NAEP. In either case, differences in accommodation and reporting policies between the state program and NAEP become more important. If a state’s accommodation and reporting policies are more liberal, it could include more special needs (and potentially lower-scoring) students in its assessment than NAEP. The analyses conducted by John Mazzeo and his colleagues with the 1998 and 2000 assessments demonstrated that when inclusion rates were higher, mean performance was lower. Thus, it is not clear what conclusions can be drawn about the findings from such comparisons.

Variability in Policies Complicates Comparisons of Disaggregated Results

Currently, NAEP does not report disaggregated data for special needs students. However, because states are required to report disaggregated results for their own assessments, workshop participants contemplated what might happen if NAEP were to adopt a similar reporting policy. They pointed out that if comparisons are to be made between NAEP and state assessment results, the lack of alignment between the accommodations and reporting policies of NAEP and of the states will become even more critical.

Students with disabilities and English-language learners are defined differently by different states. Durán questioned whether it would be reasonable to attempt to compare the performance of the two groups of students on statewide achievement tests and on NAEP. For English-language learners, in particular, Durán finds that such comparisons may be confounded by the differences in the way they are included in state assessments and in NAEP. He noted that English-language learners participating in NAEP are a heterogeneous mixture of non-English background students across states. One upshot of this heterogeneity is that the data will not be comparable across states because different student populations are involved.

Variability in Implementing Policy

Another source of variability is in the way state policies are implemented. David Malouf, educational research analyst with the Office of Special Education Programs at the Department of Education, pointed out that decision making about which students receive which accommodations is primarily the responsibility of the IEP team, which has considerable flexibility in selecting accommodations needed to enable a child with a disability to participate.
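Mazzeo's finding that higher inclusion rates went with lower mean performance, and the related comparability concerns above, follow from the arithmetic of a weighted average: the aggregate score is a weighted mean of group means, so including more of a lower-scoring subgroup pulls it down even when no group's performance changes. The sketch below uses invented enrollment counts and scale scores; no real NAEP data are implied.

```python
# Hypothetical illustration: the aggregate score is a weighted average of
# group means, so raising the inclusion rate of a lower-scoring subgroup
# lowers the reported mean even though neither group's performance changed.

def aggregate_mean(groups):
    """Weighted mean over (n_students, mean_score) pairs."""
    total_n = sum(n for n, _ in groups)
    return sum(n * mean for n, mean in groups) / total_n

# Invented numbers: 900 general-education students with mean 250, and a
# subgroup of 100 special-needs students with mean 220.
low_inclusion = [(900, 250.0), (50, 220.0)]    # only 50 of 100 included
high_inclusion = [(900, 250.0), (100, 220.0)]  # all 100 included

print(aggregate_mean(low_inclusion))   # higher reported mean
print(aggregate_mean(high_inclusion))  # lower reported mean
```

The same arithmetic is why trend lines are hard to interpret when inclusion policies change between administrations.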
Malouf finds that IEP teams are frequently not well informed about the consequences of their decisions. Based on the day’s discussions, he believes that IEP team decisions are clearly suspect. This is an important consideration for NAEP because NAEP accommodations are influenced by accommodations called for in the IEP. In addition, Durán noted that states often comply “in word” with federal policies regarding maximizing participation of English-language learners in state assessments, but the way states proceed with identifying students and administering accommodations can vary greatly and has implications for interpretation of state assessment results and NAEP results.

Changes in States’ Policies Complicate Interpretation of Trends

Goertz discussed the impact of changes in policy, practice, and demographics on reported results for accommodated students and on tracking student performance over time. She described four important sources of change identified by speakers: student demographics; how students with disabilities and English-language learners are served; state assessment policy on who is tested in what areas and with what kinds of tests; and state accommodation and reporting policies. Work by Thurlow (2001a), Rivera et al. (2000), and Golden and Sacks (2001) demonstrates how states are constantly refining their assessment, accommodation, and reporting policies—generally to make them more inclusive. Thus, changes in student scores, especially if scores are disaggregated for students with disabilities and English-language learners, could reflect which students are included in the assessment or in the reporting category at any given point in time, as well as measurable changes in student achievement.

EVALUATING THE VALIDITY OF ACCOMMODATIONS

As Peggy Carr, associate commissioner for assessment at the National Center for Education Statistics, asked, do accommodations level the playing field for students who receive them, or do they provide an advantage? As described in Chapter 6, this question is often evaluated by testing for the presence of the interaction effect2 discussed earlier (see Figure 6–1). Malouf and Johnson questioned the usefulness of the interaction effect as the basis for judging the validity of scores from accommodated conditions.
2 That is, the performance of students in a target population (e.g., students with disabilities) is compared with and without accommodations, and a similar comparison is made for the general student population. If the accommodation boosts the performance of the students in the target population but not that of the general population, the accommodation is regarded as valid—that is, the inference can be made that the accommodation compensates for the students’ specific weakness (e.g., disability or lack of English proficiency) but does not alter the construct being measured.

Johnson expressed concern about confounding between the construct being measured and the accommodation. That is, performance on the construct may rely on skills that are not the intended focus of the assessment. Accommodations may assist examinees with these skills and thus help general education students as well as those with identified special needs. Malouf echoed this, noting that while experimental researchers are increasingly using the interaction criterion, it requires further discussion. He called for psychometricians and others with expertise in large-scale assessment to further examine the utility and integrity of the interaction concept in the context of both statewide assessments and NAEP.

Durán voiced similar concerns, urging the educational measurement field to reconsider its notion of what constitutes an “inappropriate” or “invalid” accommodation. He asked, “Can we turn fear about how an assessment accommodation might distort measurement of proficiency on the targeted construct into figuring out how accommodations help measure examinees’ maximum proficiency on the construct?” Durán finds that popular views of acceptable accommodations often result from confusion about what is being measured. As an example, Durán offered psychometricians’ general disapproval of extended time as an acceptable accommodation. He argued that if speed is not a target skill and extended time leads to better performance for some students, there should be no problem with lengthening the time to complete the test (aside from the possible administrative burden). If the desire is to measure “speediness” in information processing, it should have been built into the definition of the targeted construct.
He maintained that the finding that additional time increases the performance of general education students, as well as those with special needs, is not an issue as long as an assessment is not intended to be speeded. He encouraged the adoption of the concept of “construct-enabling” resources, that is, permitting resources that allow for better assessment of the targeted construct.

Durán cautioned, however, that building speediness into the definition of a construct could pose additional problems. For example, he noted that it is well known in the field of cognitive studies of bilingualism that individuals perform problem-solving tasks more slowly in a second language. Cognitive cross-cultural research has shown that speediness in performing problem-solving tasks is affected by culturally based socialization processes affecting how fast problem solvers approach tasks. Thus, identifying speediness as a key aspect of a content-related construct could prove problematic.
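The interaction criterion debated above reduces, at its simplest, to a difference-in-differences on group means: the accommodation's gain for the target group minus its gain for the general population. The sketch below uses invented scores; a real study would also test the interaction for statistical significance (e.g., with a two-way ANOVA) rather than inspect raw means.

```python
# Sketch of the interaction criterion for accommodation validity (all
# scores invented). An accommodation is treated as valid when it boosts
# the target group's performance but not the general population's, i.e.
# when the difference-in-differences (the interaction) is positive.

def mean(xs):
    return sum(xs) / len(xs)

def interaction_effect(target_accom, target_standard,
                       general_accom, general_standard):
    """Gain for the target group minus gain for the general group."""
    target_gain = mean(target_accom) - mean(target_standard)
    general_gain = mean(general_accom) - mean(general_standard)
    return target_gain - general_gain

# Hypothetical extended-time study: the accommodation helps students with
# disabilities but leaves general-education scores essentially flat.
effect = interaction_effect(
    target_accom=[52, 58, 61, 55],
    target_standard=[44, 47, 50, 45],
    general_accom=[70, 72, 69, 71],
    general_standard=[70, 71, 69, 70],
)
print(effect)  # positive, consistent with a valid accommodation
```

Johnson's confounding concern maps directly onto this sketch: if the accommodation also raises general-population scores, the two gains cancel and the interaction shrinks, even when the accommodation genuinely helps the target group.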

RESEARCH NEEDS

All of the discussants noted that although much research has been conducted on the effects of specific accommodations, many questions remain unanswered. The findings from various studies contradict each other and do not assist practitioners and policy makers in determining “what works.” The discussants called for more research, particularly studies that utilize the within-subject randomized design described by Elliott and Gerald Tindal, in which each student serves as his or her own control, and small-scale experiments, particularly at the state level. In addition, each called for certain types of studies, as described below.

Research Should Use Refined Categories

Malouf pointed out that in most of the research discussed at the workshop, the target population was defined on the basis of a broadly defined category—disabled versus non-disabled, English-language learners versus native-English speakers, learning disabled versus non-learning disabled, and so on. Malouf thinks that these broad categories should be replaced by specific student characteristics—reading disabled, native Spanish speaker, and so on. He believes this would help in several regards. For one, IEP teams should not base their accommodation decisions on categories of disability, but instead on individual factors. Hence, research will be more useful if it focuses on the types of characteristics that IEP teams should consider. In addition, categorical labels are very gross descriptors, and there can be substantial within-category variation that mediates the effects of an accommodation, making the effects difficult to detect.
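The within-subject randomized design mentioned above, in which each student serves as his or her own control, amounts to analyzing per-student score differences between conditions rather than comparing group means. A minimal sketch with hypothetical scores follows; a real study would counterbalance the order of conditions, use parallel test forms, and check distributional assumptions before relying on the paired t statistic.

```python
# Minimal sketch of a within-subject accommodation study (invented data):
# each student is tested both with and without the accommodation, so the
# analysis looks at per-student gains rather than between-group means.

from statistics import mean, stdev

def paired_analysis(with_accom, without_accom):
    """Mean per-student gain plus a paired t statistic for that gain."""
    diffs = [a - b for a, b in zip(with_accom, without_accom)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / n ** 0.5)
    return mean(diffs), t

# Hypothetical scores for eight students under the two conditions; in a
# real study the condition order would be counterbalanced across students.
with_accom =    [55, 61, 48, 70, 66, 59, 52, 63]
without_accom = [50, 57, 47, 66, 60, 55, 50, 58]

mean_gain, t_stat = paired_analysis(with_accom, without_accom)
print(mean_gain, t_stat)
```

Because each student's own baseline absorbs the within-category variation Malouf describes, this design can detect accommodation effects that broad between-group comparisons would wash out.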
Understanding the Meaning of Aggregated Results

Johnson contemplated the meaning of test reports that combine data for accommodated and nonaccommodated test takers, given the current state of research on the comparability of results from different administrative conditions. He noted that some states are adjusting scores for accommodations by dropping the accommodated student two grade levels. He questioned whether this was a wise procedure or if some other adjustment procedure would be warranted, noting that either way experimentation is needed to decide how to combine the accommodated and nonaccommodated data. Further research is needed on the comparability of the results of various accommodations to the nonaccommodated results and on the comparability of the results of various accommodations to each other. Johnson suggested that it would be valuable to match the comparisons to actual state practices for measuring average yearly progress (for example, Oregon includes English-language learners in its aggregates, while South Dakota excludes them). Such analyses should involve experimenting with the effects of various reporting and exclusion strategies.

Conducting Research Through Cognitive Laboratories

Johnson and Durán encouraged use of cognitive laboratories as a means for determining whether lack of access skills impedes measurement of target skills. In cognitive laboratories, students work one-on-one with an administrator and answer test questions by thinking out loud. The administrator observes and records the thought process students use in arriving at their answers. Cognitive labs would allow researchers to compare how students with various disabilities react to the questions under different accommodations and to study further what constitutes appropriate accommodations.

Further Research on the Performance of English-Language Learners

Durán commented that better understanding of the achievement of English-language learners depends on improvements in access to appropriate assessment accommodations for these students. He called for additional work to develop ways to evaluate the English proficiency of non-native English speakers. This is a particularly urgent issue in light of the recently passed legislation.
He also encouraged researchers to examine the relationships between performance on achievement tests and relevant background variables, such as length of residence in the U.S., years of exposure to instruction in English, English-language proficiency levels, the characteristics of school curriculum, availability of first- and second-language resources, and other factors that interact to create different patterns of performance on assessments.

ISSUES SPECIFIC TO NAEP

How Much Inclusion Is Enough?

Malouf raised questions about what rate of participation should be expected with NAEP. The presentations and his own examination of NAEP publications indicate that inclusion rates rarely climb much above 70 percent of the students with disabilities and are usually lower. He wondered what the basis might be for judging whether this rate of inclusion was high enough, asking, “Should our expectations be based on technical limits, or should they be based on other considerations?” Malouf called for reconsideration of what it means to “take part meaningfully” in the nation’s educational system, and he urged NAEP’s sponsors to determine ways that all students can participate.

Pressure to Disaggregate

The discussants revisited the issue of providing disaggregated results. Goertz reminded participants that states are required to report these comparisons on their state tests. NAEP’s sponsors have yet to specify their plans for using data from the national or state NAEP programs to report on the performance of students with disabilities compared to that of non-disabled students and the performance of English-language learners compared to that of native speakers. Johnson maintained that it is inevitable that there will be strong pressure on NAEP to report disaggregated results for students with disabilities and for English-language learners. Although at this time sample sizes are not large enough to allow reliable reporting at the disaggregated level, NAEP’s future plans for combining state and national samples may produce large enough samples to allow for disaggregation of various groups of students with disabilities. Johnson foresees that when this happens, NAEP will not be able to withstand the pressure to report disaggregated results.
Additional Research Is Needed

Malouf also recommended that additional research be conducted on the effects of accommodations on NAEP scores. He finds that the IRT (item response theory) and DIF (differential item functioning) analyses discussed by Mazzeo are broad in focus and treat accommodations as a single factor, sometimes even combining students with disabilities and English-language learners into a single population. Malouf suggested that NAEP researchers find ways to increase sample sizes to allow study of the effects of specific accommodations and to conduct more fine-grained analyses of accommodations and NAEP.