Accountability and Assessment
Public accountability has always been a hallmark of public schooling in the United States, although it takes a variety of forms. For example, in casting its votes for school board and state legislative candidates, the public holds elected officials accountable for educational quality. Policy makers, in turn, hold professional educators accountable when they decide under what conditions schools will be funded, how curriculum and instruction will be regulated, and how high performance will be rewarded and low performance sanctioned. The assumption in all these transactions is that a social contract exists between communities and their schools: the public supports and legitimates the schools and, in exchange, the schools meet the community's expectations for educating its children.
No matter what type of accountability mechanisms are imposed on schools, information about performance lies at the core. Only with public reporting on performance can policy makers and the public make informed decisions, and only with reliable and useful data do educators have the information necessary to improve their work. Data on school performance are varied and include revenue and expenditure reports, descriptions of school curricula, and student attendance records. But assessments of student achievement are the most significant indicator for accountability purposes. In fact, over the past 20 years, student scores on standardized tests have become synonymous with the notion of educational accountability.
The accountability system for general education differs in two major ways from that for special education: it is public, and it typically focuses on aggregate student performance. In contrast, for special education, accountability is centered on the individualized education program (IEP), an essentially private document that structures the educational goals and curriculum of an individual student
and then serves as a device for monitoring his or her progress. The accountability mechanisms for general and special education are not inconsistent with one another and, for students with disabilities, the IEP serves as the major vehicle for defining their participation in the common, aggregated accountability system. Nevertheless, if students with disabilities are to participate in standards-based reform, their individualized educational goals must be reconciled with the requirements of large-scale, highly standardized student assessments.
The education standards movement has emphasized assessment as a lever for changing curriculum and instruction, at the same time continuing and even amplifying its accountability purposes. Indeed, assessment has often been the most clearly articulated and well-publicized component of standards-based reform.
The appeal of assessment to policy makers who advocate education reform is understandable. Compared with other aspects of education reform, such as finding ways to implement and fund increased instructional time; improve recruitment, professional development, and retention of the most able teachers; and reduce class size, assessments are relatively inexpensive, can be externally mandated and implemented quickly, and provide visible results that can be reported to the press (Linn, 1995).
The preeminent role of assessment in standards-based reform has also attracted considerable controversy. Some observers have cautioned that a heavy reliance on test-based accountability could produce unintended effects on instruction. These include "teaching to the test" (teachers giving students practice exercises that closely resemble assessment tasks or drilling them on test-taking skills) and narrowing instruction to emphasize only those skills assessed rather than the full range of the curriculum. Indeed, research suggests that raising assessment stakes may produce spurious score gains that are not corroborated by similar increases on other tests and do not reflect actual improvements in classroom achievement (Koretz et al., 1991, 1992; Shepard and Dougherty, 1991; Shepard, 1988, 1990).
Analysts have also questioned the potential effects of assessment-based accountability on low-achieving students. Will schools choose to focus their efforts on students closest to meeting acceptable performance levels? What happens to students who fail to meet performance standards? Observers have questioned whether the same assessments can fulfill both their intended roles of measuring performance and inducing instructional change. Researchers have also raised concerns about the technical difficulties of designing and implementing new forms of assessment (Hambleton et al., 1995; Koretz et al., 1996a).
These potential effects do not appear to have dampened enthusiasm for assessment as a lever for reform; the basic purposes and uses of assessment in standards-based reform are proceeding unchanged.
Many students with disabilities, however, are exempted from taking common assessments for a variety of reasons, including confusion about the kinds of testing accommodations that are available or allowable, local concerns about the
impact of lower scores on average performance, concerns about the impact of stressful testing on children, and difficulties in administering certain tests to students with severe disabilities. But regardless of the reason, many students with disabilities who are exempted from assessments are not considered full participants in other aspects of the general curriculum. And if the performance of these students does not count for accountability purposes, then there may be less incentive for educational agencies to try to enhance their educational offerings and improve their performance. Eliminating these assessment barriers is therefore an important component of efforts to include more students with disabilities in standards-based reform.
Efforts to increase participation of students with disabilities in assessment programs reflect two distinct goals. One goal is to improve the quality of the educational opportunities afforded students with disabilities. For example, some reformers maintain that holding educators accountable for the assessment scores of students with disabilities will increase their access to the general education curriculum. A second goal is to provide meaningful and useful information about the performance of students with disabilities and about the schools that educate them. Although they recognize that student test scores alone cannot be used to judge the quality of a particular school's program, reform advocates assume that school-wide trends in assessment scores and the distribution of those scores across student groups, such as those with disabilities, can inform parents and the public generally about how well a school is educating its students. Ideally, an assessment program should achieve both goals.
With efforts to include increasing numbers of students with disabilities in standards-based reform, questions about assessment remain pivotal. For example, are assessments associated with existing standards-based reform programs appropriate for students with disabilities? The answer to this question may well depend on the nature of a student's disability, the nature of the assessment program, whether accommodations (i.e., modified testing conditions) are provided, and whether accountability rests at the student, school district, or state level. If accommodations are provided, what are their effects on the validity of the assessment? Should scores earned under accommodated conditions be flagged with a special notation in score reports? Many students with disabilities spend part of their school day working on basic skills, reducing their opportunity to learn the content tested by standards-based assessments. Is it fair, then, to hold them to standards of performance comparable to those of their peers without disabilities?
In the remainder of this chapter, we first provide an overview of accountability systems in standards-based reform. We then consider the role of assessment systems in standards-based reform. The next section describes the current participation of students with disabilities in state assessment programs. The fourth, and longest, section focuses on the necessary conditions for increasing their participation in large-scale assessments, with particular attention to reliability and validity considerations, the design of accommodations, test score reporting, the
legal framework, and resource implications. The following section discusses implications of increased participation, and a final section presents the committee's conclusions.
Our focus on the assessment of students with disabilities in the context of standards-based reform has precluded consideration of a number of more general issues concerning assessment of children with disabilities. Examples of key issues that are not addressed include proposed changes in the IQ-achievement discrepancy criterion used to identify students with learning disabilities (see Morison et al., 1996) and other issues related to assessment for program eligibility purposes and preparation of the IEP.
OVERVIEW OF ACCOUNTABILITY SYSTEMS
Accountability systems are intended to provide information to families, elected officials, and the public on the educational performance of students, teachers, schools, and school districts, to assure them that public funds are being used legitimately and productively. In addition, some accountability systems are intended to provide direct or indirect incentives to improve educational outcomes. Assessment results are usually the centerpiece of educational accountability systems. The intended purpose and the design of accountability systems affect the type of assessments that are used, how the assessment data are collected, how they are reported, and the validity standard to which assessment results are held.
The different purposes of accountability systems lead to distinctions that result in quite different assessment system designs. The first critical factor is the unit to which accountability is directed. Although some systems are geared to provide state-level accountability, these systems build on data collected about districts, schools, and individuals. Most standards-based reforms rest accountability at the district and school levels. Some systems, such as that of Tennessee, focus on classrooms. In addition, some reform programs seek to provide individual-level accountability by giving parents explicit information about the current status, progress, and relative educational performance of their children. This latter kind of accountability is particularly relevant for students with disabilities.
The second important distinction is the relevant comparison group in the accountability system. There are three common alternatives. The most basic system provides information that simply allows comparisons among similar units (districts, schools, teachers, or individuals). A more elaborate system also includes comparisons among subgroups, either at the system level or within units. For example, one may wish to compare performance indicators broken down by gender, income, or racial/ethnic groups. Comparisons could also be made between students with and without disabilities or among types of disabilities.
Finally, the appropriate time frame for accountability information is an issue. Variables to be decided include how often single-period information is collected,
whether multiple years are employed, and whether accountability relies on measures of individual student progress over time.
These distinctions yield considerable variation in assessment and accountability systems across states. For example, Tennessee has implemented a "value-added" assessment system that measures changes in classroom-level achievement over time. The system also has the unique characteristic of holding teachers accountable not only for the year they teach the students tested, but also for three subsequent years of student performance after students leave their classrooms. Most state accountability systems, however, hold schools responsible for student performance only in the grades in which state assessments are administered, with comparisons made among grade cohorts (e.g., fourth graders) in different years, rather than of the same students over time.
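The distinction between cohort comparison and value-added measurement can be illustrated with a brief sketch; all names and scores below are hypothetical and purely illustrative, not data from any actual assessment program:

```python
# Illustrative sketch (synthetic data): cohort comparison vs. value-added.
# A cohort comparison contrasts this year's 4th graders with last year's
# 4th graders; a value-added measure follows the same students over time.

last_year_grade4 = {"Ana": 210, "Ben": 195, "Cy": 220}   # hypothetical scores
this_year_grade4 = {"Dee": 215, "Eli": 205, "Fay": 230}  # a different cohort
this_year_grade5 = {"Ana": 222, "Ben": 201, "Cy": 228}   # same students, one year later

def mean(scores):
    return sum(scores.values()) / len(scores)

# Cohort comparison: different students, same grade, successive years.
cohort_change = mean(this_year_grade4) - mean(last_year_grade4)

# Value-added: gain of the same students from grade 4 to grade 5.
gains = [this_year_grade5[s] - last_year_grade4[s] for s in last_year_grade4]
value_added = sum(gains) / len(gains)

print(f"cohort change: {cohort_change:+.1f}")  # mixes cohort differences with school effects
print(f"value added:   {value_added:+.1f}")    # isolates growth of the same students
```

The two figures can differ substantially: the cohort comparison confounds any real school effect with differences between successive groups of students, whereas the value-added figure reflects only the growth of the students actually taught.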
According to a recent survey of state assessment programs, nearly every state and many school districts and schools now have some kind of assessment-based accountability framework in place (Bond et al., 1996). In 1994–95, 45 states had active statewide assessment programs. Most of the remaining states were in some stage of developing or revising their assessment programs. Two of the states without active assessments (Colorado and Massachusetts) suspended them while they were being revised. Nebraska is developing its first assessment program. Two states had no plans to implement a statewide assessment program of any kind (Iowa and Wyoming).
The assessments that form the basis of these statewide accountability programs are extremely diverse in the content covered, the grades assessed, testing format, and purpose. In general, students are assessed most often at grades 4, 8, and 11; five subjects (mathematics, language arts, writing, science, and social studies) are likely to be assessed. Most states use their assessments for multiple purposes, with the most common based on school- or program-level data: "improving instruction and curriculum" (n = 44), "program evaluation" (n = 39), and "school performance reporting" (n = 35). Twenty-three states report that they attach consequences at the school level to assessment results; these consequences include funding gains and losses, loss of accreditation status, warnings, and eventual state takeover of schools. Thirty states report that they use individual students' assessment results to determine high school graduation (18 states), grade promotion decisions (5), or awards or recognition (12) (Bond et al., 1996).
ASSESSMENT IN STANDARDS-BASED REFORM
Because standards-based assessments are diverse, it is difficult to generalize about them. Nonetheless, some common themes are discernible.
Dual Purposes—in standards-based reform, large-scale assessment programs usually have two primary, sometimes competing purposes. First, they are expected to provide a primary basis for measuring the success of schools, educators, and students in meeting performance expectations. Second, they are also expected to exert powerful pressure on educators to change instruction and other aspects of educational practice. In this respect, many current standards-based reforms echo the themes of "measurement-driven instruction" (Popham et al., 1985) that shaped state testing programs during the minimum-competency testing movement of the 1970s and the education reform movement of the 1980s (Koretz, 1992). Current assessments differ from those of previous reform movements, however, in their emphasis on higher standards, more complex types of performance, and systemic educational change.
Externally Designed and Operated—the assessments that are most central to the standards-based reform movement are external testing programs—that is, they are designed and operated by authorities above the level of individual schools, often by state education agencies. Internal assessments—those designed by individual teachers and school faculties—also play an important role in many standards-based reforms; indeed, one explicit goal of some standards-based reforms is to encourage changes in internal assessments. External assessments, however, are typically considered the critical instrument for encouraging changes in practice, including changes in teachers' internal assessments.
Use for Individual or Group Accountability—many large-scale external assessments are used for accountability, although the means of doing so vary greatly. Some assessments have high-stakes accountability for individuals, meaning that individual students' results are used to determine whether a student will graduate from high school, be promoted to the next grade, or be eligible for special programs or recognition. An example is the recently announced high school assessments in Maryland, which will be required for graduation. Other assessments impose serious accountability consequences for educators, schools, or districts but not for students. For example, schools that use aggregated student results to show sufficiently improved performance on the Kentucky Instructional Results Information System (KIRIS) assessments receive cash rewards, and, beginning in 1997, schools that fail to show improvement will be subject to sanctions. In yet other instances, the publicity from school-by-school reporting of assessment results is the sole or primary mechanism for exerting pressure. As we discuss later in this chapter, the method used to enforce accountability—in particular, whether consequences are attached to group performances (schools or classrooms) or individual students—has important implications for the participation of students with disabilities.
Infrequently Administered—in many standards-based systems, the external assessments used for accountability are administered infrequently. For example, Maryland's School Performance Assessment Program (MSPAP) is administered in only three grades (third, fifth, and eighth). Kentucky's KIRIS was originally administered in three grades (fourth, eighth, and twelfth); in the last several years, the assessments have been broken into components that are administered in more grades, but a given component, such as writing portfolios, is still administered in only three grades. These assessments are intended to assess a broad range of
skills and knowledge that students are expected to have mastered by the grades at which they are administered. In this respect, they differ from course-based examinations, such as the College Board advanced placement tests and the former New York Regents examinations, and they contrast even more sharply with various types of assessments given throughout the school year to assess individual progress.
Reporting by Broad Performance Levels—in keeping with the central focus of standards-based reform, these assessments typically employ standards-based rather than normative reporting. That is, student results are reported in terms of how they compare against predetermined standards of what constitutes adequate and exemplary performance, rather than how they compare with the performance of other students in the nation or other distributions of performance. Moreover, the systems typically employ only a few performance standards. For example, Kentucky bases rewards and sanctions primarily on the percentages of students in each school reaching four performance standards (novice, apprentice, proficient, and distinguished) on the KIRIS assessments; Maryland publishes the percentages of students in schools and districts reaching the satisfactory level. In these systems, gradations in performance within one level—that is, between one standard and the next—are not reported. In Kentucky, for example, variations among students who have reached the apprentice level but not the proficient level are not reported.
Reporting of results in normative terms, such as national percentile ranks, is downplayed, although it is not always abandoned altogether. For example, the Kentucky Education Reform Act required that the results of the proposed assessment that has become KIRIS be linked to the National Assessment of Educational Progress to provide a national standard of comparison, and the most recent version of KIRIS will include some use of commercial tests, the results of which are reported in terms of national norms.
Performance Assessment—the standards-based reform movement has been accompanied by changes in the character of assessments to reflect the changing goals of instruction, as discussed in Chapter 4. In an effort to better measure higher-order skills, writing skills, and the ability to perform complex tasks, large-scale assessments are increasingly including various forms of performance assessment, either in addition to or in lieu of traditional multiple-choice testing. The term performance assessment encompasses a wide variety of formats that require students to construct answers rather than choose responses; these include conventional direct assessments of writing, written tasks in other subject areas (such as explaining the solution to a mathematical problem), hands-on tasks (such as science tasks that require the use of laboratory equipment), multidisciplinary tasks, small-group tasks, and portfolios of student work. In some instances, the specific skills or bits of knowledge that would have been assessed by short, specific items in traditional tests are instead embedded in complex tasks that take students a longer time to complete.
CURRENT PARTICIPATION OF STUDENTS WITH DISABILITIES IN ACCOUNTABILITY AND ASSESSMENT SYSTEMS
Although several studies have documented that the participation of students with disabilities in statewide assessments generally has been minimal, it is also extremely variable from one state to another, ranging from 0 percent to 100 percent (Erickson et al., 1995; McGrew et al., 1992; Shriner and Thurlow, 1992). Inconsistent data collection policies make it difficult to compare participation rates from place to place or to calculate a rate that has the same meaning across various locations; in addition, states tend to use methods that inflate the rates (Erickson et al., 1996).1
Forty-three states have written guidelines about the participation of students with disabilities in state assessments. Most states rely to some extent on the IEP team to make the decision, but only about half the states with guidelines require that participation decisions be documented in the IEP (Erickson and Thurlow, 1996). A number of other factors also affect (and often complicate) these decisions, including vague guidelines that can be interpreted in a variety of ways, criteria that focus on superficial factors rather than on student educational goals and learning characteristics, and concerns about the potentially negative emotional impact of participation on the student (Ysseldyke et al., 1994). In addition, anecdotal evidence suggests other influences, such as pressures to keep certain students out of accountability frameworks because of fears that these students will pull down scores.
In most states, nonparticipation in the assessment means that students are also excluded from the accountability system (Thurlow et al., 1995b). Indeed, in many states, even some students who participate in a statewide assessment may still be excluded from "counting" in the accountability framework. Sometimes states or school districts simply decide to exclude from aggregated test scores any students who are receiving special education services (see Thurlow et al., 1995b). For example, in one state the scores of students with IEPs who have taken the statewide accountability assessment are flagged and removed when aggregate scores are calculated for reporting back to districts and to the media; the scores of the students with IEPs are totaled separately and given to principals, who then can do with them what they wish (which frequently means discarding them). Unfortunately, these practices, if unmonitored, may lead to higher rates of exclusion of students with disabilities from accountability frameworks, particularly when incentives encourage exclusion (e.g., if high stakes are associated with aggregated test scores without regard to rates of exclusion). In fact, researchers (Allington and McGill-Franzen, 1992) have demonstrated that the exclusion of
students with disabilities from high-stakes assessments in New York has led to increased referrals to special education, in part to remove from accountability decisions students who are perceived to be performing at low levels.
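The incentive problem described above can be made concrete with a small sketch; the scores are hypothetical, but the arithmetic shows why exclusion is attractive when stakes attach to aggregate results: removing the lowest-scoring students mechanically raises the reported average even though no student's achievement has changed.

```python
# Illustrative sketch (hypothetical scores): how exclusion inflates averages.
all_scores = [240, 230, 215, 210, 190, 175, 160]  # every student in the school

reported_all = sum(all_scores) / len(all_scores)

# Suppose the three lowest scorers are exempted from the assessment.
included = [s for s in all_scores if s >= 200]
reported_after_exclusion = sum(included) / len(included)

print(f"mean, all students:     {reported_all:.1f}")
print(f"mean, after exclusions: {reported_after_exclusion:.1f}")
# The reported "gain" is an artifact of who was tested,
# not of any improvement in teaching or learning.
```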
One of the avenues for increasing participation of students with disabilities in assessments is allowing accommodations. Accommodations currently in use fall into four broad categories (Thurlow et al., 1993). Changes in presentation include, for example, braille forms for visually impaired students and taped versions for students with reading disabilities. Changes in response mode include use of a scribe or amanuensis (an individual who writes answers for the examinee) or computer-assisted responses in assessments that are not otherwise administered by computer. Changes in timing include extra time within a given testing session and the division of a session into smaller time blocks. Changes in setting include administration in small groups or alone, in a separate room. In addition, some students with disabilities may be administered an assessment in a standard setting with some form of physical accommodation (e.g., a special desk) but with no other alteration.
Within the past five years, increasing numbers of states have written guidelines outlining their policies on the use of accommodations. In 1992, 21 states indicated they had written guidelines on the use of accommodations by students with disabilities in their statewide assessments; by early 1995, 39 states had such written guidelines (Thurlow et al., 1995a).
An analysis of these state accommodations guidelines found a great deal of variation in their format and substance (Thurlow et al., 1995a). Some are one sentence long, and others take up numerous pages. States use diverse terms (e.g., nonstandard administration, mediation, modification, alteration, adaptation, accommodation), sometimes interchangeably. Some states vary their guidelines depending on the type or purpose of the assessment, and others use the same guidelines for all purposes. States also classify accommodations in different ways: by category of disability, by the specific test being administered, or by whether accommodations are appropriate for score aggregation.
Perhaps most important, states take different approaches regarding which accommodations they allow or prohibit and how they treat the scores of students with disabilities who use accommodations. An accommodation that is explicitly permitted in one state might be excluded in another (Thurlow et al., 1995a). This variation is not surprising, given that little research exists on the impact of specific accommodations on the validity of various elementary and secondary achievement tests (Thurlow et al., 1995d). States also have divergent policies about whether to include the scores of students with disabilities who used accommodations in assessment-based accountability frameworks. Some states exclude these scores because of concerns about their validity (Thurlow et al., 1995a).
Despite the variability of state guidelines on accommodations, some generalizations can be made (Thurlow et al., 1995a). First, the majority of states with guidelines (n = 22) recognize the importance of the IEP and the IEP team in
making decisions about accommodations for individual students. Second, many states (n = 14) specifically refer to a link between accommodations used during assessment and those that are used during instruction. Third, relatively few states (n = 4) require written documentation about assessment accommodations beyond what is written in the IEP. Even without such a requirement, however, many state assessment directors still document the use of assessment accommodations. A 1995 survey found that 17 of the 21 states that collect data on individual students with disabilities in their statewide assessment database also document whether an individual student used an accommodation. Not all of these states, however, can identify exactly which accommodations a student used (Elliott et al., 1996a; Erickson and Thurlow, 1996).
In most states, the net effect of policies on exclusion and accommodation is to keep at least some students with disabilities out of the accountability framework. However, most states are now reviewing the participation of students with disabilities in their assessment and accountability systems and the use of accommodations.2 (We examine the design of assessment accommodations in the next section of this chapter.)
The large-scale assessments that typify standards-based reform are in many ways unlike those typically used in special education. Although including students with disabilities in these assessments may benefit them, the assessments themselves are not designed to manage the instruction delivered to individual students with disabilities.
Large-scale assessments are not intended to track the progress of individual students. Assessments are infrequent; often they are administered late in the school year, so that teachers do not see results until the following school year. They are not designed to provide longitudinal information about the progress of individual students, and in many cases, they do not place results from different grades on a single scale to allow measurement of growth over time. In fact, some large-scale assessments used in standards-based reform are also not designed to provide high-quality measurement of individual performance; measurement quality for individuals was deliberately sacrificed in the pursuit of other goals, such as broadening the content coverage of the assessment for schools and incorporating more time-consuming tasks.
Moreover, unlike some assessments used to manage special education, the large-scale assessments in standards-based reform focus on high-performance standards that are applied without distinction or differentiation to most students, including low-achieving students and students with disabilities. In contrast, a fundamental tenet of the education of students with disabilities is individualization and differentiation, as reflected in the IEP. The educational goals for each student with disabilities are required to reflect his or her capabilities and needs, as
should the instructional plan and assessments. For example, one IEP may call for a sign language interpreter to assist a student in the advanced study of history; another IEP may call for training in basic functional skills, such as telling time. Standards-based reform calls for uniformity in outcomes, allowing educators variation only in the path to those ends.
Many of the new, large-scale assessments deliberately mix the activities and modes of presentation required by individual tasks to mirror real-world work better and to encourage good instruction. A task may require substantial reading and writing as well as mathematical work, group work as well as individual work, or hands-on activities as well as written work. This mixture of modes is a reaction against the deliberately isolated testing of skills and knowledge found in traditional tests. But the instructional programs of many students with disabilities focus on developing very specific skills, which are tested most effectively with narrowly focused tasks.
The methods used to report assessment results may limit their utility for tracking the progress of some students with disabilities. New large-scale assessments typically use only a few performance levels, in which the lowest level is set high relative to the overall distribution of performance. Consequently, no information is provided about modest gains by the lowest-performing students, including some students with disabilities, and the reporting rubric signals that modest improvements are not important unless they bring students above the performance standard. Ideally, the tests used in special education should track the kinds of modest improvements that a student can reasonably achieve in the periods between measurements.
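A brief sketch makes the reporting problem concrete; the cut scores and labels below are hypothetical (loosely patterned on the four-level schemes described earlier), not those of any actual state system:

```python
# Illustrative sketch (hypothetical cut scores): coarse performance levels
# can hide real but modest gains by low-performing students.
CUTS = [(60, "distinguished"), (45, "proficient"), (30, "apprentice"), (0, "novice")]

def level(score):
    """Return the highest performance level whose cut score is met."""
    for cut, name in CUTS:
        if score >= cut:
            return name

before, after = 10, 25  # a substantial gain for this student...
print(level(before), "->", level(after))  # ...yet both scores report as the same level
```

Because only the level is reported, a 15-point gain that falls entirely within the lowest band is invisible in the accountability data, even though it may represent significant progress toward the student's IEP goals.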
INCREASING THE PARTICIPATION OF STUDENTS WITH DISABILITIES IN LARGE-SCALE ASSESSMENTS
Including more students with disabilities in large-scale assessments in a way that provides meaningful and useful information will require confronting numerous technical issues. In addition, these assessments must be designed and implemented within the legal framework that defines the educational rights of students with disabilities and with consideration of the resources that new assessments will require for development, training, and administration. We address these technical and political issues in this section.
Assessment programs associated with standards-based reform should satisfy basic principles of measurement, regardless of whether the assessment is traditional, performance-based, or a combination. Performance assessments, which comprise the bulk of standards-based assessments, are relatively new, and empirical evidence on their quality, although growing, is limited. Nonetheless, measurement experts agree that "performance assessments must be evaluated by the same validity criteria, both evidential and consequential, as are other assessments. Indeed, such basic assessment issues as validity, reliability, comparability, and fairness need to be uniformly addressed for all assessments because they are not just measurement principles, they are social values that have meaning and force outside of measurement wherever evaluative judgment and decisions are made" (Messick, 1994:13, cited in Linn, 1995).
These basic principles hold regardless of whether the students to be assessed receive general education services only or also receive additional services due to economic disadvantage (e.g., Title I) or disability. The importance of complying with basic measurement principles was acknowledged by Congress when it amended the Title I assessment requirements (P.L. 103-382). The amendments require states to develop or adopt, for Title I, challenging content standards; challenging performance standards that define advanced, proficient, and partially proficient levels of performance; and high-quality yearly assessments in at least reading and math. The law requires that these assessments:
be used only for purposes for which they are reliable and valid;
be consistent with nationally recognized professional and technical standards; and
make reasonable adaptations for students with diverse learning needs.
To date, the evidence suggests that creating performance assessments that are reliable and valid remains a challenging and costly endeavor. Measurement error is consistently greater than that associated with traditional tests (Koretz et al., 1994). Making generalizations from a limited number of performance tasks about students' competence in performing purportedly similar tasks remains problematic (Breland et al., 1987; Dunbar et al., 1991; Gao et al., 1994).
By design, many performance assessments associated with standards-based reform require students to integrate a variety of knowledge and skills to produce a product or performance. Thus, for example, performance assessments in the area of mathematics are likely to involve reading and writing in the context of problem solving. This approach increases the probability that reading or writing disabilities, which are among the most common, will interfere with the assessment of mathematics. A similar situation exists for assessments of other areas, as when demonstrating knowledge of a scientific principle requires writing or relatively complex physical manipulation compared with a traditional multiple-choice item. Creating reliable and valid performance assessments for students in general will therefore be a challenging endeavor, and creating assessments that also are reliable and valid for students with disabilities is likely to be even more difficult. Empirical data are needed to inform the design of performance assessments and their use with students with disabilities.
"Reliability" refers to the consistency of performance across instances of measurement—for example, whether results are consistent across raters, times of measurement, and sets of test items.3
Reliability takes many forms, and the choice of both a measure of reliability and an acceptable level of reliability depends on how scores will be used. Three factors that influence reliability—the sampling of students, variation among tasks, and excessive difficulty levels—are particularly important in the assessment of students with disabilities.
Sampling of Students
In standards-based reform, assessments are often used to make inferences about groups—for example, changes in the percentage of a school's students who have reached standards. A key aspect of reliability for measures of group performance is sampling error—that is, the variation in results from one group of students to the next.
Sampling error can be a concern for two reasons. First, in some standards-based systems, results are drawn from only a sample of students rather than the entire population in a given grade. For example, state-level reporting of portfolio scores in Vermont is based on random samples of students drawn from each school. Second, even when all students are tested, sampling error is important if inferences are going to be made beyond the performance of those particular students—for example, to compare them to another group. Such inferences are often made in standards-based systems, as when trends over time are used to judge schools' progress in meeting educational goals. Thus, for example, each successive group of third graders is compared with the one before to see if progress is being achieved. However, due to sampling error, some of the year-to-year change in a school's scores will reflect differences in the characteristics of the particular groups (or cohorts) of third grade students rather than changes in educational effectiveness within a school: a given year's third graders may be easier or harder to teach than the prior year's.
Sampling error is inversely related to the number of students included in the group's total (or aggregated) score: the fewer students included, the more unstable or unreliable the results. Accordingly, results for small schools (e.g., elementary schools and small rural secondary schools) or for small groups, such as students with disabilities, are likely to be less reliable and more subject to fluctuations due to random characteristics of individual members of the group. Sampling error will be a particularly serious concern if assessment results are to be reported separately for students with disabilities because of their relatively small numbers. The problem of sampling error is likely to be compounded further when scores need to be reported for students with different types of disabilities—both because their numbers will be even smaller and because, as discussed in Chapter 3, categories of disability are so variable. These issues are discussed further in the section on reporting.
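The inverse relationship between group size and sampling error can be illustrated with a small simulation. The sketch below is hypothetical: the group sizes, the 30 percent base rate of students meeting the standard, and the function names are illustrative assumptions, not figures from any actual assessment program.

```python
import random
import statistics

random.seed(0)

def percent_meeting_standard(n, p_meet=0.30):
    """Simulate one cohort of n students, each independently having
    probability p_meet of reaching the standard; return the observed percent."""
    meets = sum(random.random() < p_meet for _ in range(n))
    return 100.0 * meets / n

def year_to_year_sd(n, years=1000):
    """Standard deviation of the reported percent across simulated cohorts,
    i.e., the score fluctuation attributable purely to sampling error."""
    return statistics.stdev(percent_meeting_standard(n) for _ in range(years))

for n in (500, 100, 25, 10):  # e.g., a large school down to a small subgroup
    print(f"n={n:4d}  year-to-year SD of 'percent meeting standard': "
          f"{year_to_year_sd(n):.1f} points")
```

Even with no change in educational effectiveness, a group of 10 students shows swings several times larger than a group of 500, which is why separately reported scores for small disability subgroups are especially unstable.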
Variation Among Tasks
Another aspect of reliability that may be particularly important in the assessment of students with disabilities is "task variance": differences in performance across tasks in an item pool. Evidence suggests that, particularly when tasks are complex, the relative performance of students at the same level of proficiency in a domain (say, students with roughly equal facility in writing) tends to vary markedly from one task to the next (Dunbar et al., 1991; Shavelson et al., 1993). This variation is a particular concern in the case of some performance assessments, in which tasks are complex and the relatively small number of them makes it difficult to average out the variations among tasks.
A more serious issue is the possibility of student-by-task interactions: differences in the ranking of students from one task to the next, independent of the average difficulty of the tasks. Although empirical evidence is not yet available, it is possible that this problem may be exacerbated for some students with disabilities because the irrelevant attributes of individual tasks may have greater impact on their performance than on the performance of many students without disabilities. For example, in a performance assessment of science, one task may require the student to manipulate objects, making that task especially difficult for students with orthopedic disabilities; other tasks may not require any physical manipulation. The ranking of some students with orthopedic disabilities on the first task is likely to be markedly lower than on other tasks, but this score would not reflect their true understanding of science.
Excessive Task Difficulty
General constraints on the reliability of performance assessments include the relatively small number of tasks and scoring categories and the subjective nature of scoring. An additional constraint is at work for students who score particularly high or low. Although traditional test theory assumes that error is constant along the full range of test scores (Green et al., 1984), in fact, measurement precision varies as a function of level of performance for most tests (Thissen, 1990). Most tests provide more precise estimates for average performers and less precise estimates for either low or high performers (Hambleton and Swaminathan, 1985).
A defining feature of assessments associated with standards-based reform is that the performance standards are set at high, "world class" levels (Linn, 1995).
Results from California, Kentucky, and Maryland show that there is a substantial gap between the performance levels specified in the standards and the actual, current levels of performance for general education students. The gap is likely to be even wider for many low-achieving students, such as some students with disabilities and those participating in Title I programs.
The result of this gap will be a decrease in the reliability of scores and performance reports for low-achieving students, including some with disabilities. If assessments are used only to estimate the proportion of students reaching a high performance standard, this additional unreliability will not be a serious concern; the assessment will correctly identify most of the low-performing students as failing to reach the standard. This source of unreliability could be important, however, if assessments are used to provide other information about the performance or progress of low-achieving students, such as changes in their mean scores, changes in the proportion reaching lower performance thresholds, or information about the performance of individual low-performing students.
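A small simulation makes the point concrete. The cutoff, score scale, and error magnitude below are hypothetical assumptions chosen for illustration: when the performance standard sits far above a student's score distribution, the pass/fail classification is highly consistent, yet a genuine 10-point gain leaves the reported category unchanged.

```python
import random

random.seed(1)

CUTOFF = 75.0  # a high performance standard on a hypothetical 0-100 scale

def observed_score(true_score, sem=5.0):
    """One noisy measurement of a student's true score."""
    return random.gauss(true_score, sem)

def percent_classified_passing(true_score, trials=10_000):
    """Percent of simulated administrations in which the student's
    observed score reaches the cutoff."""
    scores = (observed_score(true_score) for _ in range(trials))
    return 100.0 * sum(s >= CUTOFF for s in scores) / trials

# A low-performing student before and after a real 10-point improvement.
for label, true_score in (("before gain", 40.0), ("after gain", 50.0)):
    print(f"{label}: true score {true_score}, classified 'meets standard' in "
          f"{percent_classified_passing(true_score):.1f}% of administrations")
```

The classification is reliable in the narrow sense (virtually always "below standard"), but the report carries no trace of the student's real progress.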
Consideration of these three factors that influence reliability—sampling of students, variation in performance among tasks, and excessive difficulty levels—suggests that in particular cases the reliability of scores obtained by some students with disabilities may be lower than the reliability of results for the general student population. Additional empirical evidence is needed, however, to explore this possibility. When feasible, reliability should be examined empirically for specific groups and for particular test uses. When appropriate reliability studies are not feasible, results for students with disabilities should not be assumed to be as reliable as those for other students.
Furthermore, the relative importance of these reliability considerations is greatly influenced by whether inferences are being made about individuals or about groups. The reliability of test scores for individual students is critical when scores are used to make decisions about instructional placements or receipt of a diploma, but it is often relatively unimportant when scores are aggregated to characterize the performance of large groups. Conversely, the sampling of students is irrelevant when a score is used only to draw inferences about the individual who has taken the test, but it can be a major source of unreliability when scores are aggregated to describe the performance of small groups.
Two general themes become apparent in considering the validity of assessments of students with disabilities.4 The first is that the degree of validity for students with disabilities may not be similar to that for other students. For example, the potentially lower reliability noted above could threaten the validity of inferences about students with disabilities.
The second theme tempers the first: there is a severe shortage of empirical research pertaining to the assessments of students with disabilities. A decade and a half ago, a National Research Council panel noted (Sherman and Robinson, 1982:141): "Almost from the promulgation of the Section 504 [of the Rehabilitation Act of 1973] regulations, it has been clear that there exists insufficient information to allow for the demonstrably valid testing of handicapped people, as required by the regulations." Although some research has been conducted since then, the National Research Council's generalization remains more true than not today, particularly for students younger than high school juniors and seniors, for whom there is almost no empirical evidence. Both measurement theory and the available research raise concerns about the validity of assessments of students with disabilities. Research tailored more directly to elementary and secondary school students participating in the kinds of assessments used in standards-based reform is urgently needed to evaluate these concerns.
Link Between Use and Validity
Although people often speak of "valid" or "invalid" tests, validity is not an attribute of a test per se. Rather, validity is an attribute of a specific inference or conclusion based on a test. A valid inference is a conclusion well supported by the results of a given assessment; a less valid inference is poorly supported. To say, for example, that a given test is a valid measure of high school algebra means that conclusions about mastery of high school algebra are well supported by scores on that test. Therefore, validity depends on a test's particular uses and the specific inferences it is used to support.
The assessments at the core of standards-based reform have various functions, but, in one way or another, they are most often used to determine whether groups of students have reached acceptable levels of educational achievement, as embodied in explicit performance standards. They are generally not used for many of the traditional purposes governing assessments of students with disabilities. For example, the new assessments are generally not used for making decisions about individual students, including diagnosing disabilities, monitoring short-term progress toward IEP goals, or making placement decisions.5 Accordingly, some of the considerations of validity that are important for these traditional uses are less important in standards-based reform. For example, a central concern that arises when tests are used for individual decision making is the frequency of misclassifications—such as placing a child in the wrong instructional program—stemming from measurement error (Shepard, 1989; Taylor and Russell, 1939). This is not a concern when scores are reported only for groups such as schools and districts.6 As long as the assessments used in standards-based reform are not used to determine placements, they will not raise this issue.
As explained earlier, some assessments for standards-based reform have consequences for individual students, whereas others are intended to monitor the educational achievement of groups of students at the school, district, or state levels. Assessments aimed at measuring group performance, such as KIRIS and MSPAP, have been designed to optimize measurement at the aggregate level, at the cost of precluding high-quality measurement of individual students.7
The unit of accountability (group or individual) has substantial implications for validity. For example, in the case of aggregate measurement, low reliability of scores for individual students may pose little or no threat to the validity of the intended inferences, and inappropriateness of tasks for some students may become much less important. Sampling error, however, which is irrelevant to individual measurement, may become a serious threat to the validity of inferences about groups, particularly in the case of small groups.
The fairness of assessments of students with disabilities is another issue related to validity that warrants consideration. Like other aspects of validity, fairness hinges fundamentally on how scores are used. In particular, assessments that are not fair when individual students are held accountable for scores may be fair when schools or districts are the unit of accountability. Suppose that two hypothetical schools, A and B, have similar populations of students with disabilities. School A has been diligent in searching for the most appropriate and effective instructional approaches for its students with disabilities, but School B has not. In a system that imposed high stakes for individual test scores, the students with disabilities in School B would be unfairly disadvantaged, perhaps to the extent that the assessment would fail to meet ethical and legal scrutiny. However, if scores are used to reward and penalize staff, not students, the low scores of students with disabilities in School B would be fair, in that they would accurately reflect bad practice and would lead to negative consequences for the staff. Temporarily exempting students with disabilities in School B from consequences for their low scores would increase the fairness of the system, whereas exempting their teachers from consequences would decrease its fairness.
Attaching high stakes to test results, as many states plan to do, has several general implications for the validity of these assessments. First, high-stakes tests are typically held to higher standards of quality because the consequences of incorrect test-based decisions are substantial. Validity evidence that would be deemed sufficient in the case of low-stakes assessments will often be judged insufficient in the case of high-stakes assessments. This suggests that the lack of validity evidence for some students with disabilities will be an even more pressing concern when stakes are high.
Second, evidence indicates that, when consequences are imposed for test performance, scores may become inflated (Koretz et al., 1991). High stakes increase incentives to teach to the test, sometimes to the point of deemphasizing important aspects of the domain (e.g., biology, algebra) that should be taught but that the test does not directly measure. The result is that test scores can increase even when performance has not increased across the whole domain.
Because the assessments of standards-based reforms are generally used to support the same inferences for all students—that is, whether students have reached performance standards—the comparability of test results for students with and without disabilities is a critical aspect of validity.
Comparability of test scores has many meanings, and the movement toward standards-based reform and performance assessment has made the issue of comparability even more complex. Recent papers suggest the need for caution in inferring comparability, especially for results from different assessments (Haertel and Linn, 1996; Linn, 1993; Mislevy, 1992). Two conclusions from this work are particularly important for the present discussion. First, even when the results of different assessments are linked by statistical methods, justifiable inferences about the comparability of performance across assessments may be severely limited. Second, an approach that makes results more comparable for one purpose may actually degrade comparability for another. For example, under many circumstances, linking two assessments in a manner that improves the comparability of estimates for groups distorts estimates for individual students (Mislevy, 1992).
In standards-based reforms, the central issue of comparability could be called "performance comparability": the degree to which similar scores obtained by students with and without disabilities support similar inferences about their current level of achievement with respect to performance standards. Performance comparability is questionable whenever students with disabilities are administered assessments that differ appreciably from those administered to peers without disabilities.
A number of factors may influence comparability. Some students with disabilities will be administered assessments that differ only modestly from those administered to other students, for example, in the provision of slightly more time to complete the assessment; in other instances, the assessments administered to students with disabilities will be considerably altered. Performance comparability is likely to be problematic for students with very low levels of academic performance, such as many students with severe cognitive disabilities. Within a reasonable range, one can have measures that differ in difficulty but measure the same construct. When differences in performance are very large, however, the tests may not be measuring the same underlying construct. Performance comparability is also likely to be affected when disabilities are related to the construct measured. This issue, which potentially affects a substantial percentage of students with disabilities, is revisited below in the discussion of accommodations.
Performance comparability may also be particularly difficult to evaluate and document because of the difficulty of obtaining concurrent indicators of achievement against which to validate a given score. Other performance measures, such as grades or performance on other assessments, are likely to be similarly affected by a student's disability or the accommodations provided. Even when information on later performance would be useful for this purpose, it is likely to be scarce. For example, information about later performance in postsecondary education or employment is generally unavailable for representative groups of students.8
Three clear implications emerge from this discussion of score comparability. First, one should be very cautious in assuming that the results of any assessment that differs from the common one in format or testing conditions are comparable in meaning to those of the common assessments. Second, comparability of meaning may hinge on the specific inferences the assessment is used to support; comparability for one purpose may not indicate comparability for another. Third, additional empirical exploration of the comparability of results from modified assessments for standards-based reform is badly needed.
Context Dependence of Performance
The performance of an individual in a given domain tends to vary across contexts, sometimes in idiosyncratic ways. For example, the proficiency of some people in writing or mathematical computation falls off markedly when they are placed under time pressure, whereas other people are less affected. This is one of many reasons why measurement experts caution against drawing broad inferences from a single measure of performance.
It is possible that the performance of some students with disabilities may be particularly affected by contextual differences, which could interfere with the validity of inferences. For example, some students have disabilities that cause them to work somewhat slowly on certain tasks, thereby making their performance more susceptible to time pressure than that of many other students. An assessment that is a "power" test for most students (that is, performance is constrained by knowledge and skill but rarely by time limits) may be a "speed" test (where speed of performance limits scores) for many students with disabilities, and the provision of additional time is one of the most frequently offered testing accommodations for students with disabilities. As Bennett (1995) has pointed out, a test that is speeded for students with disabilities but not for other students may measure different attributes for the two groups.9
Contextual differences may constrain the justifiable inferences that can be reached about some students with disabilities on the basis of assessments for standards-based reform. The explicit inferences based on these assessments tend to be very broad, along the lines of "35 percent of fourth graders reached the minimally acceptable standard in writing"; these are oversimplifications for students in general education but may be untenable for some students with disabilities. At this time, however, there is little empirical basis for judging when contextual effects are particularly problematic for students with disabilities.
Relationship Between Disabilities and the Constructs Measured
Many approaches to the assessment of individuals with disabilities, particularly assessment accommodations, assume that disabilities are not directly related to the construct tested. Case law indicates that rights to accommodations do not apply when the disability is directly related to the construct tested (see Phillips, 1994). In other words, a student with a reading disability might be allowed help with reading (the accommodation) on a mathematics test, since reading is not in the construct being measured, but would not be allowed help with reading on a reading test, since the disability is directly related to the construct of reading.
However, the groups of students with clearly identifiable disabilities (such as motor impairments) that are largely unrelated to the constructs being tested constitute a small proportion of the identified population of students with disabilities. Most students with disabilities have cognitive impairments that presumably are related to at least some of the constructs tested.
Relationships between disabilities and assessed constructs have important implications for the validity of inferences based on test scores. For example, if a new assessment includes communication skills as an important part of the domain of mathematics, then, to score well in mathematics, students would need to be able to read and write reasonably well.10 On such an assessment, it is possible that students with reading disabilities might score worse than their proficiency in other aspects of mathematics would warrant, but providing them with accommodations such as the reading of questions or the scribing of answers is likely to undermine the validity of inferences to the broader, more complex domain of mathematics.
Several factors are likely to complicate efforts to evaluate this problem and to decide, for example, how best to use accommodations to maximize validity. First, as already noted, many performance assessments deliberately mix constructs and modes of response, making it more difficult to segregate the specific skills involved, especially those pertinent to a given disability. Second, the inconsistent classification of students with cognitive and learning disabilities does not provide clear criteria for describing the characteristics of various categories of disability, thus making guidelines for valid accommodations problematic (see Chapter 3).
Assessment Modifications and Accommodations
Tests are often altered in response to individuals' disabilities. For example, a blind individual cannot take a test that is normally presented in printed form unless it is presented orally or in braille. Such alterations are intended to remove irrelevant barriers to performance and allow the individual to demonstrate his or her true capabilities. As increasing percentages of students with disabilities are included in large-scale assessment programs, requests for such accommodations are likely to become more frequent. However, research on alterations of assessments for elementary and secondary school students is extremely sparse and provides only limited guidance for policy makers and educators.
Types of Alterations
Assessments are altered for individuals with disabilities in numerous, diverse ways, and the terminology used to describe these alterations is not always consistent. For purposes of this discussion, we will distinguish among (1) accommodations, (2) modifications, and (3) the substitution of different assessments. The distinction among these three categories is not always clear, and other classifications are also in use. This classification nonetheless helps to clarify the issues that arise in using altered assessments.
We label as accommodations changes in assessments intended to maintain or even facilitate the measurement goals of the assessment. Accommodations are generally intended to offset a distortion in scores caused by a disability, so that scores from the accommodated assessment would measure the same attributes as the assessment without accommodations administered to individuals without disabilities. But, like any alteration in standardized administration procedures, accommodations may alter what an assessment measures, even when it appears on its face not to do so.
We use modification to refer to alterations of the content of an assessment.11 Most content modifications are likely to change what a test measures. For example, educators may delete from an assessment specific items, subtests, or tasks that are deemed inappropriate or impractical for a specific examinee, or they may replace such a task with an alternative that would be more reasonable for that individual.
One common type of modification of tests is the administration of easier forms intended for younger children ("out-of-level testing"). Under certain restrictive conditions, out-of-level testing may preserve the measurement functions of an assessment, but it is unlikely to do so in the case of many standards-based assessments.12 Moreover, testing that is substantially out of level may not produce comparable results (Plake, 1976), and out-of-level testing may be problematic in subjects in which curriculum content differs markedly across grades. In addition, standards-based assessments are typically not constructed, administered, or reported in ways that would help preserve their measurement functions if administered out of grade. Perhaps most important, they are typically reported in terms of standards that are set within grades and are not linked between grades. Accordingly, in the case of the assessments used in standards-based reform, it is safest to consider out-of-level testing to be a modification that threatens performance comparability, not an accommodation that has the potential to maintain or even enhance it.
Finally, in some instances, students with disabilities may be administered different assessments rather than accommodated or modified versions of the same assessments administered to other students. These different tests may or may not be related conceptually to the regular assessments, but they are constructed as distinct assessments. Examples include Kentucky's alternative portfolio assessments and Maryland's Independence Mastery Assessment Program (IMAP) assessment, both of which are administered to a small percentage of students with disabilities who meet specific requirements for exemption from the states' regular assessments (Box 5-1).
These latter two kinds of alterations, modified and different assessments, typically will not support the same inferences as the regular assessments administered to students without disabilities. The measurement question raised by these altered assessments is the degree to which they support similar inferences. For example, they may be able to support only a subset of the inferences supported by the regular test, or inferences about some of the same content standards but not the same performance standards, or weaker forms of the same inferences, or, in some cases, they may be unable to support similar inferences at all. Which of these is true depends on the specific inferences at issue and the particular attributes of the modified or different assessments.
Accommodated assessments, in contrast, should have at least the potential for supporting the same inferences as regular assessments. Accommodations are widely viewed as the best means for increasing participation of students with disabilities in assessments. Nevertheless, the design and evaluation of accommodated assessments entail a number of difficult conceptual and empirical issues, which are discussed in the following sections.
Logic of Accommodations
Traditionally, standardization (of content, administrative conditions, scoring, and other features) has been used to make the results of assessments comparable in meaning from one test-taker to the next. For some students with disabilities, however, a standard assessment may yield scores that are not comparable in meaning to those obtained by other students because the disability itself biases the score. In many cases, students with disabilities would get a lower score than they should because the disability introduces construct-irrelevant variance, that is, variation in scores unrelated to the construct purportedly measured. Therefore, "in the case of students with disabilities, some aspects of standardization are breached in the interest of reducing sources of irrelevant difficulty that might otherwise lower scores artificially" (Willingham, 1988a:12).
Accommodations are intended to correct for distortions in a student's true competence caused by a disability unrelated to the construct being measured (Box 5-2). The risk of accommodations is that they may provide the wrong correction. They may provide too weak a correction, leaving the scores of individuals with disabilities lower than they should be, or they may provide an irrelevant correction or an excessive one, distorting or changing scores further and undermining rather than enhancing validity. This risk is explicitly recognized in the guidelines provided by some state education agencies for avoiding these errors, although their guidance is sometimes very general and limited. For example, Maryland's Requirements and Guidelines for Exemptions, Excuses, and Accommodations for Maryland Statewide Assessment Programs (Maryland State Department of Edu-
BOX 5-1 Alternate Assessments for Some Students with Disabilities
Most students with disabilities have mild disabilities and therefore will be able to participate in state assessment programs, although some will require accommodations. A much smaller percentage of students with disabilities require an alternate or different assessment because their curriculum does not match the content and performance standards assessed by the common test. Section 504 of the Rehabilitation Act of 1973 and the Americans with Disabilities Act require states and school districts to provide different assessments to the limited number of students who cannot otherwise participate effectively in the common assessment program. Goals 2000 and Title I also require administration of alternate assessments.
For students with such severe cognitive impairments that they require different assessments to measure different content, an "equally effective" aid, benefit, or service "must … afford [disabled] persons equal opportunity to obtain the same result, to gain the same benefit, or reach the same level of achievement" (34 C.F.R. 104.4[b]). The aid, benefit, or service would be the opportunity to participate, albeit by utilizing a different assessment that will provide these particular students "the same result" (Ordover et al., 1996:71–72).
The concept of a state-level alternate assessment for those students who cannot participate in the general assessment system was first proposed and developed by Kentucky. The alternate assessment is designed for those students with the most severe cognitive disabilities—in other words, students for whom traditional paper-and-pencil tests and certain performance events would be inappropriate. In Kentucky, the alternate assessment is a portfolio system in which information is kept on the student's progress toward academic expectations. The information in the portfolio can take many forms, including paper documents, recordings or videotapes, and pictures. Students' portfolios are rated using the same rubric as the performances of students in the regular assessments—novice, apprentice, proficient, and distinguished—and aggregated along with the scores of all other students.
Maryland also has an alternate assessment system for the subset of students with disabilities who are working on different standards from most students. These students are working on standards in four content domains (personal management, community, career/vocational, and recreation/leisure) and four learner domains (communication, decision making, behavior, and academic). The alternate assessment system is called
IMAP (Independence Mastery Assessment Program). Students in the alternate system complete a variety of performance tasks as well as a portfolio of their best work. The nature of the performance tasks and the contents of the portfolios are defined by the IMAP. Maryland's IMAP is still being field tested, and its eventual use for accountability has not yet been determined.
Several aspects of the notion of an alternate assessment reflect larger issues that surround the whole concept of increased participation of students with disabilities in assessments associated with standards-based reform. The major issue is defining who will participate in the alternate assessment rather than the common standards-based assessment system. Kentucky's definition limits this group to a relatively small number. This policy is reinforced by state admonitions that the percentage of students participating in the alternate assessment probably should not exceed 2 percent of the student population and that, if it does exceed this percentage, an audit will be performed to make certain that students are not being moved inappropriately into the alternate assessment. During its initial year, only 0.5 percent of students with disabilities were placed in the alternate assessment. In Maryland, the definition of who participates in IMAP is less clear. The state does advise generally that the percentage of students not participating in the general assessment system should be small, and the percentages are published along with test results.
Only six states currently offer students with disabilities alternates to the common assessment (Bond et al., 1996), and no research exists on the ability of alternate assessments either to measure students' educational progress validly or to encourage greater accountability for students with disabilities. We do know, however, that the design of alternate assessments poses all the same technical challenges as the development of valid accommodations for the common assessment. Nevertheless, alternate assessments remain a promising strategy for expanding the participation of students with disabilities in the public accountability system, even for those unable to take the common assessment. However, if accountability is a primary reason for expanding the participation of students with disabilities in state assessments, then it is important that those who take alternate assessments are also accounted for publicly. The scoring rubric may not be the same for an alternate version as for the common assessment, thus making it difficult to report comparable data. Still, the criteria used to assign students to the alternate assessment should be well defined and the number of students taking that test publicly reported.
BOX 5-2 Accommodation as a Corrective Lens
A useful metaphor for understanding accommodations is that of a corrective lens. Even in the absence of disabilities or other complicating factors, tests are imperfect measures of the constructs they are intended to assess. Envision a student's true competence in reading, for example, as a point on a vertical scale. To one side is an identical scale of that student's observed competence, as reflected by performance on an assessment. Between the two scales is a lens causing some diffraction, so that true competence is represented (over repeated measurements) by an array of points on the observed-competence scale that form a blurry image of the true, unmeasured competence. If the test is well designed, this image will be centered on the true value (i.e., it will be unbiased), and it will not be too blurry (i.e., it will be reliable). Standardization of assessment methods and procedures is a key to obtaining a reasonably good image of the true attribute. Without standardization, some individuals will obtain scores that are inaccurate because of irrelevant factors, such as being given different amounts of time to take the assessment or having their work scored according to different criteria. This could both increase the blurriness of the image and bias it.
Accommodations are based on the premise that, for some individuals with disabilities, the logic of standardization is misleading and scores obtained under standard conditions provide a distorted view of the true attribute of interest. The average for a group may be lower than it should be (or biased downward) and the scores for individuals will be biased to various degrees, depending on such factors as the severity of the disability, the coexistence of multiple disabilities, or perhaps less familiarity with assessments. Accommodations are intended to function as a corrective lens that will deflect the distorted array of observed scores back to where they ought to be—that is, back to where they provide a more valid image of the performance of individuals with disabilities.
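The measurement model behind this metaphor can be expressed in a few lines of code. The sketch below is purely illustrative: the true score, the size of the construct-irrelevant bias, and the noise level are invented numbers, not empirical values.

```python
import random

random.seed(1)  # fixed seed so the illustration is reproducible

def observe(true_score, noise_sd=5.0, disability_bias=0.0, accommodation=0.0):
    """Toy measurement model for the corrective-lens metaphor: an observed
    score is the true score, blurred by random error (the lens), pulled down
    by any construct-irrelevant effect of a disability, and pushed back up
    by an accommodation. All magnitudes here are illustrative assumptions."""
    return true_score - disability_bias + accommodation + random.gauss(0, noise_sd)

def mean_observed(n=10_000, **kwargs):
    """Average many repeated measurements of a true score of 70."""
    return sum(observe(70.0, **kwargs) for _ in range(n)) / n

print(round(mean_observed(), 1))                                   # no bias: near 70
print(round(mean_observed(disability_bias=10.0), 1))               # biased low
print(round(mean_observed(disability_bias=10.0,
                          accommodation=10.0), 1))                 # lens corrects
print(round(mean_observed(disability_bias=10.0,
                          accommodation=20.0), 1))                 # overcorrected
```

The last line illustrates the risk discussed in the text: an accommodation that is too strong shifts scores past the true value rather than back to it.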
cation, 1995) says that "accommodations must not invalidate the assessment for which they are granted" (p. 2, emphasis in the original). However, the only guidance it provides for meeting this standard is a single pair of examples (p. 3):
Addressing the issue of validity involves an examination of the purpose of the test and the specific skills to be measured. For example, if an objective of the writing test is to measure handwriting ability, that objective would be substantially altered by allowing a student to dictate his/her response. On the other hand, if a writing objective stated that the student was to communicate thoughts or ideas, handwriting might be viewed as only incidental to achieving the objective. In the latter case, allowing the use of a dictated response probably would not appreciably change the measurement of the objective.
Unfortunately, many cases will be far less clear than this, and accommodations may not succeed in increasing validity even when they seem clear and logical on their face.
Designing and Evaluating Accommodations
To design an accommodation that will increase the validity (meaningfulness) of scores for students with disabilities, one must first identify the nature and severity of the distortions the accommodation will offset. These distortions depend on the disability, the characteristics of the assessment, the conditions under which the assessment is administered, and the inferences that scores are used to support.
Research has shown that different disabilities can cause different distortions in scores. Ragosta and Kaplan (1988), for example, surveyed students with disabilities about their experiences taking the Scholastic Aptitude Test (SAT) and the Graduate Record Examination (GRE). Although the poor response rate to the survey makes generalization risky, respondents' answers indicated that different disability groups face different difficulties in taking tests.13 One blind student explained that items requiring extensive reading were particularly difficult for him, even when given a braille test, because "braille does not permit skimming" (Ragosta and Kaplan, 1988:62). Bennett et al. (1988) showed that time pressure varied for test-takers with disabilities depending on their disability and the accommodations they were offered. They also showed that, in the case of the SAT, unexpected differential item performance (that is, items that were relatively too hard or too easy for examinees, given their overall performance) was generally rare for most students with disabilities but was more common for blind students taking braille examinations.
However, predicting distortions in scores on the basis of disabilities is likely to be more difficult and controversial for elementary and secondary school students with disabilities than for the older students in the aforementioned research studies. One reason is the ambiguity of many classifications. Much of the case law and research pertaining to accommodations has focused on disabilities that are fairly unambiguous in terms of both diagnosis and functional implications, such as visual, hearing, and physical disabilities. In contrast, many of the students currently identified for special education have disabilities—in particular, learning disabilities—that do not have clear or consistently used diagnostic criteria or characteristics, as explained in Chapter 3. The classification of students
with certain disabilities has been inconsistent among school jurisdictions and over time, and the classification of students in school settings is often inconsistent with research or clinically based definitions (Bennett and Ragosta, 1988; Shepard, 1989; Willingham, 1988b; Lyon, 1996). The co-occurrence of more than one disability, which is common, clouds classification even further.
Because disability classifications tell us who may have underlying functional characteristics that are linked to potential score distortions, ambiguities or inconsistencies in classifying students with disabilities have serious implications for assessments. To the extent that a disability classification is valid for a particular student, then testing accommodations can be selected that offset any potential score distortions resulting from the student's disability, without compromising assessment data about performance on the domains measured. However, if classification of a disability is incorrect or imprecise, determining whether the accommodations selected are valid will be difficult.
A second source of difficulty, as previously noted, is that many elementary and secondary school students with disabilities have cognitive disabilities that are related to the achievement constructs being measured. The decrease in the reported prevalence of mental retardation (MR) and the increase in the reported prevalence of specific learning disabilities (SLD) in recent years underscore this problem. In some cases, a low score may be accurate for a student with mental retardation but misleadingly low for a student with a specific learning disability, and an inability to distinguish between them reliably clouds the interpretation of their test scores.
Efforts to identify the links between disability categories and distortions in test scores are likely to be complicated by the widespread trend in special education policies away from the use of formal taxonomies of disabilities to make decisions about individual children. For example, Maryland's guidelines for accommodations expressly mandate that "accommodations must be based upon individual needs and not upon a category of disability" (Maryland State Department of Education, 1995:2).
The tension between differing needs and the uses of taxonomies of disability has been recognized for some time, but the inclusion of students with disabilities in standards-based reforms may make it more prominent. For example, Shepard (1989) noted that, for purposes of placement, it is often more important to ask what characteristics of a given child would make him or her a good candidate for special education treatments than to formally categorize his or her disability. Shepard noted, however, that the taxonomy that is useful for "construct diagnosis" (p. 568) or research purposes is often different from that needed for decisions about placement or practice. The taxonomic information needed to design validity-enhancing accommodations may be more like that needed for research than that needed for educational placement and practice. For decisions about placement and instruction, the critical information for disability classification is whether a particular group of students shares the need for, or ability to profit
from, specific educational interventions. For research purposes, however, other bases for classification may be important, such as shared causation of the disability (etiology). For assessment purposes, the key basis for classification is shared distortions in the meaning of unaccommodated test scores and shared responsiveness to specific accommodations. Therefore, it would be profitable for purposes of assessment to group students on the basis of disability, if doing so made it more feasible to implement specific accommodations that would enhance the validity of their scores, even if that classification had little usefulness for decisions about placement or instruction.
Research Evidence About Accommodations
Research on the validity of scores from accommodated assessments is limited, and little of it is directly applicable to the assessments that are central to standards-based reform. Much of the available evidence pertains to college admissions tests and other postsecondary tests (e.g., Wightman, 1993; Willingham et al., 1988).
Generalizing from the research on college admissions and postsecondary examinations would be risky. The populations are both higher-achieving and generally older than those taking the standards-based assessments, and the tests are different. In addition, given the purposes of college admissions tests, this research focused on predictive evidence of validity, which is less germane than concurrent evidence in the case of standards-based reform. Nonetheless, this research is suggestive, and there are reasons to suspect that it understates the difficulties that may arise when accommodations are offered in standards-based assessments. The groups taking tests like the SAT and GRE are higher-achieving than the student population as a whole and presumably include relatively few of the students who obtain low scores because of disabilities. In addition, until recently, students whose disabilities are directly related to tested constructs constituted a relatively small percentage of those taking college admissions tests and postsecondary exams; students with mental retardation generally do not take them, and until recently, relatively few of the students who took them were reported to have learning disabilities. Thus, most of these studies include relatively few students from the groups for whom the validity of scores is likely to be particularly problematic or especially difficult to ascertain, yet these students constitute well over half of all elementary and secondary school students with disabilities.
During the 1980s, researchers at the Educational Testing Service (ETS) conducted a series of studies of students with disabilities taking the SAT and the GRE, under both normal and accommodated conditions (Willingham et al., 1988). In terms of internal criteria—that is, evidence from the tests themselves—the results of the ETS studies found that reliability, factor structure, and test content appeared similar for students with and without disabilities. There was little evi-
dence of differential item functioning for students with hearing impairments, physical impairments, and learning disabilities.14
Predictive evidence for accommodated scores of students with disabilities, however, was weaker than for scores obtained under standard conditions. In general, test performance less accurately predicted subsequent grade point average (GPA) for students with disabilities than for individuals without disabilities.15 Furthermore, GPA was overpredicted for most groups with disabilities, suggesting that test scores had been overcorrected or inflated somewhat by accommodations.
The findings on the amount of time offered during assessments are particularly important. Certain disabilities and accommodations can slow the pace of examinees and, in such cases, providing additional time may be required to offset this distortion. However, ETS researchers found no evidence that individuals with disabilities taking the SAT in the aggregate actually required more time to complete the test. On the contrary, they found that, regardless of whether accommodations were used or which accommodations were used, students with hearing impairments, learning disabilities, physical disabilities, and visual impairments (even those students using a braille form) were more likely than other test-takers to complete both the verbal and mathematics sections in the scheduled time (Bennett et al., 1988:89).16
Some of these studies also found that the overprediction was strongest for relatively high-scoring students with learning disabilities who were given more time (Braun et al., 1988). They suggest that the extra time offered to some students with disabilities on the SAT may contribute to the overprediction of GPA by overcompensating for disabilities. Another study showed similar results when students with disabilities were given extra time on the Law School Admission Test (LSAT), leading the authors to suggest that "refinements in testing accommodations that adjust the amount of extra time to meet the specific needs of each accommodated test taker might decrease the amount of overprediction" (Wightman, 1993:52).
It is unclear how these findings apply to elementary and secondary school students with disabilities. The effects may vary, for example, as a function of the type of assessment or the age of students. Nonetheless, these findings suggest that a need for additional time should not be assumed. Clearly, the effects on test scores of providing additional time warrant empirical investigation.
The 1995 field test of the National Assessment of Educational Progress (NAEP) in mathematics and science provided evidence about accommodations more directly relevant to standards-based reform, because NAEP is similar to state standards assessments in terms of the students tested and the focus on measuring achievement. Because of study limitations, however, additional research is needed to confirm the NAEP findings. In order to explore the feasibility of including more students with disabilities in NAEP, the field test introduced two changes in NAEP procedures. First, the study introduced stricter rules governing the exclusion of students with disabilities from NAEP. Second, the study permitted a variety of assessment accommodations, which until then had been unavailable.
The study results showed that it is feasible to assess an appreciable percentage of students with disabilities who had previously been excluded from the assessment. Most of these students (approximately 60 percent) had learning disabilities. All but 13 percent were in general education classrooms for some part of the day. The authors attributed the increased participation rates of students with disabilities more to the provision of accommodations than to the new inclusion criteria (Phillips, 1995). In both grades 4 and 8, approximately 48 percent of the test-takers with disabilities used accommodations for achievement testing.
However, the field trial results for students with disabilities who had been offered accommodations could not be reported on the NAEP scale. The NAEP researchers offered several reasons for this conclusion: "Generally, the assessment was less discriminating for the IEP sample, with about two-thirds of the items having smaller item-total correlations for the IEP group and with [some] items having negative correlations. Also, omit rates were generally higher [for the IEP sample]" (Anderson et al., no date:39). The negative correlations on some items indicate that, as the proficiency of students with disabilities increased, their performance on those items actually decreased; this finding was confirmed by other analyses. In addition, a substantial number of items showed "differential item functioning," indicating a bias either for or against students with disabilities.
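The item-total correlations the NAEP researchers describe can be computed directly from a matrix of scored responses. The sketch below uses invented 0/1 data to show how a negative value arises when success on an item runs against overall performance; as a simplification, each item's score is correlated with a total that includes the item itself.

```python
def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def item_total_correlations(score_matrix):
    """Item-total correlation for each item: the correlation between the
    0/1 scores on that item and examinees' total scores. A small or
    negative value means the item fails to discriminate between stronger
    and weaker examinees on the test as a whole."""
    totals = [sum(row) for row in score_matrix]
    n_items = len(score_matrix[0])
    return [pearson([row[i] for row in score_matrix], totals)
            for i in range(n_items)]

# hypothetical responses (rows = examinees, columns = items):
# item 0 tracks overall performance; item 2 runs against it
scores = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
]
print([round(r, 2) for r in item_total_correlations(scores)])
```

Here item 2 is answered correctly mainly by the lower-scoring examinees, so its item-total correlation comes out negative, the same pattern NAEP observed for some items in the IEP sample.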
The study results, however, are only suggestive in their findings about the effects of accommodations on score comparability. The study was limited by small sample sizes, a problem exacerbated by the matrix-sampled nature of the test, which dramatically reduces the number of students administered any given item. In addition, study results were made more equivocal because of "the multiplicity of student disabilities and corresponding accommodations" (Anderson et al., no date:37). It is possible that, if samples were sufficiently large for specific combinations of disabilities and accommodations, assessment results could be
adequately scaled for some groups of students with disabilities. However, most state assessments will be faced with similar heterogeneity and small group sizes. Consequently, standards-based assessments are most likely to generate results for a multiplicity of disabilities and accommodations, with few of the specific combinations frequent enough to support separate scaling of assessment results.
Clearly, more research on the validity of scores from accommodated testing is needed—in particular, research tailored directly to the particular assessments and inferences central to standards-based reform. In the interim, the existing research, although limited and based largely on different populations and types of assessments, suggests the need for caution. The effects of accommodation cannot be assumed and may be quite different from what an a priori logical analysis might suggest.
Promising Approaches in Test Design
Researchers and developers in the field of measurement are continually experimenting with and expanding the modes, formats, and technologies of testing and assessment. In addition to performance assessment, test developers and psychometricians are studying new ways of constructing test items and using computer technologies. Continued development of new forms of test construction may hold promise for the assessment of students with disabilities.
Item response theory (IRT) is one promising development. It is rapidly displacing classical test theory as the basis for modern test construction. IRT models describe "what happens when an examinee meets an item" (Wainer and Mislevy, 1990:66). IRT refers to a broad class of methods for constructing and scaling tests based on the notion that students' performance on a test should reflect one latent trait or ability and that a mathematical model should be able to predict performance on individual test items on the basis of that trait.17 To use IRT modeling in test construction and scoring, test items are first administered to a large sample of respondents. Based on these data, an IRT model is derived that predicts whether a given item will be answered correctly by a given individual on the basis of estimates of the difficulty of the particular item and the ability level of the individual. A well-fitting model yields information about the difficulty of items for individuals of differing levels of ability. Items for which the model does not fit—that is, for which students' estimated ability does not predict performance well on the specific item—are typically discarded. This information is used subsequently to score performance when the test items are given to actual examinees.18
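For instance, in the widely used two-parameter logistic (2PL) IRT model, the probability of a correct response depends only on the examinee's latent ability and two item parameters, difficulty and discrimination. The parameter values below are illustrative, not drawn from any real assessment.

```python
import math

def p_correct(theta, a, b):
    """Two-parameter logistic (2PL) IRT model: probability that an
    examinee with ability `theta` answers an item correctly, given the
    item's discrimination `a` and difficulty `b`. When theta equals b,
    the probability is exactly 0.5."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# an easy item and a hard item, both fairly discriminating (invented values)
easy = dict(a=1.5, b=-1.0)
hard = dict(a=1.5, b=2.0)

for theta in (-2.0, 0.0, 2.0):
    print(f"ability {theta:+.1f}: "
          f"P(easy) = {p_correct(theta, **easy):.2f}  "
          f"P(hard) = {p_correct(theta, **hard):.2f}")
```

As the printout shows, the model separates examinees of differing ability: the probability of success rises with ability on every item, but faster around each item's own difficulty level.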
Item response theory offers several potential advantages for including students with disabilities in large-scale assessments. First, in many instances, assessments based on item response theory allow for everyone's scores to be placed on a common scale, even though different students have been given different items. Given the wide range of differences in performance levels across all students, including students with disabilities, it is unlikely that the same set of items will be appropriate for everyone. Second, item response theory makes it possible to assess changes in the reliability (precision) of scores as a function of a student's level of ability in what the assessment is measuring. Thus it is possible to identify an assessment that may not be reliable for low-scoring students with disabilities despite the fact that it has adequate reliability for high-scoring students. Third, item response theory provides sophisticated methods for identifying items that are biased for students with disabilities. Its use in this specific context raises a number of theoretical and practical issues, the exploration of which could prove very useful.
Computer-based testing is another area in which research holds great promise (Bennett, 1995). One of the assessment accommodations most often given to students with disabilities is extra time. But as noted earlier, extra time should be provided with caution, as it may undermine the validity of scores. Computer-based "adaptive" assessments allow students with a wide range of skills to be tested at a reasonable level of reliability and in a shorter amount of time by individually adapting the items presented to a test-taker's estimated level of skill, as gauged by his or her performance on an initial set of items. When tests can be administered individually on a computer, and when time pressure is lessened, it becomes possible to "give more time to everyone." Thus, computer-based adaptive tests can be shorter than traditional tests but still comply with measurement principles. The test is changed in a way that reduces the need for accommodated administration, thereby circumventing the problem of changes in the validity of scores due to accommodations. Finally, computer-based tests may allow students with disabilities to participate in simulated hands-on assessments by the addition of adaptive input devices, for example, a light pen mounted on a head strap. Such assessments can replace actual hands-on assessments that often require manual movements that are impossible for some students with disabilities. However, as Baxter and Shavelson (1994) have shown, computerized simulations of hands-on tasks can yield results surprisingly unlike those generated by the original tasks, so this approach will require careful evaluation.
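The core of an adaptive administration can be sketched simply: present the item nearest the current ability estimate, then move the estimate after each response. Real adaptive tests update the estimate with IRT scoring rather than the crude halving step used here, and the item bank and examinee below are invented for illustration.

```python
def next_item(theta_hat, remaining):
    """Pick the unused item whose difficulty is closest to the current
    ability estimate -- the central idea of a computer-adaptive test."""
    return min(remaining, key=lambda b: abs(b - theta_hat))

def adaptive_test(answers, item_bank, n_items=5, step=1.0):
    """Toy adaptive administration: start at average ability, move the
    estimate up after a correct answer and down after an incorrect one,
    halving the step each time (a crude stand-in for an IRT update).
    `answers` maps item difficulty -> whether this examinee answered it
    correctly."""
    theta_hat = 0.0
    remaining = list(item_bank)
    for _ in range(n_items):
        b = next_item(theta_hat, remaining)
        remaining.remove(b)
        theta_hat += step if answers[b] else -step
        step /= 2.0
    return theta_hat

# hypothetical bank of items spaced by difficulty
bank = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
# an examinee who answers items at or below difficulty 0.5 correctly
examinee = {b: b <= 0.5 for b in bank}
print(adaptive_test(examinee, bank))
```

After only five items the estimate settles between the hardest item this examinee can answer and the easiest one he or she cannot, which is why an adaptive test can be shorter than a fixed form of comparable precision.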
Reporting on the Performance of Students with Disabilities
Because educational accountability depends on public knowledge about school and student performance, scores on assessments must be communicated in ways that provide accessible, valid, and useful information. Systems vary in their reporting mechanisms, depending on their primary units of accountability (state,
district, school, classroom, individual student), the frequency of testing and grade levels tested, and the uses of assessment data. The design of reporting mechanisms always involves critical choices because schools and groups of students are typically compared with each other, with themselves over time, or against a set of performance standards. In some cases, these comparisons may also result in rewards and sanctions. Consequently, ensuring fair comparisons becomes a major issue. The public's right to know and to have accountable schools must be balanced against individual student rights and the disparate resources and learning opportunities available to different schools and students.
Creating a fair and responsible reporting mechanism is one of the major challenges associated with expanding the participation of students with disabilities in large-scale assessments and public accountability systems. In this section, we examine two issues that must be considered in reporting on the performance of students with disabilities. One issue pertains to flagging—making a notation on the student's score report that identifies scores as having been obtained with accommodations or under other nonstandard conditions. A second issue relates to disaggregation—the separate reporting of scores for groups such as students with disabilities. In part, the resolution of these issues hinges on the uses to which scores are put, such as whether scores are reported at the aggregate or individual level. However, in many instances, there is no unambiguous resolution of these issues. The research base that might guide decisions is limited and, perhaps more important, an emphasis on different values leads to different conclusions about the best resolution.
Flagging is a concern when a nonstandard administration of an assessment—for example, providing accommodations such as extra time or a reader—calls into question the validity of inferences (i.e., the meaning) based on the student's score. Flagging warns the user that the meaning of the score is uncertain. The earlier section on validity and accommodations identified factors that suggest uncertainty about the meaning of scores from accommodated assessments.
However, since flagged scores are typically not accompanied by any descriptive detail about the individual or even the nature of accommodations offered, flagging may not really help users to interpret scores more appropriately. It may confront them with a decision about whether to ignore or discount the score simply because of the possibility that accommodations have created unknown distortions. Moreover, in the case of scores reported for individual students, flagging identifies the individual as having a disability, raising concerns about confidentiality and possible stigma.
In some respects, flagging is less of a problem when scores are reported only at the level of schools or other aggregates. Concerns about confidentiality and unfair labeling are lessened. Moreover, to the extent that the population with
disabilities and assessment accommodations is similar across the aggregate units being compared (say, two schools in one year, or a given school's fourth grades in two different years), flagging in theory would have little effect on the validity of inferences. In practice, however, the characteristics of the group with disabilities may be quite different from year to year or from school to school. Moreover, decisions about accommodations and other modifications may be made inconsistently. Thus, even in the case of scores reported only for aggregates, flagging may be needed to preserve the validity of inferences.
When testing technology has sufficiently advanced to ensure that accommodations do not confound the measurement of underlying constructs, then score notations will be unnecessary. Until then, however, flagging should be used only with the understanding that the need to protect the public and policy makers from misleading information must be weighed against the equally important need to protect student confidentiality and prevent discriminatory uses of testing information.
It is not yet clear what kinds of policies states and districts will adopt about disaggregating results for students with disabilities and other groups with special needs, but they will have to do at least some disaggregation under the Title I program. The new federal Title I legislation requires the results of Title I standards-based assessment to be disaggregated at the state, district, and school levels by race and ethnicity, gender, English proficiency, migrant status, and economic disadvantage and by comparisons of students with and without disabilities.
There are several arguments in favor of disaggregating the scores of students with and without disabilities. The first argument is one of validity: if the scores of some students with disabilities are of uncertain meaning, the validity of comparisons for the whole group would be enhanced by separating those scores. The second is about fairness: schools have varying numbers of students with disabilities from one cohort to another, and, to the extent that some of these students face additional educational burdens, disaggregation would lead to fairer comparisons. The third argument is one of accountability: separately reporting the scores of students with disabilities will increase the pressure on schools to improve the education offered to them. (Note that these same arguments apply for any group of students for whom scores are of uncertain meaning or who could benefit from a separate analysis of their performance, for example, students with limited English proficiency and Title I students.)
Whatever its merits, however, disaggregation confronts serious difficulties pertaining to the reliability of scores. One reason is simply the small number of students involved. The problems of low numbers will be most severe for school-level reporting. An elementary school that has 50 students in a tested grade, for example, is likely to have perhaps 4 to 6 students with disabilities in that grade. The unreliability of disaggregated scores is exacerbated by the ambiguous and
variable identification of students as having a disability; a student identified and hence included in the score for students with disabilities in one school or cohort may well not be identified in another. The diversity of these students also augments the problem of reliability; in one cohort of five students with disabilities, there might be one with autism and one with retardation, whereas another cohort might include none with autism or retardation but a highly gifted student with a visual disability. Thus, for example, a change that would appear to indicate improvement or deterioration in the education afforded students with disabilities in fact could represent nothing but differences in the composition of the small groups of students with disabilities.
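The compositional swing described above can be illustrated with invented numbers (all scores below are hypothetical; only the group size of five comes from the text):

```python
# Two hypothetical cohorts of five students with disabilities at one school.
# All scores are invented for illustration only.
cohort_year_1 = [160, 180, 195, 200, 210]  # e.g., includes a student with autism
cohort_year_2 = [205, 215, 220, 230, 240]  # e.g., includes a gifted student with a visual disability

mean_year_1 = sum(cohort_year_1) / len(cohort_year_1)  # 189.0
mean_year_2 = sum(cohort_year_2) / len(cohort_year_2)  # 222.0

# A 33-point apparent "gain" that reflects only who happened to be in each
# small cohort, not any change in the education these students received.
print(mean_year_1, mean_year_2)
```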
The enormous diversity among those in the category of students with disabilities is the primary argument in favor of a more detailed disaggregation of scores by type of disability. In theory, detailed disaggregation could alleviate some of the distortions caused by cohort-to-cohort differences in disabilities. Moreover, it could provide more meaningful comparisons. Students whose disability is partial blindness, for example, might be more meaningfully compared with students without disabilities than with students with mental retardation or autism. Detailed disaggregation exacerbates the problem of small numbers, however, particularly for the less common disabilities. For example, the national prevalence rate for identified visual disabilities served under the Individuals with Disabilities Education Act (IDEA) or the state-operated programs of Chapter 1 in the 6–17 age range was 0.05 percent in the 1993–94 school year (U.S. Department of Education, 1995:Table AA16). Thus, in our hypothetical example of an elementary school with 50 students in a tested grade, one can expect a student identified as visually impaired to appear in that grade, on average, once every 40 years. Although detailed disaggregation may improve the meaningfulness of results for larger groups and larger aggregates, it will not provide useful aggregate comparisons for smaller disability groups or smaller aggregates. Detailed disaggregation also would run counter to the current movement within special education to avoid formal classifications and to focus instead on individual students' functional capabilities and needs.
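The once-every-40-years figure follows directly from the prevalence rate; a minimal check of the arithmetic (the 0.05 percent rate and 50-student cohort come from the text):

```python
# Expected occurrence of a low-incidence disability in a small tested grade.
prevalence = 0.0005   # 0.05 percent national prevalence of identified visual disabilities
cohort_size = 50      # students in the tested grade of the hypothetical elementary school

expected_per_cohort = prevalence * cohort_size  # about 0.025 students per year
years_between = 1 / expected_per_cohort         # about 40 years between occurrences

print(expected_per_cohort, years_between)
```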
As with flagging, those making decisions about data disaggregation in state reporting systems should weigh the need for valid and useful information equally with consideration of any potentially adverse effects on individuals. Care must be taken so that disaggregated data do not allow identification of results for individual students. The usual approach to this problem is not to report results for any cell in a table with a sample size below a certain number of students (e.g., five).
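The usual suppression rule can be expressed as a simple filter (a sketch; the threshold of five is the example from the text, and the group labels and scores are illustrative):

```python
# Suppress reporting cells whose student count falls below a minimum size,
# so that disaggregated results cannot be traced to individual students.
MIN_CELL_SIZE = 5  # example threshold from the text

def suppress_small_cells(cells):
    """Replace results for undersized cells with a suppression marker.

    `cells` maps a group label to a (student_count, mean_score) pair.
    """
    reported = {}
    for group, (n_students, mean_score) in cells.items():
        if n_students < MIN_CELL_SIZE:
            reported[group] = "suppressed (n < 5)"
        else:
            reported[group] = mean_score
    return reported

results = {
    "no disability": (45, 214.0),
    "learning disability": (4, 198.5),  # below threshold: suppressed
    "visual disability": (1, 205.0),    # below threshold: suppressed
}
print(suppress_small_cells(results))
```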
Legal Framework for Assessing Students with Disabilities19
The federal statutes and regulations governing the education of students with disabilities recognize the importance of the validity of tests and assessments. The
regulations implementing both the IDEA and Section 504 of the Rehabilitation Act of 1973 require that tests and other evaluation materials must be validated for the specific purpose for which they are used. Both sets of regulations also require that, when a test is administered to a student with impaired sensory, manual, or speaking skills, the test results accurately reflect the child's aptitude or achievement level or whatever other factors the test purports to measure, rather than reflecting the student's disabilities.
Accommodations for disabilities in testing or assessment are also required by these federal statutes and regulations. Both Section 504 and the Americans with Disabilities Act (ADA) require that individuals with disabilities be protected against discrimination on the basis of disability and be given access to programs and services as effective as those received by their peers without disabilities. The ADA regulations require that public entities make "reasonable modification" in policies, practices, and procedures when "necessary to avoid discrimination on the basis of disability, unless the public entity can demonstrate that making the modifications would fundamentally alter the nature of the service, program, or activity" (28 CFR 35.130[b]). Alternate forms or accommodations in testing are required, but alterations of the content of what is tested are not required by law.
For purposes of analyzing potential legal claims on behalf of students with disabilities, distinctions among the various purposes and uses of assessments become critical. Assessments may, for example, be designed primarily as an accountability mechanism for schools and school systems. They may also be used as an integral part of learning, instruction, and curriculum. Or a particular test or tests may be used as a basis for making high-stakes decisions about individual students, including who is placed in the honors curriculum, who is promoted from grade to grade, and who receives a high school diploma or a certificate indicating mastery of a set of skills deemed relevant to the workplace. Each use raises its own set of legal issues and has different implications.
As a general rule, the greater the potential harm to students, the greater the protection that must be afforded to them and the more vulnerable the assessment is to legal challenge. One set of federal courts has already addressed the constitutional issues arising when a state links performance on a statewide test to the award of a high school diploma. A federal appellate court held unconstitutional a Florida law requiring students to pass a statewide minimum competency test in order to receive a high school diploma. The court in Debra P. v. Turlington held that the state's compulsory attendance law and statewide education program granted students a constitutionally protected expectation that they would receive a diploma if they successfully completed high school. Since the state possessed this protected property interest, the court held that the state was barred under the due process clause of the federal Constitution from imposing new criteria, such as the high school graduation test, without adequate advance notice and sufficient educational opportunities to prepare for the test. The court was persuaded that
such notice was necessary to afford students an adequate opportunity to prepare for the test, to allow school districts time to develop and implement a remedial program, and to provide an opportunity to correct any deficiencies in the test and set a proper cut score for passing (644 F. 2d 397, 5th Cir. 1981; see also Brookhart v. Illinois State Bd. of Ed., 697 F. 2d 179, 7th Cir. 1983).20
The court in Debra P. further held that, in order for the state's test-based graduation requirements to be deemed constitutional, the high school test used as its basis must be valid. In the view of the court, the state had to prove that the test fairly assessed what was actually taught in school. Under this concept, which the court referred to as "curricular validity," the test items must adequately correspond to the required curriculum in which the students should have been instructed before taking the test, and the test must correspond to the material that was actually taught (not just supposed to have been taught) in the state's schools.
As the court in Debra P. held: "fundamental fairness requires that the state be put to the test on the issue of whether the students were tested on material they were or were not taught…. Just as a teacher in a particular class gives the final exam on what he or she has taught, so should the state give its final exam on what has been taught in its classrooms" (644 F.2d at 406). In reaching this ruling, the court specifically rejected the state's assurance that the content of the test was based on the minimum, state-established performance standards, noting that the state had failed to document such evidence and that no studies had been conducted to ensure that the skills being measured were in fact taught in the classrooms (Pullin, 1994).
The same types of issues addressed by the court in Debra P. were also assessed in federal litigation on the impact of a similar test-for-diploma requirement imposed by a local school district in Illinois. The Illinois case, Brookhart v. Illinois State Board of Education (697 F. 2d 179), specifically assessed the impact on students with disabilities who had been in special education of using a minimum competency test to determine the award of high school diplomas. The court held that students with disabilities could be held to the same graduation standards as other students, but that their "programs of instruction were not developed to meet the goal of passing the [minimum competency test]" (697 F. 2d at 187). The court found that "since plaintiffs and their parents knew of the [test] requirements only one to one-and-a-half years prior to the students' anticipated graduation, the [test] objectives could not have been specifically incorporated into the IEP's over a period of years." The court counseled that the notice or opportunity to learn requirement could be met if the school district could ensure that students with disabilities are sufficiently exposed to most of the material that
appears on the test. These constitutional principles are consistent with the opportunity-to-learn requirements derived from the IDEA, Section 504, the ADA, and state constitutions.
The expanded participation of students with disabilities in state assessments, coupled with the curriculum and performance standards embodied in standards-based reform, are likely to raise new legal questions and require additional interpretations of existing statutes. Nevertheless, it is clear that several legal principles will continue to govern the involvement of students with disabilities in state assessments. Chief among them are the requirements that reasonable accommodations or alternate testing forms be provided consistent with the content being measured and that, in the case of assessments with individual consequences, students be afforded the opportunity to learn the content tested.
As states and school districts implement new forms of assessment, they face both development and operations costs. Performance-based assessments need to be developed, field tested, and made available to teachers and schools. While most development costs are incurred in the first few years, item pools need to be replenished and upgraded. The cost of replenishing the pool will be driven in part by the use of, and thus the need to secure, the items. Operational costs are ongoing. Teachers must be trained in how to administer and score new assessment formats, as well as how to integrate performance-based tasks into their daily teaching. Teachers also need to be shown how to make appropriate modifications and adaptations in assessments for students with special needs, including students with disabilities. And unlike standardized tests, which are scored externally and reported automatically, the new assessments require that teachers be given time to score and interpret the results.
We know little about the cost of developing and implementing large-scale performance-based assessment systems, and we have no empirical data on the cost of including students with disabilities in these assessments. Estimated costs of performance-based assessment programs range from less than $2 to over $100 per student tested. This variation reflects differences in the subjects tested, how many students are tested, how they are assessed (e.g., mix of multiple-choice, open-ended questions, performance tasks, portfolios), who is involved in the development, administration, and scoring of the test (e.g., paid contractors or volunteer teachers), how much and what kind of training is provided, and the type and source of materials used in the assessment tasks. We do know, however, that compared with machine scoring of traditional tests, scoring costs for performance tasks are much greater. In addition, because of the large number of items on traditional tests, individual test items can be retained over several years. But tasks used for performance assessments must be replaced more frequently, compounding costs associated with item development and equating.
Comfort (1995, as cited in Stecher and Klein, 1997), for example, reported that the science portion of the California Learning Assessment System (CLAS)—half multiple-choice and half hands-on testing—cost the state just $1.67 per student, but much of the time needed to develop, administer, and score the science performance tasks was donated by teachers, and many of the materials used in the assessment were contributed as well. Picus (1995) found that Kentucky spent an average of $46 per student tested for each annual administration between 1991 and 1994, or about $9 per student for each of the five subjects tested. This figure also does not include any teacher or district expenditures (e.g., for training or teacher time for scoring student portfolios).
In contrast, Monk (1995) projects the cost of implementing the New Standards Project assessment system at $118 per tested student; this approach, involving a consortium of states and local districts, incorporates a considerable level of professional development (about 20 percent of operating costs) and a heavy emphasis on cumulative portfolio assessment. Stecher and Klein (1997) estimate that one period of hands-on science assessment for a large student population, administered under standardized conditions, would cost approximately $34 per student, about 60 times the cost of a commercial multiple-choice science test. Although one session of performance assessment is sufficient to generate reliable school or district scores, three to four periods of performance tasks are needed to produce an individual student score as reliable as one period of multiple-choice testing, potentially raising the cost of performance assessment even higher.
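The cost implication of producing reliable individual scores can be sketched from the figures above (the $34 per period, the 60-to-1 ratio, and the three-to-four-period requirement come from the text; the derived dollar amounts are approximations):

```python
# Rough cost comparison implied by the Stecher and Klein (1997) estimates.
cost_per_performance_period = 34.0  # dollars per student, one hands-on period

# The text says one performance period costs about 60 times a commercial
# multiple-choice science test, implying roughly $0.57 per student.
multiple_choice_cost = cost_per_performance_period / 60

# Three to four periods are needed for an individual score as reliable as
# one period of multiple-choice testing.
individual_score_cost = (3 * cost_per_performance_period,
                         4 * cost_per_performance_period)  # ($102, $136)

print(round(multiple_choice_cost, 2), individual_score_cost)
```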
Accommodations in assessment and instruction generally entail additional costs. Sometimes these costs are minimal, such as providing a student with a calculator. But often the costs are more significant and involve additional personnel, equipment, and materials; examples include providing a reader or scribe, preparing braille or large-print editions of an assessment, and providing high-tech equipment.
IMPLICATIONS OF INCREASED PARTICIPATION OF STUDENTS WITH DISABILITIES
As noted earlier, many people have encouraged the participation of students with disabilities in large-scale assessments with the hope that it will increase their participation in the general education curriculum and result in greater accountability for their educational performance. At this time, evidence is scarce about how the participation of students with disabilities in assessments affects their educational opportunities. Research is currently under way in a few states that have taken the lead with policies to increase participation, but it will be some time before those efforts can provide substantial information.
Greater participation of students with disabilities in large-scale assessments could have both positive and negative effects on aggregated test scores. To some degree, the effects will hinge on the extent to which valid scores can be provided
for individual students with disabilities—for example, by determining which accommodations can contribute to more accurate measurement. On one hand, if rules pertaining to accommodations (or modifications) are too permissive, they may falsely inflate scores for students who should not get the accommodation. This result could provide an escape valve, lessening the pressure on educators to bring students with disabilities up to the performance standards imposed on the general education population. On the other hand, policies that guide educators toward providing appropriate accommodations in both assessment and instruction could improve the validity of scores for students with disabilities. Linking accommodations in assessment and instruction—for example, by requiring, as Kentucky does, that accommodations be provided in the state's large-scale assessment only if they are also offered in ongoing instruction—may help limit inappropriate accommodation in assessment and encourage appropriate instructional accommodation. Evidence on the effects of these policies, however, is still lacking.
Decisions about participation and accommodations will need to be linked to decisions about reporting and, ultimately, accountability. Keeping track of who is included in the data being reported and under what conditions will be of central importance to ensuring fair comparisons between aggregates. Current decisions about which students with disabilities will participate in assessments are made inconsistently from place to place. This variation makes comparisons between two districts problematic if, for example, one has excluded only 2 percent of its students, and the other has excluded 10 percent. In addition to making results noncomparable from place to place, high rates of exclusion create an incomplete, and possibly inaccurate, view of student performance. For example, a recent study of four states with widely different exclusion rules for the 1994 NAEP reading assessment was conducted by the National Academy of Education (1996). The study found that applying a consistent rule for excluding students with low reading levels increased the number of participating students with disabilities by an average of 4.3 percent in each state; furthermore, when these students were included in the reporting, the mean fourth grade NAEP reading scores were somewhat lower. The size of the decrease varied from state to state (ranging from 1.5 to 3.1 points on the NAEP scale); predictably, the lowest decrease occurred for the state that was already including more students with disabilities. Reporting participation rates of students with disabilities in a consistent and systematic manner is important if comparisons are to be made fairly. Increased participation rates could also contribute to a more accurate description of student performance.
If greater participation of students with disabilities is achieved through the use of highly permissive policies about accommodations, the aggregated results may not be accurate, either. For example, the 1995 NAEP field test results suggested that a combination of stricter rules for exclusion and permissive rules about accommodations led some schools to use accommodations for students who could have participated without them (Phillips, 1995). Although empirical evidence is limited, it has been suggested (as reviewed earlier) that some accommodations may inflate scores for some students. If accommodations are offered to a number of students who do not really need them, their scores may be artificially inflated, offering an overly optimistic view of progress. Parents, teachers, and schools clearly need meaningful information and do not want to become falsely complacent about the progress of students with disabilities. Careful policies about what accommodations can be offered and to whom are important, as is keeping track of who has been tested with what accommodations.
If students with disabilities are to gain any benefits from standards-based reform, the education system must be held publicly accountable for every student's performance. Although the IEP will remain the primary accountability tool for individual students with disabilities, the quality of their learning should also count in judgments about the overall performance of the education system. Without such public accounting, schools have little incentive to expand the participation of students with disabilities in the common standards. Therefore, regardless of the different ways that students with disabilities may be assessed, they should be accounted for in data about system performance.
The presumption should be that all students will participate in assessments associated with standards-based reform. Assessments not only serve as the primary basis of accountability but are also likely to remain the cornerstone, and often the most well-developed component, of the standards movement. The decision to exclude a student from participation in the common assessment should be made and substantiated on a case-by-case basis, rather than through blanket exclusions on the basis of categories of disability, and should rest on a comparison of the student's curriculum and educational goals with those measured by the assessment program.
Existing data are inadequate to determine participation rates for students with disabilities in extant assessments associated with standards-based reform or to track the assessment accommodations they have received. What few data do exist suggest considerable variability in participation rates among states and among local educational agencies within states. Policies pertaining to assessment accommodations also vary markedly from one state to another, and there is little information indicating the consistency with which local practitioners in a given state apply those guidelines. Variability in participation rates and accommodations threatens the comparability of scores, can distort trends over time as well as comparisons among students, schools, or districts, and therefore undermines the use of scores for accountability.
Significant participation of students with disabilities in standards-based reform requires that their needs and abilities be taken into account in establishing standards, setting performance levels, and selecting appropriate assessments.
Mere participation in existing assessments falls short of providing useful information about the achievement of students with disabilities or of ensuring that schools are held accountable for their progress. Assessments associated with standards-based reform should be designed to be informative about the achievement of all students, including those with low-incidence, severe disabilities whose curriculum requires that they be assessed with an alternate testing instrument. Adhering to sound assessment practices will go a long way toward reaching this goal. In particular, task selection and scoring criteria need to accommodate varying levels of performance. However, it may also prove essential that the development of standards and assessments be informed by knowledge about students with disabilities. Representatives of students with disabilities should be included in the process of establishing standards and assessments.
Assessment accommodations should be used only to offset the impact of disabilities and should be justified on a case-by-case basis. Used appropriately, accommodations should be an effort to improve the validity of scores by removing the distortions or biases caused by disabilities. In some instances, accommodations may also permit inclusion of students who otherwise would not be able to participate in an assessment; for example, braille editions of tests permit the assessment of blind students who would otherwise be excluded. Although accommodations will often raise scores, raising scores per se is not their purpose, and it is inappropriate to use them merely to raise scores. Research on the effects of accommodations, although limited, is sufficient to raise concerns about the potential effects of excessive or poorly targeted accommodations.
The meaningful participation of students with disabilities in large-scale assessments and compliance with the legal rights of individuals with disabilities in some instances require steps that are beyond current knowledge and technology. For example, regulations implementing the IDEA and Section 504 require that tests and other evaluation materials must be validated for the specific purpose for which they are used. Individuals with disabilities are also entitled to "reasonable" accommodations and adaptations that do not fundamentally alter the content being tested. Even in the case of traditional assessments, testing experts do not yet know how to meet these two requirements for many individuals with disabilities, particularly those with cognitive disabilities that are related to measured constructs. Moreover, the nature of assessments associated with standards-based reform is in flux. The validity of new forms of assessment has not yet been adequately determined for students in general, and we have even less evidence available for students with disabilities, particularly when testing accommodations are provided.
A critical need exists for research and development on assessments associated with standards-based reform generally, and on the participation of students with disabilities in particular. The recent development of assessments associated with standards-based reform, combined with the existence of legal rights governing the education of students with disabilities, has required that state education
agencies, local education agencies, and local school personnel design and implement assessment procedures that in some cases are beyond the realm of existing, expert knowledge. The sooner the research base can match the demands of policy, the more likely that students with disabilities can participate meaningfully in standards-based assessments.