12
Findings and Recommendations

The Congress asked the National Academy of Sciences to "conduct a study and make written recommendations on appropriate methods, practices and safeguards to ensure that—

  1. existing and new tests that are used to assess student performance are not used in a discriminatory manner or inappropriately for student promotion, tracking or graduation; and
  2. existing and new tests adequately assess student reading and mathematics comprehension in the form most likely to yield accurate information regarding student achievement of reading and mathematics skills."
Congressional interest in this subject stems from the widespread movement in the United States for standards-based school reform, from the consideration of voluntary national tests, and from the increased reliance on achievement tests for various forms of accountability: for school systems, individual schools, administrators, teachers, and students. Moreover, there are sustained high levels of public support for high-stakes testing of individual students, even if it would lead to lower rates of promotion and high school graduation (Johnson and Immerwahr, 1994; Hochschild and Scott, 1998).



Because large-scale testing is increasingly used for high-stakes purposes to make decisions that significantly affect the life chances of individual students, the Congress has asked the National Academy of Sciences, through its National Research Council, for guidance in the appropriate and nondiscriminatory use of such tests.

This study focuses on tests that, by virtue of their use for promotion, tracking, or graduation, have high stakes for individual students. The committee recognizes that accountability for students is related in important ways to accountability for educators, schools, and school districts. This report does not address accountability at those other levels, apart from the issue of participation of all students in large-scale assessments. The report is intended to apply to all schools and school systems in which tests are used for student promotion, tracking, or graduation.

Test form (as mentioned in the second part of the congressional mandate) could refer to a wide range of issues, including, for example, the balance of multiple-choice and constructed-response items, the use of student portfolios, the length and timing of the test, the availability of calculators or manipulatives, and the language of administration. However, in considering test form, the committee has chosen to focus on the needs of English-language learners and students with disabilities, in part because these students may be particularly vulnerable to the negative consequences of large-scale assessments. We consider, for these students, in what form and manner a test is most likely to measure accurately a student's achievement of reading and mathematics skills.

Two policy objectives are key for these special populations. One is to increase their participation in large-scale assessments, so that school systems can be held accountable for their educational progress. The other is to test each such student in a manner that accommodates for a disability or limited English proficiency to the extent that either is unrelated to the subject matter being tested, while still maintaining the validity and comparability of test results among all students. These objectives are in tension, and thus present serious technical and operational challenges to test developers and users.

Assessing the Uses of Tests

In its deliberations the committee has assumed that the use of tests in decisions about student promotion, tracking, or graduation is intended to serve educational policy goals, such as setting high standards for student learning, raising student achievement levels, ensuring equal educational opportunity, fostering parental involvement in student learning, and increasing public support for the schools.

Determining whether the use of tests for student promotion, tracking, or graduation produces better overall educational outcomes requires that the various intended benefits of high-stakes test use be weighed against unintended negative consequences for individual students and groups of students. The costs and benefits of testing should also be balanced against those of making high-stakes decisions about students in other ways, using criteria other than test scores; decisions about tracking, promotion, and graduation will be made with or without information from standardized tests. The committee recognizes that test use may have negative consequences for individual students even while serving important social or educational policy purposes. We believe that the development of a comprehensive testing policy should be sensitive to the balance among individual and collective benefits and costs.

The committee follows an earlier work by the National Research Council (1982) in adopting a three-part framework for determining whether a planned or actual test use is appropriate. The three principal criteria are (1) measurement validity—whether a test is valid for a particular purpose and the constructs measured have been correctly chosen; (2) attribution of cause—whether a student's performance on a test reflects knowledge and skill based on appropriate instruction or is attributable to poor instruction or to such factors as language barriers or construct-irrelevant disabilities; and (3) effectiveness of treatment—whether test scores lead to placements and other consequences that are educationally beneficial.

This framework leads us to emphasize several basic principles of appropriate test use. First, the important thing about a test is not its validity in general, but its validity when used for a specific purpose. Thus, tests that are useful in leading the curriculum or in school accountability are not appropriate for use in making high-stakes decisions about individual student mastery unless the curriculum, the teaching, and the tests are aligned.

Second, tests are not perfect. Test questions are a sample of possible questions that could be asked in a given area. Moreover, a test score is not an exact measure of a student's knowledge or skills. A student's score can be expected to vary across different versions of a test—within a margin of error determined by the reliability of the test—as a function of the particular sample of questions asked and/or transitory factors, such as the health of the student on the day of the test.
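The "margin of error" mentioned above can be made concrete with the standard error of measurement from classical test theory. The sketch below is ours, not the committee's; the reporting scale, reliability, and student score are hypothetical numbers chosen only for illustration.

    import math

    def standard_error_of_measurement(sd: float, reliability: float) -> float:
        """Classical test theory: SEM = SD * sqrt(1 - reliability)."""
        return sd * math.sqrt(1.0 - reliability)

    # Hypothetical reporting scale: standard deviation 40, reliability 0.90.
    sem = standard_error_of_measurement(sd=40.0, reliability=0.90)   # about 12.6 points

    # A roughly 95 percent band around one hypothetical observed score.
    observed = 248
    low, high = observed - 2 * sem, observed + 2 * sem
    print(f"SEM = {sem:.1f}; plausible range = {low:.0f} to {high:.0f}")

On these assumed numbers, an observed score of 248 is statistically indistinguishable from scores roughly 25 points above or below it, which is one reason the committee cautions against treating any single score as exact.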

Third, an educational decision that will have a major impact on a test taker should not solely or automatically be made on the basis of a single test score. Other relevant information about the student's knowledge and skills should also be taken into account.

Finally, neither a test score nor any other kind of information can justify a bad decision. For example, research shows that tracking, as typically practiced, harms students placed in low-track classes. In the absence of better treatments, better tests will not lead to better educational outcomes.

Throughout the report, the committee has considered how these principles apply to the appropriate use of tests in decisions about tracking, promotion, and graduation and to possible uses of the proposed voluntary national tests.

Blanket criticisms of testing and assessment are not justified. When tests are used in ways that meet relevant psychometric, legal, and educational standards, students' scores provide important information that, combined with information from other sources, can lead to decisions that promote student learning and equality of opportunity (Office of Technology Assessment, 1992). For example, tests can identify learning differences among students that the education system needs to address. Because decisions about tracking, promotion, and graduation will be made with or without testing, proposed alternatives to testing should be at least equally accurate, efficient, and fair.

It is also a mistake to accept observed test scores as either infallible or immutable. When test use is inappropriate, especially in the case of high-stakes decisions about individuals, it can undermine the quality of education and equality of opportunity. For example, it is wrong to suggest that the lower achievement test scores of racial and ethnic minorities and students from low-income families reflect inalterable realities of American society.[1] Such scores reflect persistent inequalities in American society and its schools, and the inappropriate use of test scores can legitimate and reinforce these inequalities. This lends a special urgency to the requirement that test use in connection with tracking, promotion, and graduation should be appropriate and fair.

With respect to the use of tests in making high-stakes decisions about students, the committee concludes that statements about the benefits and harms of testing often go beyond what the evidence will support.

[1] For recent evidence of major changes in group differences in test scores, see Hauser (1998), Grissmer et al. (1998), Huang and Hauser (1998), and Ceci et al. (1998).

Cross-Cutting Themes

In important ways, educational decisions about tracking, promotion, and graduation are different from one another. They differ most importantly in the role that mastery of past material and readiness for new material play as decision-making criteria and in the importance of beneficial educational placement relative to certification as consequences of the decision. Thus, we have considered the role of large-scale, high-stakes testing separately in relation to each type of decision. However, tracking, promotion, and graduation also share common features that pertain to appropriate test use and to their educational and social consequences. These include the alignment between testing and the curriculum, the social and economic sorting that follows from the decisions, the range of educational options potentially linked to the decisions, the use of multiple sources of evidence, the use of tests among young children, and improper manipulation of test score outcomes for groups or individuals. Even though we also raise some of these issues in connection with specific decisions, each of them cuts across two or more types of decisions. We therefore discuss them jointly in this section before turning separately to the use of tests in tracking, promotion, and graduation decisions.

It is a mistake to begin educational reform by introducing tests with high stakes for individual students. If tests are to be used for high-stakes decisions about individual mastery, such use should follow implementation of changes in teaching and curriculum that ensure that students have been taught the knowledge and skills on which they will be tested. Some school systems are already doing this by planning a gap of several years between the introduction of new tests and the attachment of high stakes to individual student performance, during which schools may achieve the necessary alignment among tests, curriculum, and instruction. Others may see high-stakes student testing as a way of leading curricular reform, not recognizing the danger that a test may lack the "instructional validity" required by law (Debra P. v. Turlington, 1981)—that is, a close correspondence between test content and instructional content.

To the extent that all students are expected to meet "world-class" standards, there is a need to provide world-class curricula and instruction to all students. However, in most of the nation, much needs to be done before a world-class curriculum and world-class instruction will be in place (National Academy of Education, 1996).

At present, curriculum does not usually place sufficient emphasis on student understanding and application of concepts, as opposed to memorization and skill mastery. In addition, instruction in core subjects typically has been and remains highly stratified. What teachers teach and what students learn vary widely by track, with those in lower tracks receiving far less than a world-class curriculum. If world-class standards were suddenly adopted, student failure would be unacceptably high (Linn, 1998a).

Recommendation: Accountability for educational outcomes should be a shared responsibility of states, school districts, public officials, educators, parents, and students. High standards cannot be established and maintained merely by imposing them on students.

Recommendation: If parents, educators, public officials, and others who share responsibility for educational outcomes are to discharge their responsibility effectively, they should have access to information about the nature and interpretation of tests and test scores. Such information should be made available to the public and should be incorporated into teacher education and into educational programs for principals, administrators, public officials, and others.

Recommendation: A test may appropriately be used to lead curricular reform, but it should not also be used to make high-stakes decisions about individual students until test users can show that the test measures what they have been taught.

The consequences of high-stakes testing for individual students are often posed as either-or propositions, but this need not be the case. For example, social promotion and simple retention in grade are really only two of many educational strategies available to educators when test scores and other information indicate that students are experiencing serious academic difficulty. Neither social promotion nor retention alone is an effective treatment, and schools can use a number of possible strategies to reduce the need for these either-or choices—for example, by coupling early identification of such students with effective remedial education. Similar observations hold for decisions about tracking and about high school graduation.

Recommendation: Test users should avoid simple either-or options when high-stakes tests and other indicators show that students are doing poorly in school, in favor of strategies combining early intervention and effective remediation of learning problems.

Large-scale assessments are used widely to make high-stakes decisions about students, but they are most often used in combination with other information, as recommended by the major professional and scientific organizations in testing (American Educational Research Association et al., 1985, 1998; Joint Committee on Testing Practices, 1988). For example, according to a recent survey, teacher-assigned grades, standardized tests, developmental factors, attendance, and teacher recommendations form the evidence on which most school districts say that they base promotion decisions (American Federation of Teachers, 1997). A test score, like any other source of information about a student, is not exact. It is an estimate of the student's understanding or mastery of the content that a test was intended to measure.

Recommendation: High-stakes decisions such as tracking, promotion, and graduation should not automatically be made on the basis of a single test score but should be buttressed by other relevant information about the student's knowledge and skills, such as grades, teacher recommendations, and extenuating circumstances. An illustrative sketch of combining such evidence appears at the end of this passage.

Problems of test validity are greatest among young children, and there is a greater risk of error when such tests are employed to make significant educational decisions about children who are less than 8 years old or below grade 3—or about their schools. However, well-designed assessments may be useful in monitoring trends in the educational development of populations of students who have reached age 5 (Shepard et al., 1998).

Recommendation: In general, large-scale assessments should not be used to make high-stakes decisions about students who are less than 8 years old or enrolled below grade 3.

All students are entitled to sufficient test preparation, but it is not proper to expose them ahead of time to items that will actually be used on their test or to give them the answers to those questions.
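The recommendation above, that decisions be buttressed by evidence beyond a single score, can be sketched in code. The sketch is ours, not the committee's; every threshold, field name, and the promotion_review helper itself are hypothetical, and it shows only how a test-score band, grades, and a teacher's judgment might be combined so that no single low score triggers an automatic decision.

    from dataclasses import dataclass

    @dataclass
    class StudentRecord:
        test_score: float         # scale score on the promotion test
        sem: float                # standard error of measurement for that score
        course_average: float     # teacher-assigned grade average, 0-100
        teacher_recommends: bool  # teacher's promotion recommendation

    def promotion_review(s: StudentRecord, cut_score: float) -> str:
        """Return 'promote', 'intervene', or 'review' (illustrative only)."""
        clearly_above = s.test_score - 2 * s.sem >= cut_score
        clearly_below = s.test_score + 2 * s.sem < cut_score
        if clearly_above and s.course_average >= 70:
            return "promote"
        if clearly_below and s.course_average < 70 and not s.teacher_recommends:
            return "intervene"   # flag for remediation planning, not automatic retention
        return "review"          # ambiguous evidence: people decide, with more information

    print(promotion_review(StudentRecord(248, 12.6, 81, True), cut_score=250))  # -> review

In this sketch a score just below the cut point still leads to human review, because the score band and the other evidence point the other way; the committee's recommendations about early intervention and remediation would then govern what happens next.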

Test results may also be invalidated by teaching so narrowly to the objectives of a particular test that scores are raised without actually improving the broader set of academic skills that the test is intended to measure (Koretz et al., 1991). The committee also recognizes that the desirability of "teaching to the test" is affected by test design. For example, it is entirely appropriate to prepare students by covering all the objectives of a test that represents the full range of the intended curriculum. Thus, it is important that test users respect the distinction between genuine remedial education and teaching narrowly to the specific content of a test.

Recommendation: All students are entitled to sufficient test preparation so their performance will not be adversely affected by unfamiliarity with item format or by ignorance of appropriate test-taking strategies. Test users should balance efforts to prepare students for a particular test format against the possibility that excessively narrow preparation will invalidate test outcomes.

There is an inherent conflict of interest when teachers administer high-stakes tests to their own students or score their own students' exams. On one hand, teachers want valid information about how well their students are performing. On the other hand, there is often substantial external pressure on teachers (as well as principals and other school personnel) for their students to earn high scores. This external pressure may lead some teachers to provide inappropriate assistance to their students before and during the test administration or to mis-score exams. The prevalence of such inappropriate practices varies among and within states and schools. Consequently, when there is evidence of a problem, such as from observations or other data, formal steps should be taken to ensure the validity of the scores obtained. This could include having an external monitoring system with sanctions, or having someone external to the school administer the tests and ensure their security, or both. Only in this way can the scores obtained from high-stakes tests be trusted as providing reasonably accurate results regarding student performance.

Members of some minority groups, English-language learners, and students of low socioeconomic status (SES) are overrepresented in lower-track classes and among those denied promotion or graduation on the basis of test scores. Moreover, these same groups of students are underrepresented in high-track classes, "exam" schools, and "gifted and talented" programs (Oakes et al., 1992). In some cases, such as courses for English-language learners, such disproportions are not problematic. We would not expect to find native English speakers in classes designed to teach English to English-language learners. In other circumstances, such disproportions raise serious questions.

For example, although the grade placement of 6-year-olds is similar among boys and girls and among racial and ethnic groups, grade retardation among children cumulates rapidly after age 6, and it occurs disproportionately among males and minority group members. Among children 6 to 8 years old in 1987, 17 percent of white females and 22 percent of black males were enrolled below the modal grade for their age. By ages 9 to 11, 22 percent of white females and 37 percent of black males were enrolled below the modal grade for their age. In 1996, when the same children were 15 to 17 years old, 29 percent of white females and 48 percent of black males were either enrolled below the modal grade level for their age or had dropped out of school (U.S. Bureau of the Census, Current Population Reports, Series P-20). These disproportions are especially disturbing in view of other evidence that grade retention and assignment to low tracks have little educational value. The concentrations of minority students, English-language learners, and low-SES students among those retained in grade, denied high school diplomas, and placed in less demanding classes raise significant questions about the efficacy of schooling and the fairness of major educational decisions, including those made using information from high-stakes tests.

The committee sees a strong need for better evidence on the benefits and costs of high-stakes testing. This evidence should tell us whether the educational consequences of particular decisions are educationally beneficial for students, e.g., by increasing academic achievement or reducing school dropout. It is also important to develop statistical reporting systems of key indicators that will track both intended effects (e.g., higher test scores) and other effects (e.g., changes in dropout or special education referral rates). For example, some parents or educators may improperly seek to classify their students as disabled in order to take advantage of accommodation in high-stakes tests. Indicator systems could include measures such as retention rates, special education identification rates, rates of exclusion from assessment programs, number and type of accommodations, high school completion credentials, dropout rates, and indicators of access to high-quality curriculum and instruction.

Recommendation: High-stakes testing programs should routinely include a well-designed evaluation component. Policymakers should monitor both the intended and unintended consequences of high-stakes assessments on all students and on significant subgroups of students, including minorities, English-language learners, and students with disabilities.
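One building block of the indicator systems described above is a disaggregated rate. As a minimal sketch (ours; the subgroup labels and records are hypothetical), the following code computes one such indicator, the grade-retention rate, separately by student subgroup; the same tallying extends to dropout rates, special education identification rates, rates of exclusion from assessments, and accommodations used.

    from collections import defaultdict

    # Hypothetical student-level records: (subgroup, retained_in_grade_this_year)
    records = [
        ("english_language_learners", True),
        ("english_language_learners", False),
        ("students_with_disabilities", False),
        ("all_other_students", False),
        ("all_other_students", True),
        ("all_other_students", False),
    ]

    def retention_rates(rows):
        """Share of students retained in grade, computed separately for each subgroup."""
        totals, retained = defaultdict(int), defaultdict(int)
        for subgroup, was_retained in rows:
            totals[subgroup] += 1
            retained[subgroup] += was_retained
        return {group: retained[group] / totals[group] for group in totals}

    for group, rate in retention_rates(records).items():
        print(f"{group}: {rate:.0%}")

Tracking such rates before and after a high-stakes testing program is introduced is one way to monitor the unintended consequences the committee describes.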

Appropriate Uses of Tests in Tracking, Promotion, and Graduation

Tracking

The intended purpose of tracking is to place each student in an educational setting that is optimal given his or her knowledge, skills, and interests. Support for tracking stems from a widespread belief that students will perform optimally if they receive instruction in homogeneous classes and schools, in which the pace and nature of instruction are tailored to their achievement levels (Oakes et al., 1992; Gamoran and Weinstein, 1998). The research evidence on this point, however, is unclear (Mosteller et al., 1996).

"Tracking" takes many different forms, including: (1) grouping between classes within a grade level based on perceived achievement or skill level; (2) selection for exam schools or gifted and talented programs; (3) identification for remedial education programs, such as "intervention" schools; and (4) referral for possible placement in special education (Oakes et al., 1992; Mosteller et al., 1996). Tracking is common in American schools, but tracking policies and practices vary, not only from state to state and district to district, but also from school to school. Because tracking policies and procedures are both diverse and decentralized, it is difficult to generalize about the use of tests in tracking.

Research suggests that (1) as a result of tracking, the difference in average achievement of students in different classes in the same school is far greater in the United States than in most other countries (Linn, 1998a); (2) instruction in low-track classes is far less demanding than in high-track classes (Welner and Oakes, 1996; McKnight et al., 1987); (3) students in low-track classes do not have the opportunity to acquire knowledge and skills strongly associated with future success; and (4) many students in low-track classes would acquire such knowledge and skills if placed in more demanding educational settings (Slavin et al., 1996; Levin, 1988; Title I of the Elementary and Secondary Education Act).

Recommendation: As tracking is currently practiced, low-track classes are typically characterized by an exclusive focus on basic skills, low expectations, and the least-qualified teachers. Students assigned to low-track classes are worse off than they would be in other placements. This form of tracking should be eliminated. Neither test scores nor other information should be used to place students in such classes.

Some forms of tracking, such as proficiency-based placement in foreign language classes and other classes for which there is a demonstrated need for prerequisites, may be beneficial. We make no attempt here to enumerate all forms of beneficial tracking. The general criterion of what constitutes beneficial tracking is that a student's placement, in comparison with other available placements, yields the greatest chance that the student will acquire the knowledge and skills strongly associated with future success.

The role that tests play in tracking decisions is an important and subtle issue. Educators consistently report that, whereas test scores are routinely used in making tracking decisions, most within-grade tracking decisions are based not solely on test scores but also on students' prior achievement, teacher judgment, and other factors (White et al., 1996; Delany, 1991; Selvin et al., 1990). Research also suggests that "middle class parents intervene to obtain advantageous positions for their children" regardless of test scores or of teacher recommendations (Lucas, in press:206). Nonetheless, even when test scores are just one factor among several that influence tracking decisions, they may carry undue weight by appearing to provide scientific justification and legitimacy for tracking decisions that such decisions would not otherwise have. Some scholars believe that reliance on test scores increases the disproportionate representation of poor and minority students in low-track classes. However, test use can also play a positive role, as when a relatively high test score serves to overcome a negative stereotype (Lucas, in press). Tests may play an important, even dominant, role in selecting children for exam schools and gifted and talented programs (Kornhaber, 1997), and they also play an important part in the special education evaluation process (Individuals with Disabilities Education Act, 1997).

Although standardized tests are often used in tracking decisions, there is considerable variation in what tests are used. Research suggests that some tests commonly employed for tracking are not valid for this purpose (Darling-Hammond, 1991; Glaser and Silver, 1994; Meisels, 1989; Shepard, 1991) but that other standardized tests are. Although test use varies with regard to tracking, certain test uses are inconsistent with sound psychometric practice and with sound educational policy. These include: using tests not valid for tracking purposes; relying exclusively on test scores in making placement decisions; relying on a test in one subject for placement in other subjects—which in secondary schools may occur indirectly when placement in one class, combined [...]

[...] objectives—obtaining valid information while still testing all English-language learners—create a sizable policy tension for the design of assessment systems, particularly when they involve high stakes.

Recommendation: Systematic research that investigates the impact of specific accommodations on the test performance of both English-language learners and other students is needed. Accommodations should be investigated to see whether they reduce construct-irrelevant sources of variance for English-language learners without disadvantaging other students who do not receive accommodations. The relationship of test accommodations to instructional accommodations should also be studied.

Recommendation: Development and implementation of alternative measures, such as primary-language assessments, should be accompanied by information regarding the validity, reliability, and comparability of scores on primary-language and English assessments. A sufficient number of English-language learners should be included when items are developed and pilot-tested and in the norming of assessments (Hambleton and Kanjee, 1994).

Experts in the assessment of English-language learners might work with test developers to maintain the content difficulty of items while making the language of the instructions as well as actual test items more comprehensible. These modifications would have to be accomplished without making the assessment invalid for other students.

Recommendation: The learning and language needs of English-language learners should be considered during test development.

Various strategies can be used to obtain valid information about the achievement of English-language learners in large-scale assessments. These include native-language assessments and modifications that decrease the English-language load. Such strategies, however, are often employed inconsistently from place to place and from student to student. Monitoring of educational outcomes for English-language learners as a group is needed to determine the intended and unintended consequences of their participation in large-scale assessments.

Recommendation: Policy decisions about how individual English-language learners will participate in large-scale assessments—such as the language and accommodations to be used—should balance the demands of political accountability with professional standards of good testing practice. These standards require evidence that such accommodations or alternate forms of assessment lead to valid inferences regarding performance.

Recommendation: States, school districts, and schools should report and interpret disaggregated assessment scores of English-language learners when psychometrically sound for the purpose of analyzing their educational outcomes.

In addition, the role of the test score in decision making needs careful consideration when its meaning is uncertain. For example, invalid low scores on the test may lead to inappropriate placement in treatments that have not been demonstrated to be effective. Multiple sources of information should be used to supplement test score data obtained from large-scale assessment of students who are not language proficient, particularly when decisions will be made about individual students on the basis of the test (American Educational Research Association et al., 1985).

Recommendation: Placement decisions based on tests should incorporate information about educational accomplishments, particularly literacy skills, in the primary language.

Certification tests (e.g., for high school graduation) should be designed to reflect state or local deliberations and decisions about the role of English-language proficiency in the construct to be assessed. This allows for subject-matter assessment in English only, in the primary language, or using a test that accommodates English-language learners by providing English-language assistance, primary language support, or both.

Recommendation: As for all learners, interpretation of the test scores of English-language learners for promotion or graduation should be accompanied by information about opportunities to master the material tested. For English-language learners, this includes information about educational history, exposure to instruction in the primary language and in English, language resources in the home, and exposure to the mainstream curriculum.

Potential Strategies for Promoting Appropriate Test Use

The two existing mechanisms for promoting and enforcing appropriate test use—professional standards and legal enforcement—are important but inadequate. The Joint Standards and the Code of Fair Testing Practices in Education, the ethical codes of the testing profession, are written in broad terms and are not always easy to interpret in particular situations. In addition, enforcement of the Joint Standards and the Code depends chiefly on professional judgment and goodwill. Moreover, professional self-regulation does not cover the behavior of individuals outside the testing profession. Many users of educational test results—policymakers, school administrators, and teachers—are unaware of the Joint Standards and are untrained in appropriate test use (Office of Technology Assessment, 1992).

Litigation, the other existing mechanism, also has limitations. Most of the pertinent statutes and regulations protect only certain groups of students, and most court decisions are not binding everywhere. Court decisions in different jurisdictions sometimes contradict one another (Larry P. v. Riles, 1984; Parents in Action on Special Education v. Hannon, 1980). Some courts insist that educators observe the principles of test use in the Joint Standards (Office of Technology Assessment, 1992:73–74) and others do not (United States v. South Carolina, 1977). And court challenges are often expensive, divisive, and time-consuming. In sum, federal law is a patchwork of rules rather than a coherent set of norms governing proper test use, and enforcement is similarly uneven.

The committee has explored four possible alternative mechanisms that have been applied to problems similar to that of improper test use and about which empirical literature exists. It offers these as alternatives, some less coercive and others more so, that could supplement professional standards and litigation as means of promoting and enforcing appropriate test use.

• Deliberative forums: In these forums, citizens would meet with policymakers to discuss and make important decisions about testing. In this model, all participants have equal standing and are more likely to accept decisions, even those with which they disagree, because they feel that they have had an opportunity to influence the outcome. All parties with a stake in assessments would be represented.

Such discussions could help define what constitutes "educational quality" and "achievement to high standards," the role that tests should play in shaping and measuring progress toward those goals, and the level of measurement error that is acceptable where test scores are used in making high-stakes decisions about students. Broad public interest in testing makes this a good time to consider the establishment of such forums. We note, however, the importance of considering potential limitations of this strategy, including: a scarcity of successful examples, the reluctance of those with authority to part with it, and the large amounts of time and patience it would require.

• An independent oversight body: George Madaus and colleagues have proposed creating an independent organization to monitor and audit high-stakes testing programs (Madaus, 1992; Madaus et al., 1997). It would not have regulatory powers but would provide information to the public about tests and their use, highlighting best practices in testing. It could supplement a labeling strategy (see below) by educating policymakers, practitioners, and the public about test practice. It could deter inappropriate test use by creating adverse publicity (Ernest House, personal communication, 1998). The shortcomings of this proposal include the monitoring body's lack of formal authority to require test publishers or school administrators to submit testing programs for review. Similarly, test users would be under no obligation to accept the body's judgments. It will be important for policymakers interested in the work of such a body to prevent unintended negative consequences.

• Labeling: Test producers could be required to report to test users about the appropriate uses and limitations of their tests. A second target of a labeling strategy would be test consumers: parents, students, the public, and the media. Relevant information could include the purpose of the test, intended uses of individuals' scores, consequences for individual students, steps taken to validate the test for its intended use, evidence that the test measures what students have been taught, other information used with test scores to make decisions about individual students, and options for questioning decisions based on test scores. Limitations of this strategy include limited data on its effectiveness, the obstacles many parents face when they seek to challenge policies and actions with which they disagree, and the ineffectiveness of test labeling when the real problem is poor instruction rather than improper test use.

• Federal regulation: Federal statutes could be amended to include standards of appropriate test use. Title I regulations could be revised to ensure that large-scale assessments comply with established professional standards. State Title I plans could address the extent to which state and local assessment systems meet these professional norms. Title VI of the Civil Rights Act of 1964 and Title IX of the Education Amendments of 1972 prohibit federal fund recipients from discriminating on the basis of race, national origin, or sex; both have been cited in disputes about tests that carry high stakes for students. Under existing regulations, when a test has disproportionate adverse impact, the recipient of federal funds must demonstrate that the test and its use are an "educational necessity." Federal regulations do not, however, define this term. Thus, federal regulations could define educational necessity in terms of compliance with professional testing standards. The advantages of this strategy would include its applying to all 50 states and virtually all school districts. Federal regulations could also be a powerful tool for educating policymakers and the public about appropriate test use. And relying on them would make use of existing administrative and judicial mechanisms to promote adherence to testing standards that are rarely enforced. The risks of the regulatory approach are that the sanctions available for failure to comply—cutting off federal aid—make it unwieldy. Moreover, there is political resistance to federal regulation, creating the risk of a backlash that would make it more difficult for the U.S. Department of Education to guide local practice in testing and other areas. This strategy would also be subject to many of the usual disadvantages of administrative and judicial enforcement.

The committee is not recommending adoption of any particular strategy or combination of strategies, nor does it suggest that these four approaches are the only possibilities. We do think, however, that ensuring proper test use will require multiple strategies. Given the inadequacy of current methods, practices, and safeguards, further research is needed on these and other policy options to illuminate their possible effects on test use. In particular, we would suggest empirical research on the effects of these strategies, individually and in combination, on testing products and practice, and an examination of the associated potential benefits and risks.

References

American Educational Research Association, American Psychological Association, and National Council on Measurement in Education. 1985. Standards for Educational and Psychological Testing. Washington, DC: American Psychological Association.
American Educational Research Association, American Psychological Association, and National Council on Measurement in Education. 1998. Draft Standards for Educational and Psychological Testing. Washington, DC: American Psychological Association.
American Federation of Teachers. 1997. Passing on Failure: District Promotion Policies and Practices. Washington, DC: American Federation of Teachers.
Anderson, D.K. 1994. Paths Through Secondary Education: Race/Ethnic and Gender Differences. Unpublished doctoral thesis, University of Wisconsin-Madison.
Bishop, John H. 1997. Do Curriculum-Based External Exit Exam Systems Enhance Student Achievement? New York: Consortium for Policy Research in Education and Center for Advanced Human Resource Studies, Cornell University.
Bond, Linda A., and Diane King. 1995. State High School Graduation Testing: Status and Recommendations. North Central Regional Educational Laboratory.
Catterall, James S. 1990. A Reform Cooled-Out: Competency Tests Required for High School Graduation. CSE Technical Report 320. UCLA Center for Research on Evaluation, Standards, and Student Assessment.
Cawthorne, John E. 1990. "Tough" Graduation Standards and "Good" Kids. Chestnut Hill, MA: Boston College, Center for the Study of Testing, Evaluation and Educational Policy.
Ceci, Stephen J., Tina B. Rosenblum, and Matthew Kumpf. 1998. The shrinking gap between high- and low-scoring groups: Current trends and possible causes. Pp. 287–302 in The Rising Curve, Ulric Neisser, ed. Washington, DC: APA Books.
Clinton, W.J. 1998. Memorandum to the Secretary of Education. Press release. Washington, DC: The White House.
Council of Chief State School Officers. 1998. Survey of State Student Assessment Programs. Washington, DC: Council of Chief State School Officers.
Darling-Hammond, L. 1991. The implications of testing policy for quality and equality. Phi Delta Kappan 73(3):220–225.
Darling-Hammond, L., and A. Wise. 1985. Beyond standardization: State standards and school improvement. Elementary School Journal 85(3).
Delany, B. 1991. Allocation, choice, and stratification within high schools: How the sorting machine copes. American Journal of Education 99(2):181–207.

Gamoran, A. 1988. A Multi-level Analysis of the Effects of Tracking. Paper presented at the annual meeting of the American Sociological Association, Atlanta, GA.
Gamoran, A., and M. Weinstein. 1998. Differentiation and opportunity in restructured schools. American Journal of Education 106:385–415.
Glaser, R., and E. Silver. 1994. Assessment, Testing, and Instruction: Retrospect and Prospect. Los Angeles, CA: National Center for Research on Evaluation, Standards, and Student Testing.
Grissmer, David, Stephanie Williamson, Sheila N. Kirby, and Mark Berends. 1998. Exploring the rapid rise in black achievement scores in the United States (1970–1990). Pp. 251–285 in The Rising Curve, Ulric Neisser, ed. Washington, DC: APA Books.
Grissom, J.B., and L.A. Shepard. 1989. Repeating and dropping out of school. Pp. 34–63 in Flunking Grades: Research and Policies on Retention, L.A. Shepard and M.L. Smith, eds. London: Falmer Press.
Hambleton, R.K., and A. Kanjee. 1994. Enhancing the validity of cross-cultural studies: Improvements in instrument translation methods. In International Encyclopedia of Education (2nd Ed.), T. Husen and T.N. Postlewaite, eds. Oxford, UK: Pergamon Press.
Hauser, R.M. 1997. Indicators of high school completion and dropout. Pp. 152–184 in Indicators of Children's Well-Being, R.M. Hauser, B.V. Brown, and W.R. Prosser, eds. New York: Russell Sage Foundation.
Hauser, R.M. 1998. Trends in black-white test score differentials: 1. Uses and misuses of NAEP/SAT data. Pp. 219–249 in The Rising Curve, Ulric Neisser, ed. Washington, DC: APA Books.
Hochschild, J., and B. Scott. 1998. Trends: Governance and reform of public education in the United States. Public Opinion Quarterly 62(1):79–120.
Holmes, C.T. 1989. Grade level retention effects: A meta-analysis of research studies. Pp. 16–33 in Flunking Grades: Research and Policies on Retention, L.A. Shepard and M.L. Smith, eds. London: Falmer Press.
Hoover, H.D., A.N. Hieronymus, D.A. Frisbie, et al. 1994. Interpretive Guide for School Administrators: Iowa Test of Basic Skills, Levels 5–14. University of Iowa: Riverside Publishing Company.
Huang, Min-Hsiung, and Robert M. Hauser. 1998. Trends in black-white test-score differentials: 2. The WORDSUM Vocabulary Test. Pp. 303–332 in The Rising Curve, Ulric Neisser, ed. Washington, DC: APA Books.
Jaeger, Richard M. 1989. Certification of student competence. In Educational Measurement, 3rd Ed., Robert L. Linn, ed. New York: Macmillan.

Johnson, J., and J. Immerwahr. 1994. First Things First: What Americans Expect from the Public Schools. New York: Public Agenda.
Joint Committee on Testing Practices. 1988. Code of Fair Testing Practices in Education. Washington, DC: National Council on Measurement in Education.
Koretz, D.M., R.L. Linn, S.B. Dunbar, and L.A. Shepard. 1991. The Effects of High-Stakes Testing on Achievement: Preliminary Findings About Generalization Across Tests. Paper presented at the annual meeting of the American Educational Research Association and the National Council on Measurement in Education, Chicago, IL (April).
Kornhaber, M. 1997. Seeking Strengths: Equitable Identification for Gifted Education and the Theory of Multiple Intelligences. Doctoral dissertation, Harvard Graduate School of Education.
Kreitzer, A.E., F. Madaus, and W. Haney. 1989. Competency testing and dropouts. Pp. 129–152 in Dropouts from School: Issues, Dilemmas and Solutions, L. Weis, E. Farrar, and H.G. Petrie, eds. Albany: State University of New York Press.
Levin, H. 1988. Accelerated Schools for At-risk Students. New Brunswick, NJ: Center for Policy Research in Education.
Linn, Robert. 1998a. Assessments and Accountability. Paper presented at the annual meeting of the American Educational Research Association, April, San Diego.
Linn, Robert. 1998b. Validating inferences from National Assessment of Educational Progress achievement-level setting. Applied Measurement in Education 11(1):23–47.
Lucas, S. In press. Stratification Stubborn and Submerged: Inequality in School After the Unremarked Revolution. New York: Teachers College Press.
Luppescu, S., A.S. Bryk, P. Deabster, et al. 1995. School Reform, Retention Policy, and Student Achievement Gains. Chicago, IL: Consortium on Chicago School Research.
Madaus, G.F., W. Haney, K.B. Newton, and A.E. Kreitzer. 1993. A Proposal for a Monitoring Body for Tests Used in Public Policy. Boston: Center for the Study of Testing, Evaluation, and Public Policy.
Madaus, G.F., W. Haney, K.B. Newton, and A.E. Kreitzer. 1997. A Proposal to Reconstitute the National Commission on Testing and Public Policy as an Independent, Monitoring Agency for Educational Testing. Boston: Center for the Study of Testing, Evaluation and Educational Policy.
Madaus, G.F., and T. Kellaghan. 1991. Examination Systems in the European Community: Implications for a National Examination System in the U.S. Paper prepared for the Science, Education and Transportation Program, Office of Technology Assessment, U.S. Congress, Washington, DC.

McKnight, C.C., F.J. Crosswhite, J.A. Dossey, E. Kifer, S.O. Swafford, K. Travers, and T.J. Cooney. 1987. The Underachieving Curriculum: Assessing U.S. School Mathematics from an International Perspective. Champaign, IL: Stipes Publishing.
Meisels, S.J. 1989. Testing, Tracking, and Retaining Young Children: An Analysis of Research and Social Policy. Commissioned paper for the National Center for Education Statistics.
Mosteller, F., R. Light, and J. Sachs. 1996. Sustained inquiry in education: Lessons from skill grouping and class size. Harvard Educational Review 66(4):797–843.
National Academy of Education. 1996. Quality and Utility: The 1994 Trial State Assessment in Reading, Robert Glaser, Robert Linn, and George Bohrnstedt, eds. Panel on the Evaluation of the NAEP Trial State Assessment. Stanford, CA: National Academy of Education.
National Research Council. 1982. Placing Children in Special Education: A Strategy for Equity, K.A. Heller, W.H. Holtzman, and S. Messick, eds. Committee on Child Development Research and Public Policy, National Research Council. Washington, DC: National Academy Press.
National Research Council. 1997. Educating One and All: Students with Disabilities and Standards-Based Reform, L.M. McDonnell, M.L. McLaughlin, and P. Morison, eds. Committee on Goals 2000 and the Inclusion of Students with Disabilities, Board on Testing and Assessment. Washington, DC: National Academy Press.
National Research Council and Institute of Medicine. 1997. Improving Schooling for Language-Minority Children, Diane August and Kenji Hakuta, eds. Board on Children, Youth, and Families. Washington, DC: National Academy Press.
O'Day, Jennifer A., and Marshall S. Smith. 1993. Systemic reform and educational opportunity. In Designing Coherent Educational Policy, Susan H. Fuhrman, ed. San Francisco: Jossey-Bass.
Oakes, J., A. Gamoran, and R. Page. 1992. Curriculum differentiation: Opportunities, outcomes, and meanings. In Handbook of Research on Curriculum, P. Jackson, ed. New York: Macmillan.
Office of Technology Assessment. 1992. Testing in American Schools: Asking the Right Questions. OTA-SET-519. Washington, DC: U.S. Government Printing Office.
Olson, J.F., and A.A. Goldstein. 1997. The Inclusion of Students with Disabilities and Limited English Proficient Students in Large-scale Assessments: A Summary of Recent Progress. NCES 97–482. Washington, DC: U.S. Department of Education, Office of Educational Research and Improvement.
Phillips, S.E. 1993. Testing accommodations for disabled students. Education Law Reporter 80:9–32.

Phillips, S.E. 1994. High-stakes testing accommodations: Validity versus disabled rights. Applied Measurement in Education 7(2):93–120.
Reardon, Sean F. 1996. Eighth Grade Minimum Competency Testing and Early High School Dropout Patterns. Paper presented at the annual meeting of the American Educational Research Association, New York, April.
Selvin, M.J., J. Oakes, S. Hare, K. Ramsey, and D. Schoeff. 1990. Who Gets What and Why: Curriculum Decisionmaking at 3 Comprehensive High Schools. Santa Monica, CA: Rand.
Shepard, L.A. 1991. Negative policies for dealing with diversity: When does assessment and diagnosis turn into sorting and segregation? In Literacy for a Diverse Society: Perspectives, Practices and Policies, E. Hiebert, ed. New York: Teachers College Press.
Shepard, L., et al. 1993. Evaluating test validity. Review of Research in Education 19:405–450.
Shepard, L., S. Kagan, and E. Wurtz, eds. 1998. Principles and Recommendations for Early Childhood Assessments. Washington, DC: National Education Goals Panel.
Shepard, L.A., and M.L. Smith. 1989. Academic and emotional effects of kindergarten retention in one school district. Pp. 79–107 in Flunking Grades: Research and Policies on Retention, L.A. Shepard and M.L. Smith, eds. London: Falmer Press.
Slavin, R.E., et al. 1996. Every Child, Every School: Success for All. Thousand Oaks, CA: Corwin Press.
Thurlow, M.L., J.E. Ysseldyke, and B. Silverstein. 1993. Testing Accommodations for Students with Disabilities: A Review of the Literature. Synthesis Report 4. Minneapolis, MN: National Center on Educational Outcomes, University of Minnesota.
Welner, K.G., and J. Oakes. 1996. (Li)Ability grouping: The new susceptibility of school tracking systems to legal challenges. Harvard Educational Review 66(3):451–470.
White, P., A. Gamoran, J. Smithson, and A. Porter. 1996. Upgrading the high school math curriculum: Math course-taking patterns in seven high schools in California and New York. Educational Evaluation and Policy Analysis 18(4):285–307.
Willingham, W.W. 1988. Introduction. Pp. 1–16 in Testing Handicapped People, W.W. Willingham, M. Ragosta, R.E. Bennett, H. Braun, D.A. Rock, and D.E. Powers, eds. Boston, MA: Allyn and Bacon.

Legal References

Debra P. v. Turlington, 474 F. Supp. 244 (M.D. Fla. 1979); aff'd in part and rev'd in part, 644 F.2d 397 (5th Cir. 1981); rem'd, 564 F. Supp. 177 (M.D. Fla. 1983); aff'd, 730 F.2d 1405 (11th Cir. 1984).

Goals 2000: Educate America Act, 20 U.S.C. sections 5801 et seq.
Improving America's Schools Act of 1994.
Individuals with Disabilities Education Act, 20 U.S.C. sections 1401 et seq.
Larry P. v. Riles, 495 F. Supp. 926 (N.D. Cal. 1979); aff'd, 793 F.2d 969 (9th Cir. 1984).
Parents in Action on Special Education (PASE) v. Hannon, 506 F. Supp. 831 (N.D. Ill. 1980).
Title I, Elementary and Secondary Education Act, 20 U.S.C. sections 6301 et seq.
United States v. South Carolina