Findings and Recommendations
The Congress asked the National Academy of Sciences to "conduct a study and make written recommendations on appropriate methods, practices and safeguards to ensure that—
- existing and new tests that are used to assess student performance are not used in a discriminatory manner or inappropriately for student promotion, tracking or graduation; and
- existing and new tests adequately assess student reading and mathematics comprehension in the form most likely to yield accurate information regarding student achievement of reading and mathematics skills."
Congressional interest in this subject stems from the widespread movement in the United States for standards-based school reform, from the consideration of voluntary national tests, and from the increased reliance on achievement tests for various forms of accountability: for school systems, individual schools, administrators, teachers, and students. Moreover, there are sustained high levels of public support for high-stakes testing of individual students, even if it would lead to lower rates of promotion and high school graduation (Johnson and Immerwahr, 1994; Hochschild and Scott, 1998). Because large-scale testing is increasingly used for high-stakes purposes to make decisions that significantly affect the life chances of individual students, the Congress has asked the National
Academy of Sciences, through its National Research Council, for guidance in the appropriate and nondiscriminatory use of such tests.
This study focuses on tests that, by virtue of their use for promotion, tracking, or graduation, have high stakes for individual students. The committee recognizes that accountability for students is related in important ways to accountability for educators, schools, and school districts. This report does not address accountability at those other levels, apart from the issue of participation of all students in large-scale assessments. The report is intended to apply to all schools and school systems in which tests are used for student promotion, tracking, or graduation.
Test form (as mentioned in part B of the congressional mandate) could refer to a wide range of issues, including, for example, the balance of multiple-choice and constructed-response items, the use of student portfolios, the length and timing of the test, the availability of calculators or manipulatives, and the language of administration. However, in considering test form, the committee has chosen to focus on the needs of English-language learners and students with disabilities, in part because these students may be particularly vulnerable to the negative consequences of large-scale assessments. We consider, for these students, in what form and manner a test is most likely to measure accurately a student's achievement of reading and mathematics skills.
Two policy objectives are key for these special populations. One is to increase their participation in large-scale assessments, so that school systems can be held accountable for their educational progress. The other is to test each such student in a manner that accommodates a disability or limited English proficiency, to the extent that either is unrelated to the subject matter being tested, while still maintaining the validity and comparability of test results among all students. These objectives are in tension, and thus present serious technical and operational challenges to test developers and users.
Assessing the Uses of Tests
In its deliberations the committee has assumed that the use of tests in decisions about student promotion, tracking, or graduation is intended to serve educational policy goals, such as setting high standards for student learning, raising student achievement levels, ensuring equal educational opportunity, fostering parental involvement in student learning, and increasing public support for the schools.
Determining whether the use of tests for student promotion, tracking, or graduation produces better overall educational outcomes requires that the various intended benefits of high-stakes test use be weighed against unintended negative consequences for individual students and groups of students. The costs and benefits of testing should also be balanced against those of making high-stakes decisions about students in other ways, using criteria other than test scores; decisions about tracking, promotion, and graduation will be made with or without information from standardized tests. The committee recognizes that test use may have negative consequences for individual students even while serving important social or educational policy purposes. We believe that the development of a comprehensive testing policy should be sensitive to the balance among individual and collective benefits and costs.
The committee follows an earlier work by the National Research Council (1982) in adopting a three-part framework for determining whether a planned or actual test use is appropriate. The three principal criteria are (1) measurement validity—whether a test is valid for a particular purpose and the constructs measured have been correctly chosen; (2) attribution of cause—whether a student's performance on a test reflects knowledge and skill based on appropriate instruction or is attributable to poor instruction or to such factors as language barriers or construct-irrelevant disabilities; and (3) effectiveness of treatment—whether test scores lead to placements and other consequences that are educationally beneficial. This framework leads us to emphasize several basic principles of appropriate test use.
First, the important thing about a test is not its validity in general, but its validity when used for a specific purpose. Thus, tests that are useful in leading the curriculum or in school accountability are not appropriate for use in making high-stakes decisions about individual student mastery unless the curriculum, the teaching, and the tests are aligned.
Second, tests are not perfect. Test questions are a sample of possible questions that could be asked in a given area. Moreover, a test score is not an exact measure of a student's knowledge or skills. A student's score can be expected to vary across different versions of a test—within a margin of error determined by the reliability of the test—as a function of the particular sample of questions asked and/or transitory factors, such as the health of the student on the day of the test.
Third, an educational decision that will have a major impact on a test taker should not be made solely or automatically on the basis of a single test score. Other relevant information about the student's knowledge and skills should also be taken into account.
Finally, neither a test score nor any other kind of information can justify a bad decision. For example, research shows that tracking, as typically practiced, harms students placed in low-track classes. In the absence of better treatments, better tests will not lead to better educational outcomes. Throughout the report, the committee has considered how these principles apply to the appropriate use of tests in decisions about tracking, promotion, and graduation and to possible uses of the proposed voluntary national tests.
Blanket criticisms of testing and assessment are not justified. When tests are used in ways that meet relevant psychometric, legal, and educational standards, students' scores provide important information that, combined with information from other sources, can lead to decisions that promote student learning and equality of opportunity (Office of Technology Assessment, 1992). For example, tests can identify learning differences among students that the education system needs to address. Because decisions about tracking, promotion, and graduation will be made with or without testing, proposed alternatives to testing should be at least equally accurate, efficient, and fair.
It is also a mistake to accept observed test scores as either infallible or immutable. When test use is inappropriate, especially in the case of high-stakes decisions about individuals, it can undermine the quality of education and equality of opportunity. For example, it is wrong to suggest that the lower achievement test scores of racial and ethnic minorities and students from low-income families reflect inalterable realities of American society.1 Such scores reflect persistent inequalities in American society and its schools, and the inappropriate use of test scores can legitimate and reinforce these inequalities. This lends a special urgency to the requirement that test use in connection with tracking, promotion, and graduation should be appropriate and fair. With respect to the use of tests in making high-stakes decisions about students, the committee concludes that statements about the benefits and harms of testing often go beyond what the evidence will support.
In important ways, educational decisions about tracking, promotion, and graduation are different from one another. They differ most importantly in the role that mastery of past material and readiness for new material play as decision-making criteria and in the importance of beneficial educational placement relative to certification as consequences of the decision. Thus, we have considered the role of large-scale, high-stakes testing separately in relation to each type of decision. However, tracking, promotion, and graduation also share common features that pertain to appropriate test use and to their educational and social consequences. These include the alignment between testing and the curriculum, the social and economic sorting that follows from the decisions, the range of educational options potentially linked to the decisions, the use of multiple sources of evidence, the use of tests among young children, and improper manipulation of test score outcomes for groups or individuals. Even though we also raise some of these issues in connection with specific decisions, each of them cuts across two or more types of decisions. We therefore discuss them jointly in this section before turning separately to the use of tests in tracking, promotion, and graduation decisions.
It is a mistake to begin educational reform by introducing tests with high stakes for individual students. If tests are to be used for high-stakes decisions about individual mastery, such use should follow implementation of changes in teaching and curriculum that ensure that students have been taught the knowledge and skills on which they will be tested. Some school systems are already doing this by planning a gap of several years between the introduction of new tests and the attachment of high stakes to individual student performance, during which schools may achieve the necessary alignment among tests, curriculum, and instruction. Others may see high-stakes student testing as a way of leading curricular reform, not recognizing the danger that a test may lack the "instructional validity" required by law (Debra P. v. Turlington, 1981)—that is, a close correspondence between test content and instructional content.
To the extent that all students are expected to meet "world-class" standards, there is a need to provide world-class curricula and instruction to all students. However, in most of the nation, much needs to be done before a world-class curriculum and world-class instruction will be in place (National Academy of Education, 1996). At present, curriculum does not usually place sufficient emphasis on student understanding and
application of concepts, as opposed to memorization and skill mastery. In addition, instruction in core subjects typically has been and remains highly stratified. What teachers teach and what students learn vary widely by track, with those in lower tracks receiving far less than a world-class curriculum. If world-class standards were suddenly adopted, student failure would be unacceptably high (Linn, 1998a).
Recommendation: Accountability for educational outcomes should be a shared responsibility of states, school districts, public officials, educators, parents, and students. High standards cannot be established and maintained merely by imposing them on students.
Recommendation: If parents, educators, public officials, and others who share responsibility for educational outcomes are to discharge their responsibility effectively, they should have access to information about the nature and interpretation of tests and test scores. Such information should be made available to the public and should be incorporated into teacher education and into educational programs for principals, administrators, public officials, and others.
Recommendation: A test may appropriately be used to lead curricular reform, but it should not also be used to make high-stakes decisions about individual students until test users can show that the test measures what students have been taught.
The consequences of high-stakes testing for individual students are often posed as either-or propositions, but this need not be the case. For example, social promotion and simple retention in grade are really only two of many educational strategies available to educators when test scores and other information indicate that students are experiencing serious academic difficulty. Neither social promotion nor retention alone is an effective treatment, and schools can use a number of possible strategies to reduce the need for these either-or choices—for example, by coupling early identification of such students with effective remedial education. Similar observations hold for decisions about tracking and about high school graduation.
Recommendation: Test users should avoid simple either-or options when high-stakes tests and other indicators show that students are doing poorly in school, in favor of strategies combining early intervention and effective remediation of learning problems.
Large-scale assessments are used widely to make high-stakes decisions about students, but they are most often used in combination with other information, as recommended by the major professional and scientific organizations in testing (American Educational Research Association et al., 1985, 1998; Joint Committee on Testing Practices, 1988). For example, according to a recent survey, teacher-assigned grades, standardized tests, developmental factors, attendance, and teacher recommendations form the evidence on which most school districts say that they base promotion decisions (American Federation of Teachers, 1997). A test score, like any other source of information about a student, is not exact. It is an estimate of the student's understanding or mastery of the content that a test was intended to measure.
Recommendation: High-stakes decisions such as tracking, promotion, and graduation should not automatically be made on the basis of a single test score but should be buttressed by other relevant information about the student's knowledge and skills, such as grades, teacher recommendations, and extenuating circumstances.
Problems of test validity are greatest among young children, and there is a greater risk of error when large-scale tests are employed to make significant educational decisions about children who are less than 8 years old or below grade 3—or about their schools. However, well-designed assessments may be useful in monitoring trends in the educational development of populations of students who have reached age 5 (Shepard et al., 1998).
Recommendation: In general, large-scale assessments should not be used to make high-stakes decisions about students who are less than 8 years old or enrolled below grade 3.
All students are entitled to sufficient test preparation, but it is not proper to expose them ahead of time to items that will actually be used on their test or to give them the answers to those questions. Test results may also be invalidated by teaching so narrowly to the objectives of a particular test that scores are raised without actually improving the broader set of academic skills that the test is intended to measure (Koretz et al.,
1991). The committee also recognizes that the desirability of "teaching to the test" is affected by test design. For example, it is entirely appropriate to prepare students by covering all the objectives of a test that represents the full range of the intended curriculum. Thus, it is important that test users respect the distinction between genuine remedial education and teaching narrowly to the specific content of a test.
Recommendation: All students are entitled to sufficient test preparation so their performance will not be adversely affected by unfamiliarity with item format or by ignorance of appropriate test-taking strategies. Test users should balance efforts to prepare students for a particular test format against the possibility that excessively narrow preparation will invalidate test outcomes.
There is an inherent conflict of interest when teachers administer high-stakes tests to their own students or score their own students' exams. On one hand, teachers want valid information about how well their students are performing. On the other hand, there is often substantial external pressure on teachers (as well as principals and other school personnel) for their students to earn high scores. This external pressure may lead some teachers to provide inappropriate assistance to their students before and during the test administration or to mis-score exams. The prevalence of such inappropriate practices varies among and within states and schools. Consequently, when there is evidence of a problem, such as from observations or other data, formal steps should be taken to ensure the validity of the scores obtained. This could include having an external monitoring system with sanctions, or having someone external to the school administer the tests and ensure their security, or both. Only in this way can the scores obtained from high-stakes tests be trusted as providing reasonably accurate results regarding student performance.
Members of some minority groups, English-language learners, and students of low socioeconomic status (SES) are overrepresented in lower-track classes and among those denied promotion or graduation on the basis of test scores. Moreover, these same groups of students are underrepresented in high-track classes, "exam" schools, and "gifted and talented" programs (Oakes et al., 1992). In some cases, such as courses for English-language learners, such disproportions are not problematic. We would not expect to find native English speakers in classes designed to teach English to English-language learners.
In other circumstances, such disproportions raise serious questions.
For example, although the grade placement of 6-year-olds is similar among boys and girls and among racial and ethnic groups, grade retardation among children cumulates rapidly after age 6, and it occurs disproportionately among males and minority group members. Among children 6 to 8 years old in 1987, 17 percent of white females and 22 percent of black males were enrolled below the modal grade for their age. By ages 9 to 11, 22 percent of white females and 37 percent of black males were enrolled below the modal grade for their age. In 1996, when the same children were 15 to 17 years old, 29 percent of white females and 48 percent of black males were either enrolled below the modal grade level for their age or had dropped out of school (U.S. Bureau of the Census, Current Population Reports, Series P-20). These disproportions are especially disturbing in view of other evidence that grade retention and assignment to low tracks have little educational value.
The concentrations of minority students, English-language learners, and low-SES students among those retained in grade, denied high school diplomas, and placed in less demanding classes raise significant questions about the efficacy of schooling and the fairness of major educational decisions, including those made using information from high-stakes tests.
The committee sees a strong need for better evidence on the benefits and costs of high-stakes testing. This evidence should tell us whether particular decisions are educationally beneficial for students, e.g., by increasing academic achievement or reducing school dropout. It is also important to develop statistical reporting systems of key indicators that will track both intended effects (e.g., higher test scores) and other effects (e.g., changes in dropout or special education referral rates). For example, some parents or educators may improperly seek to classify their students as disabled in order to take advantage of accommodations on high-stakes tests. Indicator systems could include measures such as retention rates, special education identification rates, rates of exclusion from assessment programs, number and type of accommodations, high school completion credentials, dropout rates, and indicators of access to high-quality curriculum and instruction.
Recommendation: High-stakes testing programs should routinely include a well-designed evaluation component. Policymakers should monitor both the intended and unintended consequences of high-stakes assessments on all students and on significant subgroups of students, including minorities, English-language learners, and students with disabilities.
Appropriate Uses of Tests in Tracking, Promotion, and Graduation
The intended purpose of tracking is to place each student in an educational setting that is optimal given his or her knowledge, skills, and interests. Support for tracking stems from a widespread belief that students will perform optimally if they receive instruction in homogeneous classes and schools, in which the pace and nature of instruction are tailored to their achievement levels (Oakes et al., 1992; Gamoran and Weinstein, 1998). The research evidence on this point, however, is unclear (Mosteller et al., 1996).
"Tracking" takes many different forms, including: (1) grouping between classes within a grade level based on perceived achievement or skill level; (2) selection for exam schools or gifted and talented programs; (3) identification for remedial education programs, such as "intervention" schools; and (4) referral for possible placement in special education (Oakes et al., 1992; Mosteller et al., 1996). Tracking is common in American schools, but tracking policies and practices vary, not only from state to state and district to district, but also from school to school. Because tracking policies and procedures are both diverse and decentralized, it is difficult to generalize about the use of tests in tracking.
Research suggests that (1) as a result of tracking, the difference in average achievement of students in different classes in the same school is far greater in the United States than in most other countries (Linn, 1998a); (2) instruction in low-track classes is far less demanding than in high-track classes (Welner and Oakes, 1996; McKnight et al., 1987); (3) students in low-track classes do not have the opportunity to acquire knowledge and skills strongly associated with future success; and (4) many students in low-track classes would acquire such knowledge and skills if placed in more demanding educational settings (Slavin et al., 1996; Levin, 1988; Title I of the Elementary and Secondary Education Act).
Recommendation: As tracking is currently practiced, low-track classes are typically characterized by an exclusive focus on basic skills, low expectations, and the least-qualified teachers. Students assigned to low-track classes are worse off than they would be in other placements. This form of tracking should be eliminated. Neither test scores nor other information should be used to place students in such classes.
Some forms of tracking, such as proficiency-based placement in foreign language classes and other classes for which there is a demonstrated need for prerequisites, may be beneficial. We make no attempt here to enumerate all forms of beneficial tracking. The general criterion of what constitutes beneficial tracking is that a student's placement, in comparison with other available placements, yields the greatest chance that the student will acquire the knowledge and skills strongly associated with future success.
The role that tests play in tracking decisions is an important and subtle issue. Educators consistently report that, whereas test scores are routinely used in making tracking decisions, most within-grade tracking decisions are based not solely on test scores but also on students' prior achievement, teacher judgment, and other factors (White et al., 1996; Delany, 1991; Selvin et al., 1990). Research also suggests that "middle class parents intervene to obtain advantageous positions for their children" regardless of test scores or of teacher recommendations (Lucas, in press:206). Nonetheless, even when test scores are just one factor among several that influence tracking decisions, they may carry undue weight by appearing to lend a scientific justification and legitimacy that such decisions would not otherwise have. Some scholars believe that reliance on test scores increases the disproportionate representation of poor and minority students in low-track classes. However, test use can also play a positive role, as when a relatively high test score serves to overcome a negative stereotype (Lucas, in press). Tests may play an important, even dominant, role in selecting children for exam schools and gifted and talented programs (Kornhaber, 1997), and they also play an important part in the special education evaluation process (Individuals with Disabilities Education Act, 1997).
Although standardized tests are often used in tracking decisions, there is considerable variation in what tests are used. Research suggests that some tests commonly employed for tracking are not valid for this purpose (Darling-Hammond, 1991; Glaser and Silver, 1994; Meisels, 1989; Shepard, 1991) but that other standardized tests are.
Although test use varies with regard to tracking, certain test uses are inconsistent with sound psychometric practice and with sound educational policy. These include: (1) using tests not valid for tracking purposes; (2) relying exclusively on test scores in making placement decisions; (3) relying on a test in one subject for placement in other subjects, which in secondary schools may occur indirectly when placement in one class, combined with scheduling considerations, dictates track placements in other subjects (Oakes et al., 1992; Gamoran, 1988); (4) relying on subject-matter tests in English, without appropriate accommodation, in placing English-language learners in certain classes; and (5) failing to reevaluate students periodically to determine whether existing placements remain suitable. It is also inappropriate to use test scores or any other information as a basis for placing children in settings in which their access to higher-order knowledge and skills is denied or limited.
Recommendation: Since tracking decisions are basically placement decisions, tests and other information used for this purpose should meet professional test standards regarding placement.
Recommendation: Because a key assumption underlying placement decisions is that students will benefit more from certain educational experiences than from others, the standard for using a test or other information to make tracking decisions should be accuracy in predicting the likely educational effects of each of several alternative educational experiences.
Recommendation: If a cutscore is to be employed on a test used in making a tracking or placement decision, the quality of the standard-setting process should be documented and evaluated.
Promotion and Retention
The intended purposes of formal promotion and retention policies are (1) to ensure that students acquire the knowledge and skills they need for successful work in higher grades and (2) to increase student and teacher motivation to succeed. Many states and school districts rely on large-scale assessments, some heavily, in making decisions about student promotion and retention at specified grade levels. In the great majority of states and school districts, promotion and retention decisions are based on a combination of grades, test scores, developmental factors, attendance, and teacher recommendations (American Federation of Teachers, 1997). However, the trend is for more states and school districts to base promotion mainly on test scores.
Much of the current public discussion of high-stakes testing is motivated by calls for an end to social promotion. For example, in the Clinton
administration's proposals for educational reform, an end to social promotion is strongly tied to early identification and remediation of learning problems. The proposal also calls for "appropriate use of tests and other indicators of academic performance in determining whether students should be promoted" (Clinton, 1998:3). The key question is whether testing will be used appropriately in such decisions.
Grade retention policies typically have positive intentions but negative consequences. The intended positive consequences are that students will be more motivated to learn and will consequently acquire the knowledge and skills they need at each grade level. The negative consequences, as grade retention is currently practiced, are that retained students persist in low achievement levels and are likely to drop out of school. Low-performing students who have been retained in kindergarten or primary grades lose ground both academically and socially relative to similar students who have been promoted (Holmes, 1989; Shepard and Smith, 1989). In secondary school, grade retention leads to reduced achievement and much higher rates of school dropout (Luppescu et al., 1995; Grissom and Shepard, 1989; Anderson, 1994). At present, the negative consequences of grade retention policies typically outweigh the intended positive effects. Simple retention in grade is an ineffective intervention.
Social promotion and simple retention in grade are only two of the educational interventions available to educators when students are experiencing serious academic difficulty. Schools can use a number of possible strategies to reduce the need for these either-or choices, for example, by coupling early identification of such students with effective remedial education. In this model, schools would identify early those students whose academic performance is weak and would then provide effective remedial education aimed at helping them to acquire the knowledge and skills needed to progress from grade to grade. The effectiveness of such alternative approaches would depend on the quality of the instruction that students received. It is neither simple nor inexpensive to provide high-quality remedial instruction. In the current political environment, the committee is concerned about the possibility that such remediation may be neglected once higher promotion standards have been imposed.
The committee did not attempt to synthesize research on the effectiveness of interventions that combine the threat of in-grade retention with other treatments, such as tutoring, reduced class size, enrichment classes, and remedial instruction after school, on weekends, or during the summer. Some states and localities are carrying out such research internally.
Pressure for test-based promotion decisions has resulted from the common but mistaken perception that social promotion is the norm. In fact, large numbers of students are retained in grade, and grade retention has increased over most of the past 25 years. For example, the percentage of 6- to 8-year-olds enrolled below the modal grade for their age rose from 11 percent in 1971 to a peak of 22 percent in 1990, and it was 18 percent in 1996. The rise reflects a combination of early grade retention and later school entry. At ages 15 to 17, the percentage enrolled below the modal grade for their age rose from 23 percent in 1971 to 31 percent in 1996 (U.S. Bureau of the Census, Current Population Reports, Series P-20). Thus, about 10 percent of students are held back in school between ages 6 to 8 and ages 15 to 17.
In some places, tests are being used inappropriately in making promotion and retention decisions. For example, achieving a certain test score has become a necessary condition of grade-to-grade promotion. This is inconsistent with current and draft revised psychometric standards, which recommend that such high-stakes decisions about individuals should not automatically be made on the basis of a single test score; other relevant information about the student's knowledge and skill should also be taken into account (American Educational Research Association et al., 1985: Standard 8.12; 1998). It is also inconsistent with the explicit recommendations of test publishers about using tests for retention decisions; for example, as noted by Hoover et al. (1994:12): "A test score from an achievement battery should not be used alone in making such a significant decision." Also, some tests used in making promotion decisions have not been validated for this purpose (Shepard, 1991); for example, tests are sometimes used for promotion decisions without having been aligned with the curriculum in either the current or the higher-level grade.
Recommendation: Scores from large-scale assessments should never be the only sources of information used to make a promotion or retention decision. No single source of information—whether test scores, course grades, or teacher judgments—should stand alone in making promotion decisions. Test scores should always be used in combination with other sources of information about student achievement.
Recommendation: Tests and other information used in promotion decisions should adhere, as appropriate, to psychometric standards for placement and to psychometric standards for certifying knowledge and skill.
Recommendation: Tests and other information used in promotion decisions may be interpreted either as evidence of mastery of material already taught or as evidence of student readiness for material at the next grade level. In the former case, test content should be representative of the curriculum at the current grade level. In the latter case, test scores should predict the likely educational effects of future placements—whether promotion, retention in grade, or some other intervention option.
Recommendation: If a cutscore is to be employed on a test used in making a promotion decision, the quality of the standard-setting process should be documented and evaluated—including the qualifications of the judges employed, the method or methods employed, and the degree of consensus reached.
Recommendation: Students who fail should have the opportunity to retake any test used in making promotion decisions; this implies that tests used in making promotion decisions should have alternate forms.
Recommendation: Test users should avoid the simple either-or option to promote or retain in grade when high-stakes tests and other indicators show that students are doing poorly in school, in favor of strategies combining early identification and effective remediation of learning problems.
Awarding or Withholding High School Diplomas
The intended purposes of graduation test requirements are (1) to imbue the high school diploma with some generally recognized meaning, (2) to increase student and teacher motivation, (3) to ensure that students acquire the knowledge and skills they need for successful work or study after high school, and (4) to provide accurate information to parents, educators, and policymakers about student achievement levels. Graduation exams initially focused on basic skills and minimum competencies,
but recently there has been a trend toward graduation tests that assess higher-order skills (American Federation of Teachers, 1997).
In most states, individuals earn high school diplomas based on Carnegie units, which are defined by the number of hours the student has attended class. Because this requirement ensures only that students have passed certain courses (an imprecise and nonuniform measure of what students know at the end of high school), 18 states require that students also pass a competency exam in order to graduate, usually in addition to satisfactorily completing other requirements for graduation (Council of Chief State School Officers, 1998).
Very little is known about the specific consequences of passing or failing a high school graduation examination, as distinct from earning or not earning a high school diploma for other reasons. We do know that earning a high school diploma is associated with better health and with improved opportunities for employment, earnings, family formation and stability, and civic participation (Hauser, 1997; Jaeger, 1989).
The consequences of using high-stakes tests to grant or withhold high school diplomas may be positive or negative. For example, if high-stakes graduation tests motivate students to work harder in school, the result may be increased learning for those who pass the test and, perhaps, even for those who fail. Similarly, if high-stakes tests give teachers and other local educators guidance on what knowledge and skills are most important for students to learn, that may improve curriculum and instruction. In fact, minimum competency tests do appear to have affected instruction, by increasing the amount of class time spent on basic skills (Darling-Hammond and Wise, 1985; Madaus and Kellaghan, 1991; O'Day and Smith, 1993), but available evidence about the possible effects of graduation tests on learning and on high school dropout is inconclusive (e.g., Kreitzer et al., 1989; Reardon, 1996; Catterall, 1990; Cawthorne, 1990; Bishop, 1997).
If students have not been exposed to subject matter included on the test—which is more likely to be the case when tests "lead" curricular change or when tests are for some other reason not aligned with curriculum—this may be inconsistent with relevant legal precedents. If failing to achieve a certain score on a standardized test automatically leads to withholding a diploma, this may be inconsistent with current and draft revised psychometric standards, which recommend that other relevant information also be taken into account in such a high-stakes decision.
The current standards-based reform movement, which calls for high
standards for all students, presents states with possible dilemmas where graduation testing is concerned (Bond and King, 1995). First, states must be able to show that students are being taught what the high-standards tests measure. At present, however, advanced skills are often not well defined and ways of assessing them are not well established. Second, there is evidence that graduation tests geared to high performance levels, such as those currently used in the National Assessment of Educational Progress, would result, at least in the short run, in denying diplomas to a large proportion of students (Linn, 1998b).
The committee recognizes that the passing rate on a test is dependent on the choice of a cutoff or cutscore. Moreover, we recognize that setting a relatively high cutscore will probably lead to large differences in passing rates among groups differing by race or ethnicity, socioeconomic status, gender, level of English proficiency, and disability status, unless ways are found to improve educational opportunities for all. This could include using tests to provide early identification of students who are at risk of failing a graduation test and offering them instruction that would be effective in increasing their chances of passing.
An alternate approach to a single, test-based graduation examination would be to require students to pass each of a series of end-of-course exams. Another policy would allow students to offset a low score in one area with a high score in another. A third approach would be to offer "endorsed" diplomas to students who have passed a test without denying a diploma to those who have failed a graduation test but completed all other requirements. The committee did not attempt to synthesize research on the effectiveness of interventions that combine the threat of diploma denial with other treatments, such as remedial instruction after school, on weekends, and during the summer. We do not know how best to combine advance notice of high-stakes test requirements, remedial intervention, and opportunity to retake graduation tests. Research is also needed to explore the effects of different kinds of high school credentials on employment and other post-school outcomes.
Recommendation: High school graduation decisions are inherently certification decisions; the diploma should certify that the student has achieved acceptable levels of learning. Tests and other information used for this purpose should afford each student a fair opportunity to demonstrate the required levels of knowledge and skill in accordance with psychometric standards for certification tests.
Recommendation: Graduation tests should provide evidence of mastery of material taught. Thus, there is a need for evidence that the test content is representative of what students have been taught.
Recommendation: The quality of the process of setting a cutscore on a graduation test should be documented and evaluated—including the qualifications of the judges employed, the method or methods employed, and the degree of consensus reached.
Recommendation: Students who are at risk of failing a graduation test should be advised of their situation well in advance and provided with appropriate instruction that would improve their chances of passing.
Recommendation: Research is needed on the effects of high-stakes graduation tests on teaching, learning, and high school completion. Research is also needed on alternatives to test-based denial of the high school diploma, such as endorsed diplomas, end-of-course tests, and combining graduation test scores with other indicators of knowledge and skill in making the graduation decision.
Using the Voluntary National Tests for Tracking, Promotion, or Graduation Decisions
The purpose of the proposed voluntary national tests (VNTs) is to inform students (and their parents and teachers) about their performance in 4th grade reading and 8th grade mathematics relative to the standards of the National Assessment of Educational Progress (NAEP) and the Third International Mathematics and Science Study (TIMSS).
The VNT proposal does not suggest any direct use of the test scores to make decisions about the tracking, promotion, or graduation of individual students. Indeed, representatives of the U.S. Department of Education have stated that the VNT is not intended for use in making such decisions. Nonetheless, some civil rights organizations and other groups have expressed concern that test users would inappropriately use the scores for such purposes. Indeed, under the proposed plan, test users—including states, school districts, and schools—would be free to use the
tests as they pleased, just as test users are now free to use commercial tests for purposes other than those recommended by test developers and publishers. Accordingly—and because this study was requested in the context of the discussion of the VNT—the committee has considered whether it would be appropriate to make tracking, promotion, or graduation decisions about individual students based on their VNT scores.
For tracking decisions, use of VNT scores would necessarily be limited to placement in 5th grade reading and 9th grade mathematics. Moreover, using the scores to make future class placements would be valid only to the extent that the VNT scores were predictive of success in future placements. VNT proficiency levels, which are expected to be the same as those of NAEP, do not correspond well with other common definitions of proficiency: those embodied in current state content and performance standards (Linn, 1998b), those found in such widely used tests as the SAT and advanced placement exams (Shepard et al., 1993), and those used in traditional tracking systems. Indeed, the large share of students who score below the basic level in NAEP has led to justifiable concerns that reports of achievement on the VNT will provide little information about lower levels of academic performance.
For promotion decisions, there is no guarantee that the framework or content of the VNT assessments would be aligned with the curriculum that students have experienced or would experience in the next-higher grade. Moreover, to make reliable distinctions based on the NAEP proficiency levels, the tests should include many items near the levels of difficulty that separate proficiency levels. Even with a focus on the NAEP proficiency levels, some testing experts have been concerned about the accuracy of VNT results, and if the tests are not accurate enough for descriptive purposes, they will surely not be accurate enough to use in making high-stakes decisions about individual students.
The need for multiple versions of a promotion test conflicts with the plan to release all VNT test items and their correct answers. In high-stakes testing situations, demands for fairness, as well as our criteria of validity and reliability in measurement, require that students who fail a promotion test be permitted to retake comparable versions of the test. However, there is no plan to develop or release extra forms of the VNT assessments for use in "second-chance" administrations, and, if such extra forms were developed, this would add to problems of test security.
Similar concerns apply to the use of VNT scores in making high school graduation decisions. It is doubtful that the results of a 4th grade reading test would be of any use in determining an
individual's fitness to receive a high school diploma. Although some states have deemed achievement at the 8th grade level sufficient to meet their graduation standard in mathematics, the lack of alternative test forms to allow students opportunities to retake the test makes the VNT inappropriate for this purpose.
There are clear incompatibilities between features of the VNT that would facilitate its use as a tool for informing students, parents, and teachers about student achievement, on one hand, and possible uses of the scores in making decisions about tracking, promotion, or graduation of individual students, on the other hand.
Recommendation: The voluntary national tests should not be used for decisions about the tracking, promotion, or graduation of individual students.
Recommendation: If the voluntary national tests are implemented, the federal government should issue regulations or guidance to ensure that VNT scores are not used for decisions about the tracking, promotion, or graduation of individual students.
The committee takes no position on whether the VNT is practical or appropriate for its primary stated purposes.
Forms of Testing: Participation and Accommodations
Students with Disabilities
Recent legislative initiatives at both the federal and state levels mandate that all students be included in large-scale assessment programs, including those with special language and learning needs (Goals 2000, 1994; Improving America's Schools Act, 1994). For students with disabilities, the mandate is particularly strong due to the recently amended Individuals with Disabilities Education Act of 1997 (IDEA), which requires states and districts to provide for such participation as a condition of eligibility. However, in many cases, the demands that full participation of these students places on assessment systems are greater than current assessment knowledge and technology can support.
Participation of students with disabilities in large-scale assessments is important to ensure that schools are held accountable for the educational performance of these students and to obtain a fully representative, accurate
picture of overall student performance. When these assessments are used to make high-stakes decisions about individual students, the potential negative consequences are likely to fall most heavily on groups with special learning needs, such as students with disabilities.
More than 5 million students with disabilities participate in special education programs under the IDEA. They vary widely in the severity of disability, educational goals, and degree of involvement in the general education curriculum. Although federal legislation defines 13 categories of disability, 90 percent of all special education students have one of four disabilities: speech or language impairments, serious emotional disturbance, mental retardation, or specific learning disabilities. This diversity has important implications for how students with disabilities participate in large-scale assessments: for example, some participate fully in ways that are indistinguishable from their general education peers, some require modifications or accommodations in the testing procedure, and others are exempted from participation entirely (National Research Council, 1997).
For a number of reasons, many students with disabilities have previously not been included in the large-scale assessment programs conducted by their states and districts (National Research Council, 1997). In order for some students with disabilities to participate, accommodations—such as braille versions, alternate settings, extended time, and calculators—will need to be provided during testing. The purpose of accommodations is to correct for the impact of a disability that is unrelated to the subject matter being tested; in essence, the disability interferes with the student's capacity to demonstrate what he or she truly knows about the subject (Willingham, 1988).
Validity will be improved when testing accommodations are designed to correct for distortions in scores caused by specific disabilities. However, accommodations should be independent of the construct being measured (Phillips, 1993, 1994; American Educational Research Association et al., 1985). Determining whether an accommodation is independent of the construct is difficult for some types of disability, especially cognitive disabilities. Moreover, there is little research on how to design accommodations, a problem that is exacerbated by the lack of a reliable taxonomy for describing disabilities. Some strategies, such as computer adaptive testing for students who need extra time, may accommodate a large share of students with special needs without threatening the validity of test results. However, more research and development—along with access to the technology—are needed to bring this and other strategies into widespread use.
Accommodations should therefore be offered for two purposes: (1) to increase the participation of students with disabilities in large-scale assessments and (2) to increase the validity of the test score information. These two objectives—obtaining valid information while still testing all students—create a sizable policy tension for the design of assessment systems, particularly when they involve high stakes.
Recommendation: More research is needed to enable students with disabilities to participate in large-scale assessments in ways that provide valid information. This goal significantly challenges current knowledge and technology about measurement and test design and the infrastructure needed to achieve broad-based participation.
In addition, students with disabilities are rarely included in adequate numbers in the pilot samples when new assessments are being developed; oversampling may be necessary to permit key statistical analyses, such as determining the impact of accommodations on test scores, norm development, and analyses of differential item functioning (Olson and Goldstein, 1997).
Recommendation: The needs of students with disabilities should be considered throughout the test development process.
As the stakes of testing become higher, there is a greater need to establish the validity of tests administered to students with disabilities. At present, policies on the kinds of testing accommodations offered and to whom they are offered vary widely from place to place (Thurlow et al., 1993). New federal regulations require that the individual education program (IEP) document the decisions made about each child's participation in assessments and the type and nature of the accommodations needed. The proportion of students that require accommodations will depend on the purpose, format, and content of the assessment.
Parents of students with disabilities play unique roles as advocates for their children's rights, important participants in the IEP process, and monitors of accountability and enforcement. If high stakes are to be attached to the assessment of students with disabilities, then parents and other members of the IEP team will need to be able to make informed choices about the nature and extent of a student's participation in the assessment and its possible implications for future education and post-school outcomes.
Recommendation: Decisions about how students with disabilities will participate in large-scale assessments should be guided by criteria that are as systematic and objective as possible. They should also be applied on a case-by-case basis as part of the child's individual education program and consistent with the instructional accommodations that the child receives.
Recommendation: If a student with disabilities is subject to an assessment used for promotion or graduation decisions, the IEP team should ensure that the curriculum and instruction received by the student through the individual education program is aligned with test content and that the student has had adequate opportunity to learn the material covered by the test.2
Although the basic principle should be to include all students with disabilities in the large-scale assessments, and to provide accommodations to enable them to do so, some students are likely to need to participate in a different or substantially modified assessment; the size of this group will depend on the nature of the assessment and the content being assessed. Obtaining meaningful information about the educational achievement and progress of these students is difficult. However, when the stakes are high, such as in deciding whether a student receives a diploma, it is critical for students who cannot take the test to have alternate ways of demonstrating proficiency. For students whose curriculum differs substantially from the general curriculum, there may also be a need to develop meaningful alternative credentials that can validly convey the nature of the student's accomplishments.
Recommendation: Students who cannot participate in a large-scale assessment should have alternate ways of demonstrating proficiency.
Recommendation: Because a test score may not be a valid representation of the skills and achievement of students with disabilities, high-stakes decisions about these students should consider other sources of evidence such as grades, teacher recommendations, and other examples of student work.
English-Language Learners
Federal and state mandates increasingly require the inclusion of English-language learners in large-scale assessments of achievement (Goals 2000 [P.L. 103–227], Title I [Helping Disadvantaged Children Meet High Standards] and Title VII [Bilingual Education] of the Improving America's Schools Act of 1994 [P.L. 103–382]). In particular, high-stakes tests are used with English-language learners for decisions related to tracking, promotion, and graduation, as well as for system-wide accountability. The demands that full participation of English-language learners makes on assessment systems are greater than current knowledge and technology can support. In addition, there are fewer procedural safeguards for English-language learners under federal law than for students with disabilities.
When English-language learners are not proficient in the language of the assessment, their scores will not accurately reflect their knowledge. Thus, requiring those who are not proficient in English to take an English-language version of a test without accommodations will produce invalid information about their true achievement (American Educational Research Association et al., 1985; National Research Council and Institute of Medicine, 1997). This can lead to poor decisions about individuals and about English-language learners as a group, as well as about school systems in which they are heavily represented.
Understanding the performance of English-language learners on achievement assessments requires satisfactory assessments of English-language proficiency, in order to determine whether poor performance is attributable to lack of knowledge of the test content or weak skills in English. Lack of a clear or consistent definition of language proficiency, and of indicators or measures of it, contributes to the difficulty of making these decisions more systematically (Olson and Goldstein, 1997).
Research evidence to date does not allow us to be certain about the meaning of test scores for students who are not yet proficient in English and who have received accommodations or modifications in test procedures. For any examination system employing accommodations or modifications, test developers or test users should conduct research to determine whether the constructs measured are the same for all children (Hambleton and Kanjee, 1994; Olson and Goldstein, 1997).
Accommodations and alternative tests should be provided (1) to increase the participation of English-language learners in large-scale assessments and (2) to increase the validity of test results. These two
objectives—obtaining valid information while still testing all English-language learners—create a sizable policy tension for the design of assessment systems, particularly when they involve high stakes.
Recommendation: Systematic research that investigates the impact of specific accommodations on the test performance of both English-language learners and other students is needed. Accommodations should be investigated to see whether they reduce construct-irrelevant sources of variance for English-language learners without disadvantaging other students who do not receive accommodations. The relationship of test accommodations to instructional accommodations should also be studied.
Recommendation: Development and implementation of alternative measures, such as primary-language assessments, should be accompanied by information regarding the validity, reliability, and comparability of scores on primary-language and English assessments.
A sufficient number of English-language learners should be included when items are developed and pilot-tested and in the norming of assessments (Hambleton and Kanjee, 1994). Experts in the assessment of English-language learners might work with test developers to maintain the content difficulty of items while making the language of the instructions as well as actual test items more comprehensible. These modifications would have to be accomplished without making the assessment invalid for other students.
Recommendation: The learning and language needs of English-language learners should be considered during test development.
Various strategies can be used to obtain valid information about the achievement of English-language learners in large-scale assessments. These include native-language assessments and modifications that decrease the English-language load. Such strategies, however, are often employed inconsistently from place to place and from student to student. Monitoring of educational outcomes for English-language learners as a group is needed to determine the intended and unintended consequences of their participation in large-scale assessments.
Recommendation: Policy decisions about how individual English-language learners will participate in large-scale assessments—such as the language and accommodations to be used—should balance the demands of political accountability with professional standards of good testing practice. These standards require evidence that such accommodations or alternate forms of assessment lead to valid inferences regarding performance.
Recommendation: States, school districts, and schools should report and interpret disaggregated assessment scores of English-language learners when psychometrically sound for the purpose of analyzing their educational outcomes.
In addition, the role of the test score in decision making needs careful consideration when its meaning is uncertain. For example, invalid low scores on the test may lead to inappropriate placement in treatments that have not been demonstrated to be effective. Multiple sources of information should be used to supplement test score data obtained from large-scale assessment of students who are not language proficient, particularly when decisions will be made about individual students on the basis of the test (American Educational Research Association et al., 1985).
Recommendation: Placement decisions based on tests should incorporate information about educational accomplishments, particularly literacy skills, in the primary language. Certification tests (e.g., for high school graduation) should be designed to reflect state or local deliberations and decisions about the role of English-language proficiency in the construct to be assessed. This allows for subject-matter assessment in English only, in the primary language, or using a test that accommodates English-language learners by providing English-language assistance, primary language support, or both.
Recommendation: As for all learners, interpretation of the test scores of English-language learners for promotion or graduation should be accompanied by information about opportunities to master the material tested. For English-language learners, this includes information about educational history, exposure to instruction in the primary language and in English, language resources in the home, and exposure to the mainstream curriculum.
Potential Strategies for Promoting Appropriate Test Use
The two existing mechanisms for promoting and enforcing appropriate test use—professional standards and legal enforcement—are important but inadequate.
The Joint Standards and the Code of Fair Testing Practices in Education, the ethical codes of the testing profession, are written in broad terms and are not always easy to interpret in particular situations. In addition, enforcement of the Joint Standards and the Code depends chiefly on professional judgment and goodwill. Moreover, professional self-regulation does not cover the behavior of individuals outside the testing profession. Many users of educational test results—policymakers, school administrators, and teachers—are unaware of the Joint Standards and are untrained in appropriate test use (Office of Technology Assessment, 1992).
Litigation, the other existing mechanism, also has limitations. Most of the pertinent statutes and regulations protect only certain groups of students, and most court decisions are not binding everywhere. Court decisions in different jurisdictions sometimes contradict one another (Larry P. v. Riles, 1984; Parents in Action on Special Education v. Hannon, 1980). Some courts insist that educators observe the principles of test use in the Joint Standards (Office of Technology Assessment, 1992:73–74) and others do not (United States v. South Carolina, 1977). And court challenges are often expensive, divisive, and time-consuming. In sum, federal law is a patchwork of rules rather than a coherent set of norms governing proper test use, and enforcement is similarly uneven.
The committee has explored four possible alternative mechanisms that have been applied to problems similar to that of improper test use and about which empirical literature exists. It offers these as alternatives, some less coercive and others more so, that could supplement professional standards and litigation as means of promoting and enforcing appropriate test use.
• Deliberative forums: In these forums, citizens would meet with policymakers to discuss and make important decisions about testing. In this model, all participants have equal standing and are more likely to accept decisions, even those with which they disagree, because they feel that they have had an opportunity to influence the outcome. All parties with a stake in assessments would be represented. Such discussions could
help define what constitutes "educational quality" and "achievement to high standards," the role that tests should play in shaping and measuring progress toward those goals, and the level of measurement error that is acceptable where test scores are used in making high-stakes decisions about students.
Broad public interest in testing makes this a good time to consider establishing such forums. We note, however, some potential limitations of this strategy: a scarcity of successful examples, the reluctance of those with authority to part with it, and the large amounts of time and patience it would require.
• An independent oversight body: George Madaus and colleagues have proposed creating an independent organization to monitor and audit high-stakes testing programs (Madaus, 1992; Madaus et al., 1997). It would not have regulatory powers but would provide information to the public about tests and their use, highlighting best practices in testing. It could supplement a labeling strategy (see below) by educating policymakers, practitioners, and the public about test practice. It could deter inappropriate test use by creating adverse publicity (Ernest House, personal communication, 1998).
The shortcomings of this proposal stem from the monitoring body's lack of formal authority: it could not require test publishers or school administrators to submit testing programs for review, and test users would be under no obligation to accept its judgments. Policymakers interested in the work of such a body would also need to guard against unintended negative consequences.
• Labeling: Test producers could be required to report to test users about the appropriate uses and limitations of their tests. A second target of a labeling strategy would be test consumers: parents, students, the public, and the media. Relevant information could include the purpose of the test, intended uses of individuals' scores, consequences for individual students, steps taken to validate the test for its intended use, evidence that the test measures what students have been taught, other information used with test scores to make decisions about individual students, and options for questioning decisions based on test scores.
Limitations of this strategy include limited data on its effectiveness, the obstacles many parents face when they seek to challenge policies and actions with which they disagree, and the ineffectiveness of test labeling when the real problem is poor instruction rather than improper test use.
• Federal regulation: Federal statutes could be amended to include standards of appropriate test use. Title I regulations could be revised to ensure that large-scale assessments comply with established professional standards. State Title I plans could address the extent to which state and local assessment systems meet these professional norms. Title VI of the Civil Rights Act of 1964 and Title IX of the Education Amendments of 1972 prohibit federal fund recipients from discriminating on the basis of race, national origin, or sex; both have been cited in disputes about tests that carry high stakes for students. Under existing regulations, when a test has disproportionate adverse impact, the recipient of federal funds must demonstrate that the test and its use are an "educational necessity." Federal regulations do not, however, define this term. Thus, federal regulations could define educational necessity in terms of compliance with professional testing standards.
The advantages of this strategy include that it would apply to all 50 states and virtually all school districts. Federal regulations could also be a powerful tool for educating policymakers and the public about appropriate test use. And relying on them would make use of existing administrative and judicial mechanisms to promote adherence to testing standards that now go largely unenforced.
One risk of the regulatory approach is that the principal sanction for noncompliance, cutting off federal aid, makes it unwieldy. Moreover, political resistance to federal regulation creates the risk of a backlash that would make it more difficult for the U.S. Department of Education to guide local practice in testing and other areas. This strategy would also be subject to many of the usual disadvantages of administrative and judicial enforcement.
The committee is not recommending adoption of any particular strategy or combination of strategies, nor does it suggest that these four approaches are the only possibilities. We do think, however, that ensuring proper test use will require multiple strategies. Given the inadequacy of current methods, practices, and safeguards, further research is needed on these and other policy options to illuminate their possible effects on test use. In particular, we would suggest empirical research on the effects of these strategies, individually and in combination, on testing products and practice, and an examination of the associated potential benefits and risks.
References

American Educational Research Association, American Psychological Association, and National Council on Measurement in Education 1985 Standards for Educational and Psychological Testing. Washington, DC: American Psychological Association.
1998 Draft Standards for Educational and Psychological Testing. Washington, DC: American Psychological Association.
American Federation of Teachers 1997 Passing on Failure: District Promotion Policies and Practices. Washington, DC: American Federation of Teachers.
Anderson, D.K. 1994 Paths Through Secondary Education: Race/Ethnic and Gender Differences. Unpublished doctoral thesis, University of Wisconsin-Madison.
Bishop, John H. 1997 Do Curriculum-Based External Exit Exam Systems Enhance Student Achievement? New York: Consortium for Policy Research in Education and Center for Advanced Human Resource Studies, Cornell University.
Bond, Linda A., and Diane King 1995 State High School Graduation Testing: Status and Recommendations. North Central Regional Educational Laboratory.
Catterall, James S. 1990 A Reform Cooled-Out: Competency Tests Required for High School Graduation. CSE Technical Report 320. UCLA Center for Research on Evaluation, Standards, and Student Assessment.
Cawthorne, John E. 1990 "Tough" Graduation Standards and "Good" Kids. Chestnut Hill, MA: Boston College, Center for the Study of Testing, Evaluation and Educational Policy.
Ceci, Stephen J., Tina B. Rosenblum, and Matthew Kumpf 1998 The shrinking gap between high- and low-scoring groups: Current trends and possible causes. Pp. 287–302 in The Rising Curve, Ulric Neisser, ed. Washington, DC: APA Books.
Clinton, W.J. 1998 Memorandum to the Secretary of Education. Press release. Washington, DC: The White House.
Council of Chief State School Officers 1998 Survey of State Student Assessment Programs. Washington, DC: Council of Chief State School Officers.
Darling-Hammond, L. 1991 The implications of testing policy for quality and equality. Phi Delta Kappan 73(3):220–225.
Darling-Hammond, L., and A. Wise 1985 Beyond standardization: State standards and school improvement. Elementary School Journal 85(3).
Delany, B. 1991 Allocation, choice, and stratification within high schools: How the sorting machine copes. American Journal of Education 99(2):181–207.
Gamoran, A. 1988 A Multi-level Analysis of the Effects of Tracking. Paper presented at the annual meeting, American Sociological Association, Atlanta, GA.
Gamoran, A., and M. Weinstein 1998 Differentiation and opportunity in restructured schools. American Journal of Education 106:385–415.
Glaser, R., and E. Silver 1994 Assessment, Testing, and Instruction: Retrospect and Prospect. Los Angeles, CA: National Center for Research on Evaluation, Standards, and Student Testing.
Grissmer, David, Stephanie Williamson, Sheila N. Kirby, and Mark Berends 1998 Exploring the rapid rise in black achievement scores in the United States (1970–1990). Pp. 251–285 in The Rising Curve, Ulric Neisser, ed. Washington, DC: APA Books.
Grissom, J.B., and L.A. Shepard 1989 Repeating and dropping out of school. Pp. 34–63 in Flunking Grades: Research and Policies on Retention, L.A. Shepard and M.L. Smith, eds. London: Falmer Press.
Hambleton, R.K., and A. Kanjee 1994 Enhancing the validity of cross-cultural studies: Improvements in instrument translation methods. In International Encyclopedia of Education (2nd Ed.), T. Husen and T.N. Postlewaite, eds. Oxford, UK: Pergamon Press.
Hauser, R.M. 1997 Indicators of high school completion and dropout. Pp. 152–184 in Indicators of Children's Well-Being, R.M. Hauser, B.V. Brown, and W.R. Prosser, eds. New York: Russell Sage Foundation.
1998 Trends in black-white test score differentials: 1. Uses and misuses of NAEP/SAT data. Pp. 219–249 in The Rising Curve, Ulric Neisser, ed. Washington, DC: APA Books.
Hochschild, J., and B. Scott 1998 Trends: Governance and reform of public education in the United States. Public Opinion Quarterly 62(1):79–120.
Holmes, C.T. 1989 Grade level retention effects: A meta-analysis of research studies. Pp. 16–33 in Flunking Grades: Research and Policies on Retention, L.A. Shepard and M.L. Smith, eds. London: Falmer Press.
Hoover, H.D., A.N. Hieronymus, D.A. Frisbie, et al. 1994 Interpretive Guide for School Administrators: Iowa Test of Basic Skills, Levels 5–14. University of Iowa: Riverside Publishing Company.
Huang, Min-Hsiung, and Robert M. Hauser 1998 Trends in black-white test-score differentials: 2. The WORDSUM Vocabulary Test. Pp. 303–332 in The Rising Curve, Ulric Neisser, ed. Washington, DC: APA Books.
Jaeger, Richard M. 1989 Certification of student competence. In Educational Measurement, 3rd Ed. Robert L. Linn, ed. New York: Macmillan.
Johnson, J., and J. Immerwahr 1994 First Things First: What Americans Expect from the Public Schools. New York: Public Agenda.
Joint Committee on Testing Practices 1988 Code of Fair Testing Practices in Education. Washington, DC: National Council on Measurement in Education.
Koretz, D.M., R.L. Linn, S.B. Dunbar, and L.A. Shepard 1991 The Effects of High-Stakes Testing on Achievement: Preliminary Findings About Generalization Across Tests. Paper presented at the annual meeting of the American Educational Research Association and the National Council on Measurement in Education. Chicago, IL (April).
Kornhaber, M. 1997 Seeking Strengths: Equitable Identification for Gifted Education and the Theory of Multiple Intelligences. Doctoral dissertation, Harvard Graduate School of Education.
Kreitzer, A.E., G.F. Madaus, and W. Haney 1989 Competency testing and dropouts. Pp. 129–152 in Dropouts from School: Issues, Dilemmas and Solutions, L. Weis, E. Farrar, and H.G. Petrie, eds. Albany: State University of New York Press.
Levin, H. 1988 Accelerated Schools for At-risk Students. New Brunswick, NJ: Center for Policy Research in Education.
Linn, Robert 1998a Assessments and Accountability. Paper presented at the annual meeting of the American Educational Research Association, April, San Diego.
1998b Validating inferences from National Assessment of Educational Progress achievement-level setting. Applied Measurement in Education 11(1):23–47.
Lucas, S. in press Stratification Stubborn and Submerged: Inequality in School After the Unremarked Revolution. New York: Teachers College Press.
Luppescu, S., A.S. Bryk, P. Deabster, et al. 1995 School Reform, Retention Policy, and Student Achievement Gains. Chicago, IL: Consortium on Chicago School Research.
Madaus, G.F., W. Haney, K.B. Newton, and A.E. Kreitzer 1993 A Proposal for a Monitoring Body for Tests Used in Public Policy. Boston: Center for the Study of Testing, Evaluation, and Public Policy.
1997 A Proposal to Reconstitute the National Commission on Testing and Public Policy as an Independent, Monitoring Agency for Educational Testing. Boston: Center for the Study of Testing, Evaluation and Educational Policy.
Madaus, G.F., and T. Kellaghan 1991 Examination Systems in the European Community: Implications for a National Examination System in the U.S. Paper prepared for the Science, Education and Transportation Program, Office of Technology Assessment, U.S. Congress, Washington, DC.
McKnight, C.C., F.J. Crosswhite, J.A. Dossey, E. Kifer, S.O. Swafford, K. Travers, and T.J. Cooney 1987 The Underachieving Curriculum: Assessing U.S. School Mathematics from an International Perspective. Champaign, IL: Stipes Publishing.
Meisels, S.J. 1989 Testing, Tracking, and Retaining Young Children: An Analysis of Research and Social Policy. Commissioned paper for the National Center for Education Statistics.
Mosteller, F., R. Light, and J. Sachs 1996 Sustained inquiry in education: Lessons from skill grouping and class size. Harvard Educational Review 66(4):797–843.
National Academy of Education 1996 Quality and Utility: The 1994 Trial State Assessment in Reading, Robert Glaser, Robert Linn, and George Bohrnstedt, eds. Panel on the Evaluation of the NAEP Trial State Assessment. Stanford, CA: National Academy of Education.
National Research Council 1982 Placing Children in Special Education: A Strategy for Equity, K.A. Heller, W.H. Holtzman, and S. Messick, eds. Committee on Child Development Research and Public Policy, National Research Council. Washington, DC: National Academy Press.
1997 Educating One and All: Students with Disabilities and Standards-Based Reform, L.M. McDonnell, M.L. McLaughlin, and P. Morison, eds. Committee on Goals 2000 and the Inclusion of Students with Disabilities, Board on Testing and Assessment. Washington, DC: National Academy Press.
National Research Council and Institute of Medicine 1997 Improving Schooling for Language-Minority Children, Diane August and Kenji Hakuta, eds. Board on Children, Youth, and Families. Washington, DC: National Academy Press.
O'Day, Jennifer A., and Marshall S. Smith 1993 Systemic reform and educational opportunity. In Designing Coherent Educational Policy, Susan H. Fuhrman, ed. San Francisco: Jossey-Bass.
Oakes, J., A. Gamoran, and R. Page 1992 Curriculum differentiation: Opportunities, outcomes, and meanings. In Handbook of Research on Curriculum, P. Jackson, ed. New York: Macmillan.
Office of Technology Assessment 1992 Testing in American Schools: Asking the Right Questions. OTA-SET-519. Washington, DC: U.S. Government Printing Office.
Olson, J.F., and A.A. Goldstein 1997 The Inclusion of Students with Disabilities and Limited English Proficient Students in Large-scale Assessments: A Summary of Recent Progress. NCES 97–482. Washington, DC: U.S. Department of Education, Office of Educational Research and Improvement.
Phillips, S.E. 1993 Testing accommodations for disabled students. Education Law Reporter 80:9–32.
1994 High-stakes testing accommodations: Validity versus disabled rights. Applied Measurement in Education 7(2):93–120.
Reardon, Sean F. 1996 Eighth Grade Minimum Competency Testing and Early High School Dropout Patterns. Paper presented at the annual meeting of the American Educational Research Association, New York, April.
Selvin, M.J., J. Oakes, S. Hare, K. Ramsey, and D. Schoeff 1990 Who Gets What and Why: Curriculum Decisionmaking at 3 Comprehensive High Schools. Santa Monica, CA: Rand.
Shepard, L.A. 1991 Negative policies for dealing with diversity: When does assessment and diagnosis turn into sorting and segregation? In Literacy for a Diverse Society: Perspectives, Practices and Policies, E. Hiebert, ed. New York: Teachers College Press.
Shepard, L., et al. 1993 Evaluating test validity. Review of Research in Education 19:405–450.
Shepard, L., S. Kagan, and E. Wurtz, eds. 1998 Principles and Recommendations for Early Childhood Assessments. Washington DC: National Education Goals Panel.
Shepard, L.A., and M.L. Smith 1989 Academic and emotional effects of kindergarten retention in one school district. Pp. 79–107 in Flunking Grades: Research and Policies on Retention, L.A. Shepard and M.L. Smith, eds. London: Falmer Press.
Slavin, R.E., et al. 1996 Every Child, Every School: Success for All. Thousand Oaks, CA: Corwin Press.
Thurlow, M.L., J.E. Ysseldyke, and B. Silverstein 1993 Testing Accommodations for Students with Disabilities: A Review of the Literature. Synthesis Report 4. Minneapolis, MN: National Center on Educational Outcomes, University of Minnesota.
Welner, K.G., and J. Oakes 1996 (Li)Ability grouping: The new susceptibility of school tracking systems to legal challenges. Harvard Educational Review 66(3):451–470.
White, P., A. Gamoran, J. Smithson, and A. Porter 1996 Upgrading the high school math curriculum: Math course-taking patterns in seven high schools in California and New York. Educational Evaluation and Policy Analysis 18(4):285–307.
Willingham, W.W. 1988 Introduction. Pp. 1–16 in Testing Handicapped People, W.W. Willingham, M. Ragosta, R.E. Bennett, H. Braun, D.A. Rock, and D.E. Powers, eds. Boston, MA: Allyn and Bacon.
Legal References

Debra P. v. Turlington, 474 F. Supp. 244 (M.D. Fla. 1979); aff'd in part and rev'd in part, 644 F.2d 397 (5th Cir. 1981); rem'd, 564 F. Supp. 177 (M.D. Fla. 1983); aff'd, 730 F.2d 1405 (11th Cir. 1984).
Goals 2000, Educate America Act, 20 U.S.C. sections 5801 et seq.
Improving America's Schools Act, 1994.
Individuals with Disabilities Education Act, 20 U.S.C. section 1401 et seq.
Larry P. v. Riles, 495 F. Supp. 926 (N.D. Cal. 1979); aff'd, 793 F.2d 969 (9th Cir. 1984).
Parents in Action on Special Education (PASE) v. Hannon, 506 F. Supp. 831 (N.D. Ill. 1980).
Title I, Elementary and Secondary Education Act, 20 U.S.C. sections 6301 et seq.
United States v. South Carolina