Suggested Citation:"1 Introduction." National Research Council. 1999. High Stakes: Testing for Tracking, Promotion, and Graduation. Washington, DC: The National Academies Press. doi: 10.17226/6336.

1
Introduction

Most people seem to agree that America's public schools are in need of repair. How to fix them has become a favorite topic of policymakers, and for many the remedy includes increased reliance on the testing of students. The standards-based reform movement, for example, is premised on the idea of setting clear, high standards for what children are supposed to learn and then holding students—and often educators and schools—to those standards.

The logic seems clear: Unless we test students' knowledge, how will we know if they have met the standards? And the idea of accountability, which is also central to this theory of school reform, requires that the test results have direct and immediate consequences: a student who does not meet the standard should not be promoted, or awarded a high school diploma. This report is about the appropriate use of tests in making such high-stakes decisions about individual students.

In his 1997 State of the Union address, President Clinton challenged the nation to undertake "a national crusade for education standards, not federal government standards, but national standards, representing what all our students must know to succeed in the knowledge economy of the twenty-first century. . . . Every state should adopt high national standards, and by 1999, every state should test every fourth-grader in reading and every eighth-grader in math to make sure these standards are met. . . . Good tests will show us who needs help, what changes in teaching to make, and which schools need to improve. They can help us to end social promotion. For no child should move from grade school to junior high, or junior high to high school until he or she is ready."

Test-based reform strategies have enjoyed wide acceptance across the political spectrum—at least in theory—for two reasons. First, who could possibly be against "high standards"? Second, most Americans believe in the accuracy and fairness of judging students by what the president called "good tests." But what constitutes a good test? How do we know a test is good—that it really measures what it is supposed to measure? And, equally important, how do we know that the test and its results are being used properly by the teachers and administrators who have the power to make important decisions about individual children?

In fact, the use of tests in school reform raises difficult questions in relation to so-called high-stakes consequences for students—that is, when an individual student's score determines not just who needs help but whether a student is allowed to take a certain program or class, or will be promoted to the next grade, or will graduate from high school. Despite the appearance of mathematical exactness in a numerical score, standardized achievement tests do not yield exact measurements of what individuals know and can do. Tests and their applications are subject to both statistical and human error. Tests useful for some purposes are inappropriate for others. Can we be sure that the use of tests for high-stakes decisions will lead to better outcomes for all students, regardless of their special educational needs or their social, economic, racial, or ethnic backgrounds?

The very term "high stakes" embodies both the hopes and the fears these tests inspire. Only if the stakes are high, say their advocates on one hand—only if there is something valuable to be gained or lost—will teachers and students take the tests seriously and work hard to do their best, thus serving both their own interests and the public interest in higher achievement. Skeptics, on the other hand, worry that such policies may produce harmful consequences for individual students and perhaps for society as a whole.

The Clinton administration's proposal for new voluntary national tests (VNTs)—standardized, large-scale tests of 4th grade reading and 8th grade mathematics achievement—has aroused controversy, in part because of these questions of equity and fairness. But whether or not the VNTs are created, large-scale achievement testing is already a major feature of American education, and it appears to be getting more popular.

Growing Reliance on Standardized Tests

For more than three decades, under Title I of the Elementary and Secondary Education Act of 1965, program evaluation through large-scale testing has been an integral part of federal support for the education of low-achieving children in poor neighborhoods. The minimum competency testing movement, beginning in the 1970s, gave large-scale, standardized achievement tests a visible and popular role in holding students (and sometimes schools) accountable. Such tests are widely used in decisions about promotion and graduation; their role in tracking—that is, assigning students to a course of study based on perceived achievement or skill level—is less clear. Tracking decisions are usually made at the school level, based on multiple sources of evidence.

By the mid-1980s, 33 states had mandated some form of minimum competency testing (Office of Technology Assessment, 1992). A decade later, 18 states had test-based requirements for high school graduation (Bond et al., 1996). In many states, both schools and students are held accountable for achievement-test performance. Almost all states administer standardized assessments in several core areas and report findings at the school level; in most states, these findings are supplemented by state-representative samples from the National Assessment of Educational Progress (NAEP). In almost half the states, students' test performance can have serious consequences for their schools, including funding gains or losses, loss of autonomy or accreditation, and even external takeover (Bond et al., 1996). In some places, like Chicago, the same achievement test is used both to hold schools accountable and to make individual student promotion decisions.

The political debate about voluntary national testing has focused on the inevitable tensions between uniform national standards and traditions of state and local school governance. But other important questions about the VNT proposal have been raised: Do we need new tests to hold American students to uniform high standards, or could the results of existing tests be reported in a common metric? The VNT proposal calls for public release of all test items soon after the administration of each test, but can new tests be developed each year that will meet high technical demands—for validity, reliability, fairness, and comparability? How should the VNT or similar tests be designed in order to measure achievement accurately and encourage higher academic performance by all students? How can potential misuses of the VNT or other tests be identified, remedied, and prevented?

These issues have been considered by the Congress in its deliberations on voluntary national testing, and it has called on the National Academy of Sciences to carry out studies addressing several of them. ¹ This report addresses the set of questions bearing on the appropriate, nondiscriminatory use of educational tests. Congress has asked the Academy, through its National Research Council, to "conduct a study and make written recommendations on appropriate methods, practices and safeguards to ensure that—

existing and new tests that are used to assess student performance are not used in a discriminatory manner or inappropriately for student promotion, tracking or graduation; and
existing and new tests adequately assess student reading and mathematics comprehension in the form most likely to yield accurate information regarding student achievement of reading and mathematics skills."

The questions the Congress has framed reflect concern about the increasing reliance on tests that have a direct impact on students, including the impact of high-stakes testing on various minority communities and on children with disabilities or whose native language is not English. This study therefore focuses on tests that have high stakes for individual students, although the committee recognizes that accountability for students is related in important ways to accountability for educators, schools, and school systems. Indeed, the use of tests for accountability of educators, schools, and school districts has significant consequences for individual students, for example, by changing the quality of instruction or affecting school management and budgets. Such indirect effects of large-scale assessment are worth studying in their own right. This report is intended to apply to all schools and school systems in which tests are used for promotion, tracking, or graduation.

¹ Results of the study addressing questions related to the feasibility of a common reporting metric appear in Uncommon Measures: Equivalence and Linkage Among Educational Tests (National Research Council, 1999b). A third study, Evaluation of the Voluntary National Tests: Phase 1 (National Research Council, 1999a), is an evaluation of the first year of the VNT development process.

Ensuring Appropriate Use of Tests

Large-scale cognitive testing has always been controversial. On one hand, standardized testing promises to hold all students to the same standards, appealing to widely held values of fairness and equity. It is also an efficient and highly visible way to assess the progress of students and schools, and to communicate what the public expects of them. As an administrative tool, testing offers a rare economy of scale in school management. On the other hand, tests can be used arbitrarily to sort students into winners and losers, and their validity has thus always been scrutinized and criticized in light of the benefits and costs to test takers and other interested parties. Tests have been used improperly to make decisions about which they provide little or no valid information. Occasionally, tests have provided a cover for arbitrary or discriminatory decisions made with little or no reference to test performance. As the Office of Technology Assessment noted in its 1992 report, "Everyone may agree that testing can be a wedge, but some see the wedge forcing open the gates of opportunity while others see it as the doorstop keeping the gates tightly shut" (p. 8).

Efforts to regulate test use have been based on two principal mechanisms: professional norms, including education and self-regulation, and legal action, including legislation, regulation, and litigation (Office of Technology Assessment, 1992). Through its Mental Measurements Yearbooks, the Buros Institute of Mental Measurements has sought to inform test users about appropriate practices for the past 60 years. Several scientific and professional organizations, acting separately and jointly, have issued standards for appropriate test use. These include the Standards for Educational and Psychological Testing (American Educational Research Association et al., 1985) and the Code of Fair Testing Practices in Education (Joint Committee on Testing Practices, 1988).

These standards have been addressed chiefly to those who develop and publish tests, but the mechanisms for enforcing them are inadequate. Moreover, those who actually use tests in states, school districts, and schools are often poorly informed about the standards. Teachers are usually the first-line administrators and users of tests, but they are often not technically prepared to interpret test findings, nor is the public adequately informed about the uses and limits of testing.

Legal action has played a significant but limited role in ensuring the appropriate use of tests. Before the 1960s, there was little litigation in this area. Since then, courts have occasionally limited the use of tests to make high-stakes decisions about individuals, but they have generally been reluctant to limit the professional judgment and discretion of educators (Office of Technology Assessment, 1992:72–74). Most major court interventions have dealt with specific uses of tests that sustained earlier patterns of racial discrimination in Southern schools.

Federal legislation has affected the testing of individual students in two major ways: first, by encouraging or requiring testing, for example, in the Goals 2000: Educate America Act of 1994 and, more significantly, in Title I of the Elementary and Secondary Education Act of 1965;² and, second, by regulating the use of educational tests and information based on them. An example of the latter is the Family Educational Rights and Privacy Act of 1974, commonly known as the Buckley Amendment. It established the rights of parents to inspect school records and limited the release of those records (including test scores) to those with a legitimate educational need for the information.

Under the Rehabilitation Act of 1973, the Americans with Disabilities Act of 1990, and the Individuals with Disabilities Education Act of 1997, children with disabilities are entitled to several important protections when they are tested for placement. Among these are the right to be tested in the language spoken at home; the right to take a test that is not culturally biased; the right to accommodations or modifications based on special needs; and the right to be tested in several different ways, so that no special education placement decision is based on a single test score. These protections have not, however, been extended to other uses of educational tests, such as awarding or withholding a high school diploma.

² Title I of the Elementary and Secondary Education Act of 1965 is the largest federal program in elementary and secondary education, with an annual budget of roughly $8 billion. It is intended to assist low-achieving, disadvantaged students. Title I exerts a powerful influence on schools across the country, particularly in the area of testing. Since the Congress revamped Title I in 1994, the law has required states to develop both challenging standards for student performance and assessments that measure student performance against those standards. The law also states that the standards and assessments should be the same for all students, regardless of whether they are eligible for Title I. This statute is discussed more fully in Chapters 3 and 11.

Uses and Misuses of Standardized Tests

Tests are used in a variety of ways. As elaborated in Chapter 2, they can provide feedback to individual students and their teachers about problems and progress in learning. They can inform administrators and the public about the overall state of learning or academic achievement. They can be used as management tools to make placement or certification decisions about individual students. In many cases it is inappropriate to use the same test for different purposes. Yet that is often what happens.

Consider, for example, how public perceptions about the performance of American schools are formed. They are based in part on personal experience and journalistic anecdotes: the counter clerk at the local store who cannot make change, the business leader's complaint that high school graduates lack basic job skills. But much of the information about academic achievement comes from students' performance on tests, and public opinion about the quality of schooling rises or falls with the latest results from NAEP and the Third International Mathematics and Science Study (TIMSS) (Forgione, 1998).

Test results, like those from NAEP, are based on large, scientifically chosen national samples, and they are repeated periodically. They are designed to provide an overview—a measure of the aggregate performance of a very large number of students. They do not measure the performance of individual students. In fact, the tests are designed so no single student is ever asked the full battery of test questions, and an individual student's results are never released.

This important use of achievement test questions to assess national progress began about 1970, although state and local testing programs date back to the 19th century. Before 1970, there were administrations of achievement tests in well-designed national samples, but these were one-time studies, and they were never extended to compare the performance of all students or of major population groups over time.

Large-scale standardized tests, such as the Scholastic Assessment Test (SAT) and the American College Test (ACT) for college admissions and the Armed Services Vocational Aptitude Battery (ASVAB) for military selection and placement, are valuable decision-making tools. They were not designed to provide information about overall levels of academic achievement for groups of students or changes in them over time. The number of students taking these tests may be very large, but the sample of test takers is far from representative. Public reports based on these tests are therefore often misleading (Hauser, 1998). The annual newspaper reports of average SAT scores that compare students across time or among states are a prime example of inappropriate test use. Test-taking populations vary widely from year to year and from state to state in ways that render such comparisons almost meaningless. Wisconsin regularly tops the list of state average scores on the SAT, mainly because its state colleges and universities require a different test, the ACT, for admission; Wisconsin students who take the SAT are generally those applying to elite out-of-state colleges, so the state's average score is inflated.
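The Wisconsin case is an instance of selection effects in averages, and a small numerical sketch may make the mechanism concrete. The scores, the two states, and the participation cutoffs below are entirely hypothetical and are not drawn from any real SAT data; the only point is that two states with identical student populations can report very different averages when different subsets of students sit the test.

```python
from statistics import mean

# Hypothetical scores for illustration only; not real SAT data.
# Assume both states have exactly the same underlying student population.
all_students = [400, 450, 500, 550, 600, 650, 700]


def average_of_test_takers(scores, participation_cutoff):
    """Return the mean score among students who actually sit the test.

    Simplifying assumption: only students scoring at or above the cutoff
    choose to take the test; everyone else takes a different exam.
    """
    takers = [s for s in scores if s >= participation_cutoff]
    return mean(takers)


# State A: every student takes the test.
state_a_avg = average_of_test_takers(all_students, participation_cutoff=400)

# State B: only students applying to selective out-of-state colleges take it.
state_b_avg = average_of_test_takers(all_students, participation_cutoff=600)

print("State A average:", state_a_avg)
print("State B average:", state_b_avg)
```

Under these assumptions State A averages 550 and State B averages 650, even though the students in the two states are identical. The gap reflects who took the test, not what students know.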

This kind of test misuse dates back at least to the mass ability testing of military recruits in World War I (the Army Alpha and Beta tests), which were used by some to disparage blacks and new immigrant groups. However large the scale of such tests, their main purpose is to make decisions about individuals, not to inform the public. They have never provided accurate assessments of scholastic achievement or aptitude in the general population (Hauser, 1998). Accurate descriptions of populations, based on valid tests and samples, are a valuable tool of public policy; inaccurate descriptions of populations are serious misuses of tests because of their possible social, political, and economic consequences.

Although the use of tests to describe populations is important, the committee, responding to the Congress's charge, has focused primarily on the use of tests to make high-stakes decisions about individual students. These decisions also have broad and long-lasting consequences for population groups. Tests may be used appropriately or inappropriately—either to create opportunities, or to deny them.

It is helpful to keep in mind that standardized tests have often been used historically to promote equal opportunity. In the early 1930s, the Wisconsin State Testing Service gave a standard test of academic ability to all graduating high school seniors and sent the names of high-scoring students to the state's colleges and universities, so they could identify academically promising recruits. In later years, the testing program was expanded to lower grades, to identify promising students who might need greater academic encouragement.

In some cases, test uses that might have created obstacles to attainment may have led to improved academic performance and enhanced opportunities. Minority advocates feared that the minimum academic requirements imposed by the National Collegiate Athletic Association on aspiring college athletes (known as Proposition 48) would reduce minority college opportunities, but Klein and Bell (1995) found that the higher standards actually had little effect on minority recruitment and led to higher graduation rates among minority athletes. Klein and Bell argue that student athletes apparently studied harder in school and took courses that would prepare them better for college. This is a potentially important positive exemplar of test use because the introduction of higher standards through testing parallels broader proposals for standards-based educational reform, including some of the hopes for VNTs.

History provides equally striking examples of the actual or potential misuse of standardized tests to make decisions about individuals. Unhappy with the increasing numbers of immigrants living in New York City, the president of Columbia University in 1917 embraced the use of the Thorndike Tests for Mental Alertness "to limit the number of Jewish students without a formal policy of restriction" (Crouse and Trusheim, 1988:20). In one well-known California case (Larry P. v. Riles, 1984), the court found that inadequately validated IQ tests had been used to discriminate against black schoolchildren, who were assigned disproportionately to classes for the educable mentally retarded, and that California's classes for such students were often an educational dead end. In a Florida case, the state was enjoined from using a high school graduation test because black students, forced to attend segregated, inferior schools, had not been taught the material covered in the test (Debra P. v. Turlington, 1981). And in Rockford, Illinois, testing was recently used to rationalize the assignment of some black high school students to lower tracks, even when their test scores were higher than the scores of some whites assigned to higher tracks (People Who Care v. Rockford Board of Education, 1997).

The case of Debra P. offers an especially clear illustration of a crucial distinction between appropriate and inappropriate test use. Is it ever appropriate to test students on material they have not been taught? Yes, if the test is used to find out whether the schools are doing their job. But if that same test is used to hold students "accountable" for the failure of the schools, most testing professionals would find such use inappropriate. It is not the test itself that is the culprit in the latter case; results from a test that is valid for one purpose can be used improperly for other purposes.

In the examples above, it seems easy with the advantage of hindsight to identify the appropriate and inappropriate uses of tests. In practice it is often not at all obvious, and the judgment may well depend on the position of the observer. Some population groups see their low scores on achievement tests as a stigmatizing and discriminatory obstacle to educational progress. Other groups, with high scores on the same tests, view their performance as a sign of merit that opens doors to learning and success. The judgments become harder when one cannot predict the behavioral effects of testing, as in the case of Proposition 48. How does one know whether a high-stakes test use is appropriate or not?

How the Committee Approached Its Task

The charge to the committee from the Congress was potentially massive in scope. The three high-stakes policies under scrutiny—tracking, promotion (and its opposite, retention in grade), and graduation (and its opposite, withholding of the diploma)—are themselves complex and controversial practices. Researchers and policymakers disagree about their effectiveness. Where the research evidence on specific practices is strong, our findings are based on that evidence. But in general, the committee has had neither the time nor the resources to investigate broader educational policy issues. Nevertheless, these issues remain critical. Our specific findings about the appropriate uses of tests should be read with the understanding that retention in grade, tracking, and the withholding of diplomas are decisions that have very significant effects on the lives of students and that those decisions will be made with or without the use of tests.

Public understanding of decisions about tracking, promotion, and graduation is poorly served when they are portrayed simplistically as either-or propositions. The simple alternative to social promotion, for example, is retention: making students repeat the grade with the same curriculum they have just failed. But the available evidence suggests that simple retention only compounds the problem: it produces lower achievement and an increased likelihood that the student will eventually drop out of school. Social promotion and simple retention are really only two of several strategies available to educators when tests and other information show that students are experiencing serious academic difficulty. Other strategies may be more successful in promoting learning and reducing the need for either-or choices. These include early identification of students who are not learning, coupled with the assistance these students need to meet standards for promotion. Therefore, the committee believes that this kind of high-stakes test use should always be part of a larger set of strategies aimed at identifying and addressing educational problems when they are most susceptible to intervention and before they lead to negative consequences for students. Test users should consider a wide range of interventions with students who perform poorly.

The committee organized the study by defining "appropriateness" and establishing three criteria for judging whether a test use meets the definition. In our deliberations, we have assumed that the use of tests in decisions about student promotion, tracking, and graduation is intended to serve educational policy goals, such as setting high standards for student learning, raising student achievement levels, ensuring equal educational opportunity, fostering parental involvement in student learning, and increasing public support for the schools.

The three criteria for judging the appropriateness of a particular test use correspond to three broad criteria identified in a previous National Research Council study of the use and misuse of tests (National Research Council, 1982):

Measurement validity: Is the test appropriate for a particular purpose? Is there evidence that the constructs to be measured are relevant in making a decision? Does the test measure those constructs? Is it confounded with other constructs that are not relevant to the decision? Is the test reliable and accurate?
Attribution of cause: Does a student's performance on a test reflect knowledge and skill based on appropriate instruction, or is it attributable to poor instruction? Or is it attributable to factors such as language barriers or disabilities that are irrelevant to the construct being measured?
Effectiveness of treatment: Does performance on the test lead to placements or other decisions that are educationally beneficial and well matched to the student's needs?

The committee has applied each of these standards to the uses of testing that we have examined. A full investigation of the third standard, as noted above, would require an effort that exceeds the committee's resources.

Determining whether the use of tests for promotion, tracking, and graduation produces better overall educational outcomes requires that the intended benefits of the policy be weighed against unintended negative consequences. These costs and benefits must also be balanced with those of making high-stakes decisions about students in other ways, without tests. Moreover, the committee recognizes that test policies may have negative consequences for some students even while serving important social or educational policy purposes. Perhaps some would be willing to accept, for example, that some students will be harmed, not helped, by a strict rule linking promotion with getting a certain test score, if that policy leads to increased public confidence and support for the schools. The committee takes no position on the wisdom of such a trade-off; but it is our view that policymakers should fully understand what is at stake and who is most likely to be harmed.

The Congress also asked the National Academy of Sciences to consider whether "existing and new tests adequately assess student reading and mathematics comprehension in the form most likely to yield accurate information regarding student achievement of reading and mathematics skills." This could refer to a wide range of issues, including, for example, the balance of multiple-choice and constructed-response items, the use of student portfolios, the length and timing of the test, the availability of calculators or manipulatives, and the language of administration. However, in considering test form, the committee has chosen to focus on the needs of English-language learners and students with disabilities, in part because these students may be particularly vulnerable to the negative consequences of large-scale assessments. (In the literature, English-language learners have been known as "limited-English-proficient students." We adopt the current nomenclature in referring to this group.) We consider, for these students, in what form and manner a test is most likely to measure accurately a student's achievement of reading and mathematics skills.

Two policy objectives are key for these special populations: one is to increase their participation in large-scale assessments, so that school systems can be held accountable for their educational progress. The other is to test each such student in a manner that accommodates for a disability or limited English proficiency to the extent that either is unrelated to the subject matter being tested, while still maintaining the validity and comparability of test results among all students. These objectives are in tension, and thus present serious technical and operational challenges to test developers and users.

Organization and Limits of the Report

The remainder of Part I provides a broad review of the background and context of large-scale standardized achievement testing with high stakes for individual students. Chapter 2 reviews the policy context and frameworks of testing, including the history of test use, the several purposes of testing, the place of testing in current public policy debates, and the perceptions of testing by the public. Chapter 3 summarizes the legal issues in test use, reviewing litigation in which testing was alleged to have been used in a discriminatory fashion or in violation of due process and discussing the legal requirements for curriculum and assessment created by the 1994 reauthorization of Title I of the Elementary and Secondary Education Act. Chapter 4 reviews key concepts in testing as a process of psychological measurement, including validity, reliability, and fairness.

Part II examines the uses of tests for making high-stakes decisions about individual students. Three chapters focus on specific practices: tracking and placement (Chapter 5), promotion and retention (Chapter 6), and awarding or withholding high school diplomas (Chapter 7). In each of these chapters, the committee has investigated the ways in which tests have been used to make decisions about students. It has considered the purposes of each policy and the conditions under which tests can appropriately be used to further those purposes. It has reviewed evidence about the use of tests to make each kind of decision and about the educational consequences of those decisions. It has also looked for examples of test-based decision making that improve on the traditional options in each type of decision.

In the next two chapters, the committee focuses on special groups of students: those with disabilities (Chapter 8) and English-language learners (Chapter 9). In the committee's judgment, however, the issues affecting these students cannot be separated from the larger questions of test use for tracking, promotion, and graduation. In Chapter 10, the committee considers whether it would be appropriate to make tracking, promotion, or graduation decisions about individual students based on their VNT scores.

Part III turns to methods of ensuring the appropriate use of tests for making high-stakes decisions about individuals. Chapter 11 reviews the history of professional norms and legal action in the social control of test use and offers several options for improving test use. Chapter 12 presents the committee's findings and recommendations.

Throughout its work, the committee has observed that statements about the benefits or harms of achievement testing often go beyond what the evidence will support. On one hand, blanket criticisms of standardized testing are mistaken. When tests are used in ways that meet technical, legal, and educational standards, students' scores provide important information that, combined with information from other sources, can promote both learning and equal opportunity. On the other hand, tests can reinforce and legitimize biases and inequalities that persist in American society and its schools. Used improperly, tests can have serious negative consequences—for individuals, particular groups, and society as a whole. Test developers and test users therefore bear a heavy responsibility to ensure that tests are used appropriately and without discrimination.

The committee has used many sources of information to prepare this report. Initially, we looked for evidence in the scientific and professional literature of testing and of educational practice and in reports of major test publishers and of federal statistical agencies. We have relied on reports of public and professional groups and on the existing and draft standards for appropriate test use of the major educational and psychological organizations. We also analyzed data, in particular pertaining to student promotion and retention. We have interviewed educational administrators in several large school districts, and we have solicited information from state education agencies. Finally, we held a workshop in which committee members were able to discuss the uses of large-scale assessments with educators in national, regional, state, and local agencies and jurisdictions.

The appropriate use of tests is a complex and multifaceted issue. It raises many problems, and they have many solutions. In its short life, the committee has attempted to identify key issues in high-stakes testing, to review and assess current uses of testing in key educational decisions about individual students, and to suggest ways of improving the use of tests to ensure better outcomes. We have necessarily had to limit the scope of our inquiry and, in particular, we have identified the consequences of certain kinds of decisions as a critical arena for educational policy. When educators and parents make decisions about tracking, promotion, and graduation, in many parts of the nation the current range of options may not be those that best serve the interests of students. In the committee's view, new policy options should be explored and their consequences for educational outcomes should be evaluated.

References

American Educational Research Association, American Psychological Association, and National Council on Measurement in Education 1985. Standards for Educational and Psychological Testing. Washington, DC: American Psychological Association.

Bond, L.A., D. Braskamp, and E.D. Roeber 1996. The Status of State Student Assessment Programs in the United States. Oak Brook, IL: North Central Regional Educational Laboratory and Council of Chief State School Officers.

Crouse, James, and Dale Trusheim 1988. The Case Against the SAT. Chicago, IL: University of Chicago Press.

Forgione, P.D., Jr. 1998. Achievement in the United States: Progress Since A Nation at Risk? Washington, DC: Center for Education Reform and Empower America.

Hauser, R.M. 1998. Trends in black-white test score differences: I. Uses and misuses of NAEP/SAT data. Pp. 219–249 in The Rising Curve: Long-Term Gains in IQ and Related Measures, Ulric Neisser, ed. Washington, DC: American Psychological Association.

Joint Committee on Testing Practices 1988. Code of Fair Testing Practices in Education. Washington, DC: National Council on Measurement in Education.

Klein, S.P., and R.M. Bell 1995. How will the NCAA's new standards affect minority student-athletes? Reprinted from Chance Summer 8(3):18–21.

National Research Council 1982. Placing Children in Special Education: A Strategy for Equity, K.A. Heller, W.H. Holtzman, and S. Messick, eds. Committee on Child Development Research and Public Policy. Washington, DC: National Academy Press.

National Research Council 1999a. Evaluation of the Voluntary National Tests: Phase 1, L.L. Wise, R.M. Hauser, K.J. Mitchell, and M.J. Feuer, eds. Board on Testing and Assessment. Washington, DC: National Academy Press.

National Research Council 1999b. Uncommon Measures: Equivalence and Linkage Among Educational Tests, M.J. Feuer, P.W. Holland, B.F. Green, M.W. Bertenthal, and F.C. Hemphill, eds. Committee on Equivalency and Linkage of Educational Tests, Board on Testing and Assessment. Washington, DC: National Academy Press.

Office of Technology Assessment 1992. Testing in American Schools: Asking the Right Questions. OTA-SET-519. Washington, DC: U.S. Government Printing Office.

Legal References

Debra P. v. Turlington, 474 F. Supp. 244 (M.D. Fla. 1979); aff'd in part and rev'd in part, 644 F.2d 397 (5th Cir. 1981); rem'd, 564 F. Supp. 177 (M.D. Fla. 1983); aff'd, 730 F.2d 1405 (11th Cir. 1984).
