Executive Summary

The issues surrounding comparability and equivalency of educational assessments, although not new to the measurement and student testing literature, received broader public attention during congressional debate over the Voluntary National Tests (VNT) proposed by President Clinton in his 1997 State of the Union address. If there is any common ground shared by the advocates and opponents of national testing, it is the potential merits of bringing greater uniformity to Americans' understanding of the educational performance of their children. Advocates of the VNT argue that this is only possible through the development of a new test, while opponents have suggested that statistical linkages among existing tests might provide a basis for comparability.

To help inform this debate, Congress asked the National Research Council (NRC) to study the feasibility of developing a scale to compare, or link, scores from existing commercial and state tests to each other and to the National Assessment of Educational Progress (NAEP). This question, stated in Public Law 105-78 (November 1997), was one of three, stemming from the debate over the VNT, that the NRC was asked to study. Under the auspices of the Board on Testing and Assessment, the NRC appointed the Committee on Equivalency and Linkage of Educational Tests in January 1998.
Key Issues

The committee faced a relatively straightforward question: Is it feasible to establish an equivalency scale that would enable commercial and state tests to be linked to one another and to the National Assessment of Educational Progress (NAEP)? The committee has reviewed research literature on the statistical and technical aspects of creating valid links between tests and on how the content, use, and purposes of educational testing in the United States influence the quality and meaning of those links. We issued an interim report in June 1998.

Testing experts have long used various statistical calculations, or linking procedures, to connect the scores from one test with those of another—in other words, to interpret a student's score on one test in terms of the scores on a test the student has not taken. A common analogy for linking tests is the formula used to convert Celsius temperatures to the Fahrenheit scale: for Americans traveling to Europe, it pays to know that 30 degrees is quite warm, not 2 degrees below freezing. Indeed, in some tightly circumscribed cases, linkage across tests is not very different. For example, equating is used to make alternate forms of the Scholastic Assessment Test (SAT) equivalent, so that college admissions officers are sure that a score of 600 means much the same thing regardless of which form of the SAT a student took (because a different form of the SAT is given at each major test administration). But in most cases, especially those that motivate this report, linking test scores in a useful way involves more complex considerations than conversions of temperature or equating nearly identical tests across their multiple forms. For example, clusters of states are looking at possible linkages to stimulate greater comparability among scores on the state tests and between those scores and NAEP.
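The contrast drawn above can be made concrete. Temperature conversion is an exact linear transformation; the linear equating of parallel test forms has the same algebraic shape, mapping a score to the same relative position in another form's score distribution by matching means and standard deviations. The following sketch is illustrative only; the score values and distribution parameters are hypothetical, not drawn from the report.

```python
def celsius_to_fahrenheit(c):
    """Exact linear conversion: both scales measure the same quantity,
    so the transformation loses no meaning."""
    return c * 9 / 5 + 32


def linear_equate(x, mean_x, sd_x, mean_y, sd_y):
    """Linear equating: map score x from form X onto form Y's scale by
    matching the two score distributions' means and standard deviations.
    Unlike the temperature conversion, this is an approximation, and it
    is defensible only when the forms are built to measure the same
    content in the same way."""
    return (x - mean_x) / sd_x * sd_y + mean_y


# 30 degrees Celsius is a warm 86 degrees Fahrenheit, not 2 below freezing.
warm = celsius_to_fahrenheit(30)    # 86.0

# Hypothetical forms: a 600 on form X (mean 500, SD 100) sits one standard
# deviation above the mean, so it maps to 60 on form Y (mean 50, SD 10).
linked = linear_equate(600, 500, 100, 50, 10)    # 60.0
```

The formulas are identical in form; what differs, as the report argues, is the justification behind them. The temperature conversion is true by definition, while the equating function borrows its meaning from strong assumptions about the two tests that rarely hold for tests built to different specifications.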
These situations require linking tests that do not meet the strict requirements for equating and must take into account an array of complicated and complicating factors such as definition of educational goals, uses of tests, and varied emphasis on the multiplicity of skills and knowledge that comprise mastery in different subject areas. In evaluating the feasibility of linkages, the committee focused on the linkage of various 4th-grade reading tests and the linkage of various 8th-grade mathematics tests (the topics and grades designated in the VNT proposal). We concentrated on factors that affect the validity of the inferences about student performance that users would draw from the linked test scores. We note that it is often possible to calculate arithmetic
linkages that create misleading interpretations of student performance. To cite an extreme case, one could create a formula to link a reading test and a mathematics test, but the resulting scores would be ambiguous, since mathematics performance cannot be interpreted in terms of the skills used in reading. Even in less extreme situations, links between tests that differ in less dramatic ways can produce scores that are substantially misleading. Moreover, a link between two specific tests may be appropriate for one purpose, but unacceptable for others. Thus, linkage between tests involves factors that are not apparent in the analogy with linking temperature scales. These factors might be relevant whether 2 tests—or 200—are being linked. A difference between tests on any one of these factors, though not always sufficient to disqualify the proposed linkage, signals a warning about misinterpretations that may result.

Assumptions

In approaching its charge, the committee made three key assumptions. First, the question motivating the study is predictable and sensible. It manifests a historical tension in the American educational system between a belief that curriculum, instruction, and assessment are best designed and managed at the state and local levels and a desire to bring greater uniformity to the reporting of information about student achievement in the nation's diverse educational system.

Second, though Congress was not explicit about the purposes of linkage, we recognize that the study originated in the debate over President Clinton's proposal for national tests of reading and mathematics. But the committee's charge is a narrowly defined and technical one, namely, to evaluate the feasibility of developing a scale to compare individual scores on existing tests to one another and to NAEP. Some of our findings are directly relevant to technical aspects of the VNT, for example, the requirement that it be linked to NAEP.
And the committee acknowledges that a key underlying issue in the debate over the VNT is the utility of nationally comparable information on individual student achievement. However, the committee has no position on the overall merits of the VNT, and in making conclusions about the feasibility of linking existing tests we do not intend to suggest either that the nation should or should not have national tests. Neither policy decision follows inevitably from our basic conclusions about linkage and equivalency.

Third, we adopted a definition of "feasibility" that combines validity
and practicality. Validity is the central criterion for evaluating any inferences based on tests and is applied in this report to inferences based on linkages among tests. By practicality we mean not only whether linkages can be calculated, in the arithmetic sense, but whether the costs of carrying out the linkages are reasonable and manageable.

Conclusions

In drawing our conclusions, the committee acknowledges that, ultimately, policy makers and educators must take responsibility for determining the degree to which they can tolerate imprecision in testing and linking. In other words: test-based decisions involve error, linkage can add to the error, and we realize that responsible people may reach different conclusions about the minimally acceptable level of precision in linkages that are intended to serve various goals. Our role is to provide science-based information on the possible sources and magnitude of the imprecision, in the hope that alerting educators and policy makers to the possibility of errors and their consequences will prove useful.

In the committee's interim report, we reached two basic conclusions:

1. Comparing the full array of currently administered commercial and state achievement tests to one another, through the development of a single equivalency or linking scale, is not feasible.

2. Reporting individual student scores from the full array of state and commercial achievement tests on the NAEP scale and transforming individual scores on these various tests and assessments into the NAEP achievement levels are not feasible.

We reached these conclusions despite our appreciation of the potential value of a technical solution to the dual challenges of maintaining diversity and innovation in testing while satisfying growing demands for nationally benchmarked data on individual student performance. We have now considered two additional issues relevant to the committee's charge.
First, we have examined whether it is feasible to link smaller subsets of tests, other than the existing "full array," and to use these linkages to make meaningful comparisons of student performance. Second, we have studied in greater depth the questions involved in reporting individual scores from any test on the NAEP scale and in terms of the NAEP achievement levels.
On these questions our level of optimism is not much higher. We find that simply reducing the number of tests under consideration does not necessarily increase the feasibility of linkage unless the tests to be linked are very similar in a number of important ways. We also find that interpreting the scores on any test in terms of the NAEP achievement levels poses formidable technical and interpretive challenges. Therefore, the committee has reached the following two additional conclusions:

3. Under limited conditions it may be possible to calculate a linkage between two tests, but multiple factors affect the validity of inferences drawn from the linked scores. These factors include the content, format, and margins of error of the tests; the intended and actual uses of the tests; and the consequences attached to the results of the tests. When tests differ on any of these factors, some limited interpretations of the linked results may be defensible while others would not.

4. Links between most existing tests and NAEP, for the purpose of reporting individual students' scores on the NAEP scale and in terms of the NAEP achievement levels, will be problematic. Unless the test to be linked to NAEP is very similar to NAEP in content, format, and uses, the resulting linkage is likely to be unstable and potentially misleading. (The committee notes that it is theoretically possible to develop an expanded version of NAEP that could be used in conducting linkage experiments, which would make it possible to establish a basis for reporting achievement test scores in terms of the NAEP achievement levels. However, the few such efforts that have been made thus far have yielded limited and mixed results.)