
5
Conclusions

The Committee on Equivalency and Linkage of Educational Tests was created to answer a relatively straightforward question: Is it feasible to establish an equivalency scale that would enable commercial and state tests to be linked to one another and to the National Assessment of Educational Progress (NAEP)? In this report we have attempted to answer this question by examining the fundamentals of tests and the nature of linking; reviewing the literature on linking, including previous attempts to link different tests; surveying the landscape of tests and testing programs in the United States; and looking at the unique characteristics and qualities of NAEP.

Factors that Affect the Validity of Links

Test Content

A test is a sample of a much larger, more complex body of content, a domain. Test developers must make choices about the knowledge, skills, and topics from the domain they want to emphasize. The choices are numerous in a vast domain like reading or mathematics, where there are differing opinions about what should be taught, how it should be taught, and how it should be tested. Therefore, two state tests labeled "4th-grade reading" may cover very different parts of the domain. One test might ask students to read simple passages and answer questions about the facts and vocabulary of what they read, thereby testing simple recall and comprehension; another test might ask students to read multiple texts and make inferences that relate them, thereby testing analytic and interpretive reading skills.

Tests with different content may measure different aspects of performance and may produce different rankings or score patterns among test takers. For example, students who have trouble with algebra—or who have not yet studied it in their mathematics classes—may do poorly on a mathematics test that places heavy emphasis on algebra. But these same students might earn a high score on a test that emphasizes computation, estimation, and number concepts, such as prime numbers and least common multiples. When content differences are significant, scores from one test provide poor estimates of scores on another test: any calculated linkage between them would have little practical meaning and would be misleading for many uses.

Test Format

Tests are becoming more varied in their formats. In addition to multiple-choice questions, many state assessments now include more open-ended questions that require students to develop their own responses, and some include performance items that ask students to demonstrate knowledge by performing a complex task. Computer-based testing is another alternative format that has gained in popularity in recent years. The effects of format differences on linkages are not always predictable, and they are sometimes large (see, e.g., Shavelson et al., 1992).

Measurement Error

Every test is only a sample of a person's performance. If a test taker also took an equivalent, but not identical, test on a different day in a different place, her score would be unlikely to be the same. That is, a test score always has some margin of error (which testing professionals call the standard error of measurement). Measurement error plays a role in the interpretation and use of scores on linked tests. If test A, with a large margin of error, is linked with test B, which is much more precise, the score of a person who took test A still has the margin of error of test A, even when reported in terms of the scale of test B. Students and test users can be misled by this difference in precision. A short test with unreliable (i.e., less precise) scores can seem to have more precision than it actually has if it is reported on the scale of the more reliable test.
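
To make this concrete, here is a minimal sketch in Python (ours, not the report's) of the classical test theory relationship between reliability and the standard error of measurement; the test characteristics are hypothetical.

    import math

    def sem(sd: float, reliability: float) -> float:
        """Standard error of measurement under classical test theory."""
        return sd * math.sqrt(1.0 - reliability)

    # Hypothetical tests: A is short and less reliable; B is longer and more precise.
    sd_a, rel_a = 15.0, 0.80
    sd_b, rel_b = 15.0, 0.95

    sem_a = sem(sd_a, rel_a)  # about 6.7 score points
    sem_b = sem(sd_b, rel_b)  # about 3.4 score points

    # Suppose a linear link maps A scores onto B's scale. The transformed score
    # still carries test A's error, scaled by the slope of the linking line.
    slope = sd_b / sd_a  # 1.0 in this simple case
    print(f"SEM of test A: {sem_a:.1f}")
    print(f"SEM of test B: {sem_b:.1f}")
    print(f"SEM of an A score reported on B's scale: {slope * sem_a:.1f}")

Reporting the A score on B's scale changes its units, not its precision: the margin of error of roughly 7 points comes along with it.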

Test Uses and Consequences

Variations in how tests are used, especially in their consequences, can affect the stability of linkages over time. Many states are using or planning to use tests for high-stakes decisions, such as determining graduation for students, compensation for teachers, and ratings for schools or districts (National Research Council, 1999c). In contrast, other assessments, like NAEP, often have lower stakes for test takers, with no important consequences for individuals. When stakes are low, students may have little incentive to take the test seriously; when they have reason to worry about the consequences of their scores, they are usually more motivated to try harder. When stakes are high, teachers are likely to alter instruction to try to produce higher scores, through such strategies as focusing on the specific knowledge, skills, and formats of that particular test. The strengths and weaknesses of these and other test-based accountability practices are controversial, and they are not the subject of this report. The important point here is that when a high-stakes test is linked with a low-stakes test, the relative difficulty of the two tests is likely to change (i.e., the high-stakes test will appear to become easier as the curriculum becomes aligned with it), and this can affect the stability of a linkage over time.

Evaluating Linkages

All of these factors—content emphases, difficulty, format, measurement error, and uses and consequences—point to the difficulty of establishing trustworthy links among different tests. But the extent to which any of these factors affects a linkage can be determined only by a case-by-case evaluation of specific tests in a specific context. Developers of linkages should look carefully at the differences in content emphases, format, and intended uses of tests before deciding to link them. They should also set targets for the level of accuracy required to support the intended uses of the linkage, and they should conduct empirical studies to determine the accuracy and stability of the linkage. In this report the committee suggests some criteria to be considered as part of this process. One noteworthy criterion is the similarity or dissimilarity of linkage functions developed using data from different subgroups (e.g., gender, ethnicity, race) of students.
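
To illustrate how the subgroup criterion might be checked, the following sketch computes an equipercentile linking function separately for two simulated subgroups and compares them; the data and score scales are hypothetical, and this is one possible check rather than a prescribed procedure.

    import numpy as np

    def equipercentile_link(a, b, grid):
        """For each test A score in grid, return the test B score at the same percentile rank."""
        pct = np.array([np.mean(a <= x) for x in grid])
        return np.quantile(b, pct)

    rng = np.random.default_rng(0)
    # Hypothetical score samples for two subgroups on test A and test B.
    g1_a, g1_b = rng.normal(50, 10, 2000), rng.normal(200, 25, 2000)
    g2_a, g2_b = rng.normal(45, 12, 2000), rng.normal(195, 20, 2000)

    grid = np.linspace(30, 70, 9)
    diff = equipercentile_link(g1_a, g1_b, grid) - equipercentile_link(g2_a, g2_b, grid)

    # Large, systematic gaps between the subgroup linking functions are evidence
    # against a single, population-invariant link between the two tests.
    print(np.round(diff, 1))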

Finally, since linkage relationships can change relatively quickly, especially in high-stakes situations, developers need to continue to monitor linkages regularly to make necessary adjustments to the linking function over time. The research literature is rife with examples of linkages that looked good at first but failed to hold up over time.

NAEP Achievement Levels

Even if two or more tests satisfy the appropriate criteria and prove to be amenable to linkage, linking any or all of them to NAEP poses unique challenges. This is particularly true when the goal of the linkage is to report individual student scores in terms of the NAEP achievement levels—basic, proficient, advanced—established and defined by the National Assessment Governing Board. Problems arise for several reasons.

First, NAEP is designed to estimate and report distributions of student scores for a state, a region, or the nation as a whole; it is not designed to report individual student scores. It uses a matrix sampling technique in which each student answers only a relatively small number of items from the total set; responses are then aggregated to report group results. Such data are quite imprecise at the student level, and they are not well suited for use in standard procedures for linking individual scores (see, e.g., Beaton and Gonzalez, 1995). Most studies that have obtained links with the NAEP scale have prepared a test made from NAEP items, which was then given to individual students who had also taken the test being linked (see, e.g., Williams et al., 1995). Such NAEP stand-in tests must reflect the full content of the NAEP assessment and must also maintain its specific combination of item formats. They must also be administered in a manner as similar to the NAEP procedure as possible. Linking a test to a variant of NAEP that has a different mix of item formats, or a different balance of content, could produce a link whose validity is suspect (see, e.g., Linn et al., 1992).
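
The matrix-sampling point can be illustrated with a simulation. The sketch below is not NAEP's actual scaling machinery (which rests on item response theory and "plausible values"); it only shows, with hypothetical counts, why sparse per-student data support group estimates but not individual scores.

    import numpy as np

    rng = np.random.default_rng(1)
    n_students, items_per_student = 5000, 30  # each student answers ~30 items from a much larger pool

    # Hypothetical true proportion-correct for each student.
    true_p = np.clip(rng.normal(0.60, 0.15, n_students), 0.05, 0.95)
    observed_p = rng.binomial(items_per_student, true_p) / items_per_student

    # The group-level estimate is tight...
    print(f"true group mean {true_p.mean():.3f} vs estimated {observed_p.mean():.3f}")
    # ...but each student's estimate rests on only 30 binary responses, so
    # individual errors are large (roughly sqrt(p*(1-p)/30), about 0.09 here).
    print(f"SD of individual error: {(observed_p - true_p).std():.3f}")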

Second, all test scores, including a NAEP score inferred from a linked test, have associated measurement error: even if a student took a different form of the same basic test, her score on that form might be somewhat higher or lower than the score she obtained on the form she actually did take. The margin-of-error problem is not usually significant for students whose scores fall in the middle of an achievement category. It may be a problem, however, for students whose scores are near the border of two adjacent levels. Some of these students could easily deserve to be in an adjacent category; every teacher knows that a high B and a low A could easily be reversed on another occasion. When NAEP estimates the proportion of students in each category for its reports, such potential classification errors are accounted for. If the linked test is not a close match to NAEP, the classification differences can be substantial. This challenge might be addressed through a special administration of a longer version of NAEP, perhaps by testing students with many more items than they complete in a standard NAEP assessment.
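
A minimal sketch of the border problem, assuming a normal error model and hypothetical numbers (a cut score of 281 between two levels and a standard error of measurement of 10 on the reporting scale):

    from statistics import NormalDist

    def prob_at_or_above(cut, score, sem):
        """P(observed score >= cut) when the observed score is Normal(score, sem**2)."""
        return 1.0 - NormalDist(mu=score, sigma=sem).cdf(cut)

    cut, sem = 281.0, 10.0  # hypothetical cut score and margin of error
    for true_score in (250.0, 278.0, 281.0):
        p = prob_at_or_above(cut, true_score, sem)
        print(f"true score {true_score:.0f}: chance of being classified above the cut = {p:.2f}")

A student far below the cut is almost never misclassified, but a student just under it lands above the cut nearly 40 percent of the time, which is the teacher's high-B/low-A problem in numerical form.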

Third, differences in the formats, or combinations of formats, used in different tests are a special concern. Changing the proportion of multiple-choice items to constructed-response items could place a student in a different achievement level. Any special variant of NAEP designed for use in a linking study must maintain the mix of formats used in NAEP (as specified in the NAEP test specifications).

Overall, the committee urges caution in attempting to link achievement tests to NAEP and to report individual student scores on those tests in terms of the NAEP achievement levels.

Conclusions

Our findings, as summarized above, lead us to the following conclusions:

Comparing the full array of currently administered commercial and state achievement tests to one another, through the development of a single equivalency or linking scale, is not feasible.

Reporting individual student scores from the full array of state and commercial achievement tests on the NAEP scale and transforming individual scores on these various tests and assessments into the NAEP achievement levels are not feasible.

Under limited conditions it may be possible to calculate a linkage between two tests, but multiple factors affect the validity of inferences that may be drawn from the linked scores. These factors include the content, format, and margin of error of the tests; the intended and actual uses of the tests; and the consequences attached to the results of the tests. When tests differ on any of these factors, some limited interpretations may be defensible, while others would not be.

Links between most existing tests and NAEP, for the purpose of reporting individual students' scores on the NAEP scale and in terms of the NAEP achievement levels, will be problematic. Unless the test to be linked is very similar to NAEP in content, format, and uses, the resulting linkage could be unstable and potentially misleading. (The committee notes that it is theoretically possible to develop an expanded version of NAEP that could be used in conducting linkage experiments, which would make it possible to establish a basis for reporting achievement test scores in terms of the NAEP achievement levels. However, the few such efforts made thus far have yielded limited and mixed results.)

The committee arrived at these conclusions even though we believe that the goal of bringing greater coherence to the reporting of student achievement data, without compromising the increasingly rich and innovative tapestry of tests in the United States today, is an understandable one. We respect both the judgments of the states and districts that have produced this diverse array of tests and the desire for more information than current tests can provide. Furthermore, the committee was disposed, as are large segments of the measurement and educational policy communities, to seek a technological solution to the challenge of linking.

Future Research

Despite our pessimism, we believe there are a number of areas where further research could prove fruitful and could help advance the idea of linkage of educational tests. First, we suggest research on the criteria for evaluating the quality of linkages. In its deliberations, the committee identified several such criteria, but we were unable to determine which were the most critical, and we cannot claim to have developed an exhaustive or definitive set of criteria. Additional study, for example of methods for assessing content congruence, could prove beneficial. The work of Kenney and Silver (1997) and of Bond and Jaeger (1993) represents important approaches to the problem. These researchers had to invent methods for establishing the extent to which test contents match; those methods need additional research and development, especially with respect to providing quantitative estimates of congruence that could be used in evaluating (predicting) the validity of proposed linkages.
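
As a hint of what a quantitative congruence estimate could look like, the sketch below computes a crude overlap index between two test blueprints; the categories and weights are entirely hypothetical, and real methods, such as those cited above, rest on expert judgment and much richer item classifications.

    # Each blueprint gives the proportion of items devoted to each content
    # category; these categories and weights are invented for illustration.
    blueprint_a = {"number": 0.40, "algebra": 0.30, "geometry": 0.20, "data": 0.10}
    blueprint_b = {"number": 0.25, "algebra": 0.15, "geometry": 0.30, "data": 0.30}

    categories = blueprint_a.keys() | blueprint_b.keys()
    overlap = sum(min(blueprint_a.get(c, 0.0), blueprint_b.get(c, 0.0)) for c in categories)
    print(f"content overlap index: {overlap:.2f}")  # 1.0 = identical emphasis; here 0.70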

Second, we suggest further research to determine the level of precision needed to make valid inferences about linked tests. We know that two tests built to different content frameworks, or to different test specifications, look at the test taker in two different ways. Each perspective may yield valid information, although not the same information. How important are the differences? Are they so minor that they can be overlooked? Are the biases large enough to lead to misleading interpretations, or are they so small that they are inconsequential, although statistically detectable? And how can one determine what is "consequential"? What kinds of guidelines do policy makers need in order to determine an acceptable level of error? In addressing these questions, the research community could make an important contribution to the policy debate by focusing on the marginal decrements in the validity or precision of inferences that can be attributed to linkage, independent of the imprecision or invalidity attributable to the tests themselves. More research on methods of assessing the quality of linked assessment information would go a long way toward supporting these important judgments.

Finally, we urge further research on the reporting of linked assessment information. The committee found that one way of reporting a student's performance in terms of NAEP achievement levels is to state that, among 100 students who performed at the same level as the student, call her Sally, 10 are likely to be in the below basic category; 60 are likely to be basic; 28 are likely to be proficient; and 2 are likely to be in the highest, or advanced, category.
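
A minimal sketch of how such a statement could be computed, assuming a normal distribution of NAEP performance among students with the same linked score; the cut scores and spread below are hypothetical, chosen only to approximate the pattern just described.

    from statistics import NormalDist

    # Hypothetical thresholds for basic, proficient, and advanced, and a
    # hypothetical distribution of NAEP scores among students like Sally;
    # real values would come from the linking study itself.
    cuts = [262.0, 301.0, 335.0]
    dist = NormalDist(mu=290.0, sigma=22.0)

    bounds = [float("-inf"), *cuts, float("inf")]
    levels = ["below basic", "basic", "proficient", "advanced"]
    for level, lo, hi in zip(levels, bounds, bounds[1:]):
        share = dist.cdf(hi) - dist.cdf(lo)
        print(f"{level:>12}: about {100 * share:.0f} of 100 students")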

While such information may be statistically valid, its utility is questionable. More research might point to ways in which reports from linked tests could provide information that is useful to students, parents, teachers, administrators, and policy makers.
