Embedding Questions: The Pursuit of a Common Measure in Uncommon Tests

5 Conclusions

The type of embedding that the committee considered to be most central to its charge is including parts of a national assessment in state assessment programs in order to provide individual students with national scores that are (1) comparable with the scores that would have been obtained had the national assessment been administered to them in its entirety and (2) comparable from state to state. The embedded material could be generated from fixed portions of a national assessment or it could comprise test questions chosen by state policy makers. The national scores could be obtained either with or without statistical linkage between the embedded material and the questions in the state assessment.

CONCLUSION 1: Embedding part of a national assessment in state assessments will not provide valid, reliable, and comparable national scores for individual students as long as there are (1) substantial differences in content, format, or administration between the embedded material and the national test that it represents or (2) substantial differences in context or administration between the state and national testing programs that change the ways in which students respond to the embedded items.
National scores that are derived from an embedded national test or test items are likely to be both imprecise and biased, and the direction and extent of bias are likely to vary in important ways, for example across population groups and across schools with different curricula.

The impediments to deriving valid, reliable, and comparable national scores from embedded items stem from three sources: differences between the state and national tests; differences between the state and national testing programs, such as the procedures used for test administration; and differences between the embedded material and the national test from which it is drawn.

When the state and national tests differ substantially in emphasis (content, format, difficulty, etc.), performance on the embedded material may be appreciably different when it is included with the state test than it is in the national test. That is, performance may be influenced by the different context in which items are presented. As a result, seemingly similar levels of performance are likely to have different meanings.

Inferences about individual performance from embedded test material similarly could be substantially distorted by many differences between the national and state testing programs in administration and context, regardless of the characteristics of the two tests and the embedded items. Under the rubric of "administration and context" we include: differences in the time of year at which the test is administered; differences in test context (i.e., the surrounding test material); differences in the broader context (such as differences in motivation stemming from high stakes); differences in assessment accommodations for students with special needs; and differences in actual test administration, such as the behavior of proctors. The effects of some of these differences can be large.
Aggregated scores from embedded material could also be biased by differences in the inclusion of students with disabilities or limited proficiency in English, as well as other differences in the percentages of students actually tested.

Other impediments stem from the nature of the embedded material itself. When only modest amounts of material from a test are embedded, the resulting scores are likely to be unreliable. Moreover, modest selections of material from the national test may fail to represent the national test adequately, which could bias interpretations of performance on the embedded material. This bias would likely affect some individuals and states more than others. We agree with the conclusions in Uncommon Measures (National Research Council, 1999c) that statistical linkage will not suffice to overcome the limited amount and likely unrepresentativeness of embedded test materials. As differences in emphasis among tests are reduced, this fundamental obstacle will shrink, but so will the need for embedding.

It is important to note that while some of these impediments to obtaining adequate scores are tractable, others are not. For example, states could time their own assessments to match the timing of the national assessment that is the source of embedded material, to resolve problems stemming from differences in timing. But differences in use, motivation, and test security could prove insurmountable obstacles to providing comparable scores.

Another threat to inferences based on embedding is particularly important in an era of test-based accountability: the likely changes over time in the relationship between the state and the national test. In Uncommon Measures (National Research Council, 1999c), this problem was discussed in terms of the instability of linkages, but it extends beyond linking and can affect inferences from embedded material even in the absence of statistical linkage. To some extent, this problem may arise even in the absence of high stakes: for example, changes in student populations, unintended and intended changes in the design of assessments, and other unmeasured factors may cause a shift in the scale of measurement, so that it becomes either easier or harder to attain a given score. However, high stakes may greatly increase the instability of any concordance between the state and national tests. Under such circumstances, assuming that the performance on embedded material has a stable relationship to performance on the parts of the national test that are not administered would lead to biased estimates of performance gains.

Criterion-referenced inferences pose a particular difficulty for embedding.
Criterion-referenced conclusions, including those expressed in terms of performance standards such as the NAEP achievement levels, entail inferences about the specific knowledge and skills that students exhibit at each performance level. To the extent that embedded material is abridged or unrepresentative of the national test, these inferences may be particularly difficult to support on the basis of performance on the embedded material.

Because of the large number of obstacles to success and the intractability of some of them, the committee does not offer recommendations for making these forms of embedding more successful. Rather, the committee concludes that under most circumstances, embedding should not be used as a method of estimating scores for individual students on a national test that is not fully administered.

Under certain circumstances, however, an alternate approach may provide adequate national scores.

CONCLUSION 2: When a national test designed to produce individual scores is administered in its entirety and under standard conditions that are the same from state to state and consistent with its standardization, it can provide a national common measure. States may separately administer custom-developed, state items close in time with the national test and use student responses to both the state items and selected national test items to calculate a state score. This approach provides both national and state scores for individual students and may reduce students' testing burdens relative to the administration of two overlapping tests.

This approach assumes that the state items are neither physically embedded in the national test nor administered at precisely the same time and therefore will not generate context effects that alter performance on the national test. It differs from the situation discussed above in several key respects. Because the national assessment is administered completely and under standard conditions, many of the threats to comparability of national scores (such as context effects, differences in timing, and differences in administration) may be avoided.

It is important to note, however, that this approach does have limitations. It becomes less and less efficient as differences between the national test and state standards and test specifications grow larger. It provides a national measure only for states that use the same national test; different national tests can provide results that are not comparable.
Moreover, depending on the design of the assessment and the uses to which it is put, it is vulnerable to some other threats to comparability, such as inflation of scores from coaching and bias from differences in the exclusion of low-scoring groups. If administrative conditions differ, performance on the national items that contribute to state scores could be different than it would be if they were administered with the state items. The committee did not deliberate about the effects of this approach on the quality of state scores.
CONCLUSION 3: Although embedding appears to offer gains in efficiency relative to administering two tests and does reduce student testing time, in practice it is often complex and burdensome and may compromise test security.

The relative efficiency of embedding must be evaluated on a case-by-case basis and depends on many factors, including the length of the embedded test, required changes in administration practices at the state level, and differing regulations about which students are tested or excluded. In addition, states must weigh the costs and benefits that are associated with any embedding approach.

The committee was able in the time available to consider only briefly the use of embedding to obtain aggregated information rather than to obtain information about individual students. Thus, we do not offer a conclusion on such uses, but rather, a tentative finding.

It appears that under some conditions and for some purposes, it may be possible to use embedding to support conclusions other than those pertaining to the performance of individual students. For example, embedding may be a feasible means of benchmarking state standards to national standards in terms of difficulty. That is, it may be practical to find out through embedding whether a state's standards are comparable in difficulty to a set of national standards. This is a relatively undemanding inference, however, because it does not necessarily imply that the state and national assessments are actually measuring similar things or that the particular individuals or schools that score well on one would consistently score well on the other. In other words, it does not entail estimating performance on the national test that is not fully administered.
The extent to which embedding would provide valid estimates of aggregated national scores of groups of students—such as schools, districts, or states—on a national test that is not fully administered remains uncertain. Aggregation does lessen the effects of certain types of measurement error that contribute to the unreliability of scores for individual students. Many of the impediments to embedding discussed by the committee, however, vary systematically among groups, such as differences in rules for the use of accommodations and differences in the contexts provided by state tests, and aggregation will not alleviate the distortions caused by these factors.
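
The contrast between error that aggregation reduces and distortion that it does not can be illustrated with a small numerical sketch. This is not the committee's analysis. It uses standard classical test theory: the Spearman-Brown formula for the reliability of a shortened test, the usual standard error of measurement, and the 1/sqrt(n) shrinkage of random error in a group mean. All numbers (a full-test reliability of 0.90, one quarter of the items embedded, a score standard deviation of 10, and a shared bias of 3 points) are hypothetical values chosen only for illustration, and the function names are ours. The sketch considers measurement error only and ignores sampling of students.

```python
import math


def spearman_brown(full_reliability: float, length_ratio: float) -> float:
    """Predicted reliability when test length is multiplied by length_ratio.

    Classical Spearman-Brown formula. It assumes the retained items are
    parallel to the ones dropped, which an embedded selection may not be.
    """
    return (length_ratio * full_reliability) / (
        1.0 + (length_ratio - 1.0) * full_reliability
    )


def standard_error_of_measurement(score_sd: float, reliability: float) -> float:
    """Classical-test-theory standard error for an individual score."""
    return score_sd * math.sqrt(1.0 - reliability)


def group_mean_error(sem: float, n_students: int, systematic_bias: float):
    """Return (random standard error, bias) for a group mean score.

    Independent random measurement error averages out as 1/sqrt(n).
    A bias shared by every student in the group does not shrink at all.
    """
    return sem / math.sqrt(n_students), systematic_bias


# Hypothetical values, for illustration only.
FULL_RELIABILITY = 0.90  # reliability of the full national test
LENGTH_RATIO = 0.25      # one quarter of the items are embedded
SCORE_SD = 10.0          # standard deviation of scores, in scale points
GROUP_BIAS = 3.0         # shared shift, e.g. from context or accommodation rules

short_reliability = spearman_brown(FULL_RELIABILITY, LENGTH_RATIO)
sem = standard_error_of_measurement(SCORE_SD, short_reliability)
print(f"reliability of embedded portion: {short_reliability:.2f}")

for n in (1, 25, 400):
    random_se, bias = group_mean_error(sem, n, GROUP_BIAS)
    print(f"n={n:4d}  random SE={random_se:5.2f}  bias={bias:4.2f}")
```

Under these assumed values the predicted reliability of the embedded portion falls from 0.90 to about 0.69. The random component of error in a group mean shrinks as the group grows, while the shared bias stays the same size at every level of aggregation, which is the pattern the paragraph above describes.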