Embedding Questions: The Pursuit of a Common Measure in Uncommon Tests

Executive Summary

Policy makers are caught between two powerful forces in relation to testing in America's schools. One is increased interest on the part of educators, reinforced by federal requirements, in developing tests that accurately reflect local educational standards and goals. The other is a strong push to gather information about the performance of students and schools relative to national and international standards and norms. The difficulty of achieving these two goals simultaneously is exacerbated by both the long-standing American tradition of local control of education and the growing public sentiment that students already take enough tests.

Finding a solution to this dilemma has been the focus of numerous debates surrounding the Voluntary National Tests proposed by President Clinton in his 1997 State of the Union address. It was also the topic of a congressionally mandated 1998 National Research Council report (Uncommon Measures: Equivalence and Linkage Among Educational Tests), and was touched upon in a U.S. General Accounting Office report (Student Testing: Issues Related to Voluntary National Mathematics and Reading Tests). More recently, Congress asked the National Research Council to determine the technical feasibility, validity, and reliability of embedding test items from the National Assessment of Educational Progress or other tests in state and district assessments in 4th-grade reading and 8th-grade mathematics for the purpose of developing a valid measure of student achievement within states and districts and in terms of national performance standards or scales. This report is the response to that congressional mandate.

CONCEPT AND PURPOSE OF EMBEDDING

Underlying the committee's discussion of embedding there are always two tests, which we identify as the "national test" and the "state test." The national test might be an actual test or testing program like the National Assessment of Educational Progress (NAEP) or one of the commercially available achievement tests, or it might be some other large pool of nationally calibrated test items. Performance on the national test items generates a "national score," the candidate for a common measure of individual student performance. The state test is whatever state or local testing program is already in place, and it produces a "state score" for students that is distinct from the national score.

The goal of embedding is to produce both the national score and the state score without administering two full-length, free-standing tests. Key to achieving that goal is the need for a common measure of student performance. A common measure is a single scale of measurement; scores from tests that are calibrated to this scale support the same inferences about student performance from one locality to another and from one year to the next. A given score indicates the same level of performance, no matter from which test or how the score was obtained. The scores might be obtained from a single test, from different tests that are calibrated to the same scale through linking, from extracts from a single test, or based on estimates of student performance from a matrix-sampled assessment.

Validity is the central criterion for evaluating any inferences based on test scores.
When inferences about students' educational achievements are intended from test results, two things are critical: (1) the test must adequately sample the domain of knowledge and skills that the scores are supposed to represent, and (2) the test must always be administered under the same standardized conditions so that all test takers have the same opportunity to demonstrate what they know. Developing a common measure of individual student performance by inserting an abridged test into the diversity of current state tests creates multiple opportunities for these two conditions to be violated, threatening the validity of most of the inferences that parents, educators, and policy makers want to support with test scores.
The type of embedding that the committee considered to be most central to its charge entails including parts of a national assessment in state assessment programs in order to provide individual students with national scores that are comparable to the scores that would have been obtained had they taken the national assessment in its entirety.

CONCLUSIONS

National scores that are derived from an embedded national test or test items are likely to be both imprecise and biased, and the direction and extent of bias are likely to vary in important ways, e.g., across population groups and across schools with different curricula. The impediments to deriving valid, reliable, and comparable national scores from embedded items stem from three sources: differences between the state and national tests; differences between the state and national testing programs, such as the procedures used for test administration; and differences between the embedded material and the national test from which it is drawn.

CONCLUSION 1: Embedding part of a national assessment in state assessments will not provide valid, reliable, and comparable national scores for individual students as long as there are: (1) substantial differences in content, format, or administration between the embedded material and the national test that it represents; or (2) substantial differences in context or administration between the state and national testing programs that change the ways in which students respond to the embedded items.

If the national assessment is administered in its entirety, close in time with a state assessment, and in a manner that is consistent with its standardization, many of the threats to comparability of national scores (such as context effects, differences in timing, and differences in administration) may be circumvented.
In this situation, if state scores are not intended to be comparable across states, it does not matter that this approach may lead some states to administer their own test material differently than some other states. This approach is not without its limitations, however, and it can affect a state's testing programs in a variety of ways. State policy makers and educators must weigh the advantages, disadvantages, and tradeoffs that are associated with this approach.
CONCLUSION 2: When a national test designed to produce individual scores is administered in its entirety and under standard conditions that are the same from state to state and consistent with its standardization, it can provide a national common measure. States may separately administer custom-developed, state items close in time with the national test and use student responses to both the state items and selected national test items to calculate a state score. This approach provides both national and state scores for individual students and may reduce students' testing burdens relative to the administration of two overlapping tests.

The relative efficiency of embedding must be evaluated on a case-by-case basis and depends on many factors, including the length of the embedded test, required changes in administration practices at the state level, and differing regulations about which students are tested or excluded. States must weigh the costs and benefits that are associated with any embedding approach. However, differences in the time of year for testing, grades and subjects tested, content and format of the national and state tests, rules about assessment accommodations, the stakes associated with test results, and the uses and types of testing aids that are required and provided by different states create a situation that makes embedding items in state and district tests to derive a common measure of individual student performance both complex and burdensome.

CONCLUSION 3: Although embedding appears to offer gains in efficiency relative to administering two tests and does reduce student testing time, in practice, it is often complex and burdensome and may compromise test security.
The committee also considered other purposes for which embedding might be used to obtain aggregate information, i.e., scores of groups of students such as schools, districts, or states, rather than to obtain information about individual students. The extent to which embedding would provide valid estimates of aggregated scores on a national test that is not fully administered remains uncertain. Aggregation does lessen the effects of certain types of measurement error that contribute to the unreliability of scores for individual students. But many of the impediments to embedding are factors that vary systematically among groups, such as differences in rules for the use of accommodations (for students with disabilities or limited English proficiency) and differences in the contexts provided by state tests. Aggregation will not alleviate the distortions in the scores that are caused by these factors. Given the limited data available on this issue, the committee does not offer a conclusion about the use of embedding to obtain aggregate information.
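
The contrast between random measurement error, which averaging reduces, and group-wide distortions, which averaging leaves in place, can be illustrated with a small simulation. This is only a minimal sketch and not an analysis performed by the committee: the score scale, the size of the per-student random error, and the size of the group-wide shift (standing in for something like a difference in accommodation rules or test context) are all hypothetical values chosen for illustration, and every student is given the same true score for simplicity.

```python
import random
import statistics


def simulate_group_mean(true_mean, n_students, error_sd, systematic_shift, seed):
    """Return the observed mean score for one simulated group of students.

    Each observed score is the (shared, hypothetical) true score plus a
    shift common to the whole group plus independent random error.
    """
    rng = random.Random(seed)
    observed = [
        true_mean + systematic_shift + rng.gauss(0.0, error_sd)
        for _ in range(n_students)
    ]
    return statistics.fmean(observed)


# All three values below are illustrative assumptions, not report data.
TRUE_MEAN = 250.0   # hypothetical score on a national scale
ERROR_SD = 30.0     # hypothetical random error for a single student
SHIFT = -8.0        # hypothetical distortion shared by the whole group

for n in (1, 25, 2500):
    unshifted = simulate_group_mean(TRUE_MEAN, n, ERROR_SD, 0.0, seed=1)
    shifted = simulate_group_mean(TRUE_MEAN, n, ERROR_SD, SHIFT, seed=1)
    # First column: group size. Second: error of the mean with random
    # error only. Third: error of the mean when the shift is also present.
    print(n, round(unshifted - TRUE_MEAN, 2), round(shifted - TRUE_MEAN, 2))
```

In this sketch the random component of the error in the group mean tends to shrink as the group grows, while the gap between the shifted and unshifted means stays at the size of the shift regardless of group size.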