1  Introduction: History and Context

Policy makers are caught between two powerful forces when it comes to testing in America's schools. One is the increased interest on the part of educators, reinforced by federal requirements, in developing tests that accurately reflect local educational standards and goals. The other is a strong policy push to gather information about the performance of students and schools relative to national and international standards and norms. The difficulty of simultaneously achieving these two goals is exacerbated by both the long-standing American tradition of local control of education and growing public sentiment that the nation's school children already face enough tests.

The search for a solution to this dilemma led Congress to request two separate studies from the National Research Council (NRC) to determine whether a common measure of student performance can be achieved by comparing or linking the results of different tests to each other and interpreting the results in terms of national or international benchmarks.

BACKGROUND

Despite significant state investments in standards and testing and in education generally, policy makers continue to look for clear evidence of how their states' students perform in comparison with students in other states and with national and international standards. The growing demand
for national and international comparative achievement data is reflected in the growing public attention to results of such assessments as the National Assessment of Educational Progress (NAEP) and the Third International Mathematics and Science Study (TIMSS), but these programs do not provide individual student results.

National comparability of individual test results is difficult to attain. The United States does not have a national examination system that can show how an individual student's achievement compares with that of students in other schools, districts, and states. There is no uniform curriculum for each school subject and no commonly accepted standards of academic performance. Instead, individual student achievement is currently measured by a variety of state-developed and commercially published tests. State tests are designed to evaluate students, schools, and school districts with respect to state goals, but they do not provide information that is useful in making comparisons across states. Standardized commercial tests can provide information for making comparisons across states among students who take the same test, but they cannot provide a common measure of achievement for students who take different tests, even when these tests appear to be similar (National Research Council, 1999c).

Differences across states go deeper than the specific tests they choose to use, to the actual goals and standards for learning in each subject area. There is no national consensus, for example, on exactly what constitutes the subject areas of 4th-grade reading and 8th-grade mathematics, nor on what mathematical skills an 8th-grade student ought to have mastered, nor on what constitutes reading and writing competence for a 4th-grade student.
Thus, different tests that ostensibly measure the same broad subject area can produce varying scores for the same students because the tests may emphasize different aspects of the subject area, such as algebra, computation, or graphical representation for 8th-grade mathematics. The lack of a readily available, nationally accepted "common currency" for describing and comparing individual student achievement leaves policy makers wondering what they can tell students and their families about how local students are performing relative to other students in the nation.

The first NRC study addressed the question of the feasibility of developing an equivalency scale that would allow test scores from commercially available standardized tests and state assessments to be compared with each other and with NAEP. The linkage study (National Research Council, 1999c) concluded that state assessments and commercial tests are too diverse to be meaningfully linked to a single common scale and that reporting student scores from different assessments on the same scale is therefore not feasible. Although some of the measures might be sufficiently similar in content and format to be linked, the study concluded that differences in administrative practices and test uses would limit the valid inferences that might be drawn about individual students. The study also concluded that linking an existing test or assessment to the NAEP scale is problematic unless the test to be linked is very similar to NAEP in content, format, and uses.

Policy makers accepted the report's conclusions, but the pressure to find ways to address the divergent goals of score comparability and local control of education did not disappear. In continuing to seek a viable means of deriving a common measure of student performance, and to do so efficiently, policy makers responded to the NRC report with several follow-up questions:

- Is there a way to combine elements of two different tests and get meaningful results for both?
- Can NAEP items or items from other nationally standardized tests simply be embedded in state tests in order to provide information related to national standards?
- Can one "sprinkle" a few items from one test in another test and lift the results out separately?
- Can one test be "attached" to or "contained within" another test?
- Are tests similar enough that common items can be found and used for different purposes at the same time?
At the same time that the NRC's linkage study was under way, preliminary work by Achieve, Inc., an independent policy organization, indicated widespread interest in trying to find strategies to answer those questions (Kronholz, 1998; Hoff, 1998).[1] After the NRC report, the notion of embedding items from one test in another to develop a common measure of student performance was thrust even more into the spotlight as a possible solution to the dilemma of score comparability with only a limited additional testing burden placed on the states. In response, the Committee on Embedding Common Test Items in State and District Assessments was charged specifically (under P.L. 105-277) with examining research and practice to determine whether embedding NAEP or other items in state and district tests of 4th-grade reading and 8th-grade mathematics is a technically feasible way of obtaining a valid and reliable common measure of individual student performance.

[1] After careful consideration of the issues surrounding the selection of items to be used for embedding and the potential technical and practical difficulties associated with embedding the identified items in differing state tests, Achieve abandoned its attempt to develop such strategies (Achieve, Inc., personal communication, March 13, 1999; Hoff, 1999).

COMMITTEE'S APPROACH

In accepting its charge, the committee acknowledged that the questions posed to it are important ones that reflect policy makers' keen desire for nationally comparable student achievement measures that can be developed without adding testing burdens to state programs. Therefore, in conducting its deliberations, the committee used the ability to achieve comparability with efficiency as one criterion for evaluating different strategies for embedding items to develop a common measure of individual student performance.

The committee made the assumption that the possibility of linking or embedding items in existing tests was being proposed as an alternative to the Voluntary National Tests (VNT) of 4th-grade reading and 8th-grade mathematics that were requested by President Clinton in his 1997 State of the Union address to Congress. While the committee takes no position on the overall merits of the VNT, it acknowledges that some of its findings and conclusions may be relevant to the technical and policy issues surrounding the tests.
The committee began by reviewing and accepting the evidence, conclusions, and relevance of two earlier related reports to Congress: Uncommon Measures: Equivalence and Linkage Among Educational Tests (National Research Council, 1999c) and Student Testing: Issues Related to Voluntary National Mathematics and Reading Tests (U.S. General Accounting Office, 1998). Because the committee accepted the conclusions of Uncommon Measures regarding the issues surrounding equating and linking, it focused its deliberations on the use of embedded items to develop a common measure that is not derived from linking or equating. Although the congressional conference agreement (U.S. Congress, 1998) that elaborated the committee's charge specifically states that ". . . including items from one test in another test for the purpose of providing a common measure of individual student performance is, effectively, a form of linking . . . ," the committee considered the full range of embedding techniques, including some that do not entail statistical linking. The committee deliberated about the ways in which using embedding to develop a common measure of student achievement is the same as or different from linking.

Definitions

To facilitate its discussions, the committee formalized several key definitions and developed three scenarios of ways in which embedding could be implemented.

Embedding

Embedding is the inclusion of all or part of one test in another. In this report, however, embedding refers only to the inclusion of part of a test in another, since embedding all of a test offers no gains in efficiency over administering two tests separately. Accordingly, the focus of this report is a discussion of methods of embedding that entail varying degrees of abridgment of either the test from which embedded material is drawn or the test into which another test is embedded. There are tradeoffs imposed by the method and degree of abridgment, that is, by how the embedded material is selected from the entire test and how much of the entire test is included. For example, embedding larger amounts of material is likely to increase the reliability of scores, but at the cost of increasing the testing burden.

Underlying our discussion of embedding are always two tests, which we call the "national test" and the "state test." The national test might be an actual test or testing program like NAEP or one of the commercially available achievement tests, or it might be some other large pool of nationally calibrated test items. In either case, performance on the national test items generates a "national score," the candidate for a common measure of individual student performance.
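The tradeoff just noted, that more embedded material yields more reliable scores at the cost of a greater testing burden, can be quantified with the classical Spearman-Brown formula. The sketch below is our illustrative addition, not part of the committee's analysis, and the reliability values in it are hypothetical.

```python
# Illustrative sketch (not from the report): the Spearman-Brown formula
# predicts how test reliability changes when a test is shortened or
# lengthened by a factor k, assuming the retained or added items are
# statistically parallel to the originals.

def spearman_brown(reliability: float, k: float) -> float:
    """Predicted reliability of a test whose length is scaled by factor k."""
    return (k * reliability) / (1 + (k - 1) * reliability)

# Hypothetical full-length national test with reliability 0.90:
full = 0.90
for k in (1.0, 0.5, 0.25):  # full test, half embedded, quarter embedded
    print(f"fraction embedded = {k:.2f} -> "
          f"predicted reliability = {spearman_brown(full, k):.2f}")
```

In this hypothetical case, embedding only a quarter of the national test drops the predicted reliability of the national score from 0.90 to about 0.69, which is the efficiency-versus-precision tradeoff described above.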
The "state test" is whatever state or local testing program is already in place, and it produces a "state score" for students that is distinct from the national score. The goal of embedding is to produce both a national score and a state score without administering two full-length, free-standing tests. Of course, embedding could take other forms, and the issues raised here would apply to them as well. However, because of NAEP's design, embedding NAEP material raises additional concerns (detailed in Chapter 2).

Two methods of embedding are included in our analysis: physical and conceptual. Physical embedding entails inserting material from the national test into a state's test booklets, either as a separate section of the state test or sprinkled throughout the state test. Conceptual embedding requires that the material from the national test be administered separately but close in time to the state test. Most of the embedding issues that the committee discusses arise in both cases, but conceptual embedding can be less subject to context effects (discussed in Chapter 2).

A Common Measure

The committee was charged with examining the usefulness of embedding items in state and district tests for the purpose of providing a common measure of individual student performance. But what is a "common measure"? A common measure is a single scale of measurement; scores from tests that are calibrated to this scale support the same inferences about student performance from one locality to another and from one year to the next. To provide a common measure, tests must conform to technical standards (American Educational Research Association et al., 1985; American Educational Research Association et al., in press) and must meet a number of additional criteria, some of which are discussed below. In addition, it should be noted that even tests that provide a common measure may differ in reliability; that is, scores from one may be more precise than scores from another. A given score indicates the same level of performance, no matter from which test or how the score was obtained.
The score might come from performance on a single test, from different tests that are calibrated to the same scale through linking, from extracts from a single test, or from estimates of student performance from a matrix-sampled assessment. A common measure does not necessarily imply a common or shared test. Common measures can be obtained from a common test that is always administered under standardized conditions, but they need not be. The motivation for this study, and for the study of linking reported in Uncommon Measures, is a widespread interest in obtaining comparable information about student performance without a common test: that is, without administering a full, common test in different states. Uncommon Measures (National Research Council, 1999c) explored whether linking could provide a common measure from different tests when no common test is used at all. This study explores whether embedding might serve that function, in particular, embedding parts of a common test into different state or district tests.

Three Scenarios

To make the issues we raise more concrete, we developed three specific scenarios around which we organize our discussion about embedding for a common measure of individual performance (discussed in Chapter 3). We use the administration of two free-standing tests (discussed in Chapter 2) as a standard with which to compare the three embedding scenarios. Although we believe that these three scenarios illustrate the most likely approaches to embedding, they do not represent an exhaustive inventory of embedding techniques:

- Double-duty scenario: A national test is administered independently of a state test, but some or all of the items from the national test are used with the state items in developing students' state scores.
- NAEP-blocks scenario: NAEP item blocks, which have been chosen to represent the complete NAEP assessment to some degree, are inserted into a state test booklet.
- Item-bank scenario: A national item bank is made available to local educational agencies, and state educators select the items they wish to use and embed them in their state tests.

Details about the design, analysis, and reporting for these scenarios are presented in Chapter 3, along with an evaluation of their technical quality for the purpose of producing a common measure of individual student performance. Our evaluation of these scenarios illustrates the advantages, disadvantages, and tradeoffs that are inherent in any proposal for creating a common measure through embedding.
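To give a concrete sense of what "calibrated to the same scale through linking" can mean in the simplest case, the sketch below applies linear (mean-sigma) linking, placing scores from a hypothetical state test onto a hypothetical national scale by matching the means and standard deviations of the two score distributions. It is our illustration only; the scores are invented, and the kinds of linking evaluated in Uncommon Measures involve much stronger requirements of content, format, and use than this arithmetic suggests.

```python
# Illustrative sketch only: linear "mean-sigma" linking, the simplest way
# to put two tests' scores on one scale. All numbers are hypothetical.
from statistics import mean, stdev

def mean_sigma_link(x, state_scores, national_scores):
    """Map a state-test score x onto the national-test scale by matching
    the mean and standard deviation of the two score distributions."""
    mu_s, sd_s = mean(state_scores), stdev(state_scores)
    mu_n, sd_n = mean(national_scores), stdev(national_scores)
    return mu_n + (sd_n / sd_s) * (x - mu_s)

# Hypothetical score samples from students who took both tests
state = [40, 50, 60, 70, 80]
national = [200, 225, 250, 275, 300]

# A score at the state mean maps to the national mean
print(mean_sigma_link(60, state, national))
```

Even when the arithmetic is this simple, the linked score supports valid inferences only if the two tests measure the same construct for the same population, which is precisely the condition Uncommon Measures found wanting across diverse state tests.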
Broader Issues

Although we focus mostly on whether a common measure of individual performance can be developed by embedding all or part of a test in another test, we identified a variety of other purposes for which policy makers may want a common measure of student performance, and we expanded our deliberations to consider them. They include:

- to report national test results from NAEP or other tests at the district or school level,
- to verify the level of rigor of local standards,
- to report NAEP results in non-NAEP administration years, and
- to audit changes in local test results over time.

Because these purposes involve comparisons of group performance, aggregated scores (scores representing a group of individuals, such as a school, district, or state) would be more useful than individual scores. We note some important attributes of these alternatives, but we did not deliberate about them at length. Chapter 4 reports our limited findings and conclusions about these other purposes for embedding.

Some of the conclusions contained in this report reflect the current diversity of state curricula and tests. If the goals and characteristics of state testing programs were to become markedly more similar than they currently are, some of the obstacles to embedding noted here would be ameliorated. However, recent developments do not suggest that this is likely to happen in the near future. In addition, we note that the impediments to successful embedding vary considerably in their tractability. Some could be surmounted by simple decisions about the operation of state testing programs, while others cannot be overcome without fundamental changes in curriculum and assessment.