Suggested Citation:"Common Measures for Purposes Other Than Individual Scores." National Research Council. 1999. Embedding Questions: The Pursuit of a Common Measure in Uncommon Tests. Washington, DC: The National Academies Press. doi: 10.17226/9683.

4
Common Measures for Purposes Other Than Individual Scores

Although education policy makers are interested in using embedding to develop comparable measures of individual student performance, many also want to know if embedding can be used to develop a common measure that can be used for other purposes. Those purposes include: reporting aggregated statistics for schools or districts on the scale of NAEP or another well-regarded national test; comparing state performance standards against national performance standards that are considered rigorous; reporting state performance on the NAEP scale in years when state NAEP is not administered or when particular subjects are not included in the NAEP assessment for that year; and auditing yearly gains on state tests. Accordingly, we comment briefly in this chapter on a number of such potential uses of embedding parts of a national test in state tests.

In considering the feasibility of using embedding to develop a common measure of aggregate performance, we use the same definition of a common measure that we use throughout the report: a common measure is a single scale of measurement; scores from tests that are calibrated to such a scale support the same inferences about student performance from one locality to another and from one year to the next.

The requirements for valid score interpretation are no less challenging in this context (aggregated results) than they are in the more familiar individual-differences context. Moreover, the evidence that might support the interpretations and uses of the test scores for individual students does not necessarily support the interpretations and policy uses of aggregated results (Linn, 1993a:5).

Many of the threats to inferences from embedded material stem from systematic differences among jurisdictions (see Chapter 2), which pose obstacles to the use of embedding to provide aggregated national scores for groups (e.g., schools or districts), just as they impede the provision of individual student scores. The sections below briefly discuss the use of embedding to develop a common measure for aggregates.

PROVIDING NATIONAL SCORES FOR AGGREGATES

States may be interested in obtaining national scores for aggregates, such as schools or districts. These aggregated national scores might be tied to NAEP or to another national test. Currently, the National Assessment Governing Board (NAGB) is considering options for providing district results for some districts that meet particular guidelines for participation. Its plans, which were discussed at the board meetings of March 4-6, May 13-15, and August 5-7, 1999, do not rely on either embedding or linking.

How is providing a common measure of district and school performance from embedding NAEP items or blocks in state tests the same as or different from providing a common measure of individual performance that is derived from embedded items? On the positive side, some of the things that affect the scores earned by individual students will average out in the aggregate. For example, students have good days and bad days, depending on their health, mood, amount of sleep, and so on. These factors can cause students' individual scores to fluctuate from day to day. In the aggregate, however, these fluctuations tend to average out and will therefore have less effect on the average test score earned by an entire school, and less yet on the average score obtained for an entire district. When only errors of this sort are involved, the precision of the estimate increases with the size of the group on which it is based.
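The averaging-out argument can be illustrated with a small simulation. The sketch below is not from the report. It assumes that each student's observed score differs from a fixed true score only by independent day-to-day noise, with a hypothetical standard deviation of 10 score points, and it compares the theoretical standard error of a group mean with a simulated one. The group sizes are illustrative stand-ins for a classroom, a school, and a district.

```python
import math
import random
import statistics

NOISE_SD = 10.0  # hypothetical day-to-day fluctuation, in score points


def theoretical_se(n_students: int, noise_sd: float = NOISE_SD) -> float:
    """Standard error of a group mean when errors are independent across students."""
    return noise_sd / math.sqrt(n_students)


def simulated_se(n_students: int, n_replications: int = 2000, seed: int = 0) -> float:
    """Empirical spread of the group-mean error over repeated hypothetical test days."""
    rng = random.Random(seed)
    mean_errors = []
    for _ in range(n_replications):
        errors = [rng.gauss(0.0, NOISE_SD) for _ in range(n_students)]
        mean_errors.append(statistics.fmean(errors))
    return statistics.pstdev(mean_errors)


if __name__ == "__main__":
    # Illustrative group sizes only; they are not taken from the report.
    for label, n in [("classroom", 25), ("school", 400), ("district", 2500)]:
        theory = theoretical_se(n)
        observed = simulated_se(n, n_replications=500)
        print(f"{label:9s} n={n:5d}  theory={theory:.2f}  simulated={observed:.2f}")
```

Under these assumptions the error in the mean shrinks in proportion to the square root of the group size, which is the sense in which precision increases with the size of the group. The result holds only when errors are independent from one student to the next.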

Similarly, the decrease in the reliability of individual scores that is caused by abridging the content of the national test to facilitate embedding is somewhat mitigated when aggregated scores are calculated. In addition, assessments that do not produce individual student scores can be designed to lessen the effect of abridgment by using a matrix-sampled design. With a design of this sort, individual students are administered abridged and sometimes unrepresentative portions of the test, but aggregated scores will still reflect the entirety of the test. This approach is used in both TIMSS and NAEP and could be used in an embedding design as well, as long as aggregated scores are the purpose of the embedding.
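A matrix-sampled design can be sketched in a few lines. The example below is a deliberately simplified rotation and is not the actual NAEP or TIMSS booklet design. The 30-item pool, the three blocks, and the 90-student school are hypothetical. It shows that each student sees only a third of the items while the group as a whole covers every item.

```python
from collections import Counter


def split_into_blocks(item_ids, n_blocks):
    """Partition the item pool into n_blocks interleaved, non-overlapping blocks."""
    return [item_ids[b::n_blocks] for b in range(n_blocks)]


def assign_blocks(n_students, blocks):
    """Rotate blocks across students so each block reaches a similar number of students."""
    return [blocks[s % len(blocks)] for s in range(n_students)]


def respondents_per_item(assignments):
    """Count how many students in the group were administered each item."""
    counts = Counter()
    for block in assignments:
        counts.update(block)
    return counts


if __name__ == "__main__":
    item_pool = list(range(30))                # hypothetical 30-item test
    blocks = split_into_blocks(item_pool, 3)   # three blocks of 10 items
    assignments = assign_blocks(90, blocks)    # hypothetical school of 90 students
    coverage = respondents_per_item(assignments)
    print("items per student:", len(assignments[0]))
    print("items covered by the group:", len(coverage))
    print("students per item:", set(coverage.values()))
```

Because every item is answered by some students, item-level statistics can be aggregated over the whole pool, even though no single student's responses represent the full test.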

On the negative side, however, many of the potential threats to valid inferences about individual students (discussed in Chapter 2) do not average out and therefore also pose serious threats to aggregated scores. These are factors that differ from one aggregate (e.g., classroom, school, or district) to another, not from one student to the next. For example, as discussed in Chapter 2, a variety of differences in context and administration could bias estimates of national scores for individual students. Students with the same level of mastery of the material should receive similar scores, but if a test is administered to them differently, they might obtain dissimilar scores solely because of those differences in administration. Among these differences in context and administration are decisions about which students are tested or excluded, the types of accommodations offered to students with special needs, and the dates on which tests are administered. These factors vary between groups rather than among students within a group. For example, two states may set different dates for test administration, but all students within each state will take the test at approximately the same time. When a given factor does not vary within the aggregate (whether it be a school, a district, or an entire state), combining results from students within that group will not average out its effects.
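The contrast with random error can be made concrete. The following sketch is an illustration under stated assumptions, not an analysis from the report. Every student in a group shares one hypothetical administration effect (here, 3 points lower, as might result from an earlier test date), and each student also has independent noise with a hypothetical standard deviation of 10 points. The function returns the error in the group's mean score.

```python
import random
import statistics


def group_mean_error(n_students: int, group_bias: float,
                     noise_sd: float = 10.0, seed: int = 1) -> float:
    """Error in a group's mean score: a shared bias plus averaged student-level noise."""
    rng = random.Random(seed)
    errors = [group_bias + rng.gauss(0.0, noise_sd) for _ in range(n_students)]
    return statistics.fmean(errors)


if __name__ == "__main__":
    # Hypothetical shared effect of -3 points; the group sizes are illustrative.
    for n in (25, 2_500, 250_000):
        print(f"n={n:7d}  error in group mean = {group_mean_error(n, -3.0):+.2f}")
```

As the group grows, the noise component shrinks toward zero but the error in the mean settles at the shared effect itself. Adding more students from the same group therefore does nothing to remove it.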

This problem is illustrated by rules for the inclusion of students with disabilities or with limited proficiency in English. State rules for the inclusion of these students in state testing programs vary markedly. The 1998 Annual Survey of State Student Assessment Programs, conducted by the Council of Chief State School Officers, indicated that most states leave decisions about the exclusion of students with limited English proficiency from state assessments to local committees or to the schools themselves (Olson et al., in press). In some states, such as California and New Mexico, such students account for more than 20 percent of the total, and the lack of comparability of inclusion guidelines could have a significant effect on state test results. The passage of the 1997 amendments to the Individuals with Disabilities Education Act (IDEA) is expected to lead to somewhat greater uniformity in inclusion practices for students with disabilities, but the decision regarding inclusion rests with school officials in most states, and there still may be significant state-to-state differences regarding which students are tested.

The most recent NAEP state-by-state reading results illustrate the effect that different decisions on inclusion may have on exam results. A study conducted by the Educational Testing Service for the National Center for Education Statistics found that an increase in the number of low-achieving students excluded from the assessment could boost the apparent increase in states' reading scores (Mazzeo et al., 1999). A worst-case model found that gains posted by at least two states might have been influenced appreciably by such increases in exclusion. Similarly, differences in the accommodations offered to students with disabilities who are included in the assessment can substantially alter aggregated scores (Halla, 1988; Huesman, 1999; Rudman and Raudenbush, 1996; Whitney and Patience, 1981). In comparing the scores from state testing programs, it is important to note that states do not uniformly include scores earned by disabled and limited-English-proficient students who were allowed accommodations during testing in their aggregated score summary reports (Olson et al., in press).
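The arithmetic behind the exclusion effect is simple to show. The sketch below uses ten made-up scores and a worst-case assumption that every excluded student is among the lowest scorers. It is not the model used in the study cited above, and the numbers are hypothetical.

```python
import statistics


def mean_after_exclusion(scores, n_excluded_lowest):
    """Mean of the scores that remain after dropping the lowest n_excluded_lowest."""
    kept = sorted(scores)[n_excluded_lowest:]
    return statistics.fmean(kept)


if __name__ == "__main__":
    # Hypothetical scores for ten students; not real assessment data.
    scores = [180, 190, 200, 210, 220, 230, 240, 250, 260, 270]
    for n_excluded in (0, 1, 2):
        print(n_excluded, mean_after_exclusion(scores, n_excluded))
    # prints 225.0, then 230.0, then 235.0
```

In this toy case, each additional low scorer excluded raises the reported mean by 5 points with no change in any student's achievement. A rise in the exclusion rate between two assessment years could, in the same way, look like a gain.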

STATE PERFORMANCE STANDARDS

Although it has never been formally published, Musick's "Setting Education Standards High Enough" is one of the most frequently requested publications produced by the Southern Regional Education Board. In it, Musick (1996:1) succinctly presents the issue of varying state standards:

If [states] don't talk to each other, the odds are great that 1) many states will set low performance standards for student achievement despite lofty sounding pronouncements about high standards, and 2) the standards for student achievement will be so dramatically different from state to state that they simply won't make sense . . . If what is taught in eighth grade mathematics in one state is much the same as what is taught in eighth grade mathematics in another state, how do we explain that one state has 84 percent of its students meeting its performance standards for student achievement while another state has 13 percent of its students meeting its standard? Do we really believe that this dramatic difference is in what these eighth grade students know about mathematics? Or is it possible that much of the difference is because one state has a low performance standard for student achievement and the other has a higher standard.

The paper's release in 1996 led policy makers across the country to ask, "Are our state's standards high enough?" To answer this question, and to address the related concern that some state standards may be unrealistically high, it has been suggested that what is needed is corroborative data from a national assessment whose standards are rigorous and widely accepted. The reporting metrics of NAEP, TIMSS, and the Organization for Economic Co-operation and Development's Programme for International Student Assessment (PISA) are mentioned as possible rulers against which state policy makers could gauge the relative difficulty of their performance standards.

The desire for this type of information leads to the question of whether strategies to embed items taken from one of these tests can be implemented for this purpose. Embedding has been used in this way in Louisiana. Hill and his associates (Childs and Hill, 1998) embedded released NAEP blocks in a field test of items for the new Louisiana Educational Assessment Program (LEAP) in order to put the LEAP items and the NAEP items on the same proficiency scale. They used the scale to compare the Louisiana performance standards with the NAEP performance standards. The main goal was simply to see if the Louisiana standards were as difficult as the national standards. The study found the state standards to be at least as difficult as the NAEP standards.

ESTIMATING STATE NAEP RESULTS IN YEARS THAT STATE NAEP IS NOT ADMINISTERED

States may be interested in obtaining estimates of performance relative to NAEP achievement levels or the NAEP scale for years when the NAEP state assessment is not administered in order to monitor progress and support trends with additional data points. Some policy makers and researchers have expressed an interest in using linking or embedding to obtain these estimates from state testing programs (McLaughlin, 1998; Bock and Zimowski, 1999).

If embedding is to be used for this purpose, the issues that arise are much the same as those that arise in any effort to link a state test to the NAEP scales or interpret the results in terms of NAEP performance standards (see Chapter 2). They are also the same as those issues that arise when trying to provide lower-level aggregated national scores by embedding NAEP items in state tests (see discussion above).

AUDITING THE RESULTS OF DISTRICT AND STATE ASSESSMENTS

Some states (or critics of state programs) are interested in using results from state NAEP or other tests, such as commercially available, norm-referenced tests, to validate gains on state tests. They argue that if gains on a state test are meaningful, they should be at least partly reflected in the states' performance on a well-respected external measure of student performance that tests the same subject area.

Auditing of this sort can be done on a limited scale with no linking or embedding whatsoever. For example, Hambleton et al. (1995) and Koretz and Barron (1998) evaluated gains on Kentucky's state test by comparing trends to those on state NAEP. However, the advantages and disadvantages of embedding national items in state tests to validate gains on state tests remain largely unexplored. It is not clear whether embedding would increase or decrease the accuracy of the inferences from auditing. Moreover, embedding NAEP blocks or material from any commercially available norm-referenced test could have undesirable consequences for the national test that serves as the source of the embedded items, especially if secure NAEP blocks are used for embedding. The additional exposure of these blocks could undermine the comparability of NAEP results, both across jurisdictions and over time. Thus, this use of embedding could necessitate increased development of test items and equating of those new items with existing items