3
Three Designs for Embedding
The issues raised in the preceding chapter lay the groundwork for evaluating embedding as a method of providing national scores for individual students. In this chapter the discussion focuses on three specific procedures for embedding, illustrated by three scenarios. The scenarios exemplify the general approaches that are most likely to be used at present to obtain national scores for individuals. Variants of these approaches might have somewhat different strengths and weaknesses, but the basic issues that arise in evaluating these three scenarios will apply.
The basis for comparison for evaluating the three embedding scenarios is the administration of both the state and national assessments in their entirety—two free-standing tests—discussed in Chapter 2. As is noted there, if the national test is administered following the procedures that were used when the test was standardized, and if the inferences drawn from the national test are appropriate—two major conditions—the approach can provide comparable national scores for individual students in different states.
Embedding creates at least one and often two changes, compared with two free-standing tests. First, one test is abridged. Second, to varying degrees, embedding generally changes the conditions of administration from those that would exist if the two tests were administered in their entirety and independently.
EMBEDDING WITHOUT ABRIDGMENT OF THE NATIONAL TEST: THE DOUBLE-DUTY SCENARIO
Administering a complete state test along with an entire national test—the two free-standing tests approach—involves some redundancy and wasted resources. The double-duty scenario is an effort to increase efficiency and reduce testing time by having some items from the national test serve double duty, contributing to both national and state scores without requiring the student to respond to duplicative items that measure the same construct, once as part of the national test and again as part of a state test. A number of states are currently using this approach in their state testing programs. The committee did not deliberate at length about the change in burden to states that implement the double-duty approach. However, we note that more than 20 states have already adopted this approach of their own accord, which suggests that they find it on balance worthwhile.
Design and Features
Before implementing this approach, state testing experts develop specifications for a state test (see Chapter 2). They compare these specifications with commercially published standardized achievement tests that produce individual student scores. Some of the items in these national tests match the state's test specifications closely; others do not. To gain the most efficiency, the state experts choose the national test that most closely matches their state's test specifications. They then identify the specific items in the national test that measure state standards and custom-develop items to measure any state specifications not sampled by the national test. Some states might use a large part of the national test for generating state scores; others might use very little of it.
Administration
The national test is administered in its entirety under its prescribed standardized conditions. State test items are not physically embedded in the national test, but they are administered close in time under whatever conditions the state determines to be appropriate. This administration procedure protects the national test from context effects.
Scoring and Analysis
As with two free-standing tests, in this scenario a student receives two sets of scores: a national score and a state score. The two scores are not independent because student responses to some national test items count in the state score as well as in the national score.
The national score reflects the entire national test; it is the same as would be obtained if the national test were administered with no connection to the state custom-developed items because the national test is not modified and is administered under its standardized conditions. Thus, the psychometric qualities of those scores (reliability, validity) are those the national test normally provides.
The state score reflects a student's responses to two sets of items: all of the custom-developed state items and the subset of items from the national test that pertain to the state's test specifications (see Figure 3-1). This subset might comprise most of the national test or only a small part of it, depending on a judgment about the extent of the match between the state's curriculum and the content of the national test.
For the state items, scoring procedures are developed and used as deemed appropriate by the state's educators. It is necessary to keep track of which items from the national test "count" in calculating the state score; it is these items that do double duty. The state education agency develops scores that meet its needs, such as a state-specific scale score or performance level system or state norms. The state items provide no scores referenced to national norms or performance levels.
Evaluation
The double-duty approach differs in a few key respects from the model of embedding that was the focus of Congress' charge to the committee. These differences are central to the evaluation of this approach to embedding.
Advantages
The gains in efficiency from this approach stem from eliminating the redundancy that occurs when students are asked to respond twice, on two different tests, to the same or similar items. The double-duty approach entails no abridgment of the national test. Accordingly, the troublesome issues noted in Chapter 2 that result from abridgment do not apply to the national test in the double-duty scenario.
The national test is administered in all jurisdictions following the procedures prescribed by the test publisher. Assuming that the state items are administered close in time to the national test, rather than at the same time, they are unlikely to change responses to the national test appreciably. For these reasons, the double-duty approach provides national scores for individual students that are essentially the same in quality as those that would be obtained in the absence of embedding.
Since most commercially available large-scale assessments provide different norms for different testing dates, the testing date for the national test is flexible if one of these tests is selected as the national test. The national test can be administered at a time that is convenient and appropriate for the administration of the state test (see Chapter 2).
Disadvantages
The success of the double-duty approach hinges on having an agreed-on national test that provides individual student scores. Currently, there is no such single national test. Thus, comparability of national scores is limited to the states that administer the same national test. Furthermore, this approach cannot be used with national tests that are matrix sampled in a manner that precludes providing individual scores—for example, NAEP.
The degree to which the double-duty approach provides efficiency gains while providing much the same information as would be obtained by administering two free-standing tests in their entirety depends on there being substantial overlap in content between the national test and
state curricula. The weaker the match between the content of the national test and state standards, the less the benefit of efficiency through double-duty items.
The two scores, state and national, may differ due to measurement error or a poor match between a state's curriculum and the national test. Such differences in scores can be confusing to students, parents, school administrators, and the public if they are not clearly explained.
In the double-duty scenario, some of the items that contribute to the state score are administered as part of the national test, rather than with the customized items developed by the state for its own purposes. To the extent that the administration of the national test differs from that of the state's custom items, students' performance on these items may be different than it would have been if they had been administered with the state's custom items because of factors discussed in the preceding chapter. For example, context effects could change students' performance on these items. The committee did not deliberate on the likely effects of these factors on state scores under the double-duty scenario.
Finally, in the current environment of varying but often intense accountability pressures, the national information obtained through the double-duty scenario will sometimes be suspect. That is, if teachers and students feel pressure to raise state scores and if part of the national test contributes to state scores, there may be incentives to engage in the types of inappropriate teaching to the test that can inflate scores. This effect could make the national scores and comparisons among jurisdictions misleading in some instances.
EMBEDDING REPRESENTATIVE MATERIAL: THE NAEP-BLOCKS SCENARIO
More pertinent to Congress' question than the double-duty scenario, but less commonly observed in practice, are embedding approaches in which a national test is abridged and the extract from the national test is embedded in a full state test. One variant of this approach is the NAEP-blocks scenario, in which a portion of NAEP is embedded in a state assessment.
Design and Features
Three blocks taken from either the 8th-grade NAEP mathematics assessment or the 4th-grade reading assessment are administered contemporaneously and intact, with separate timing, as a part of the state assessment.1 All students take the embedded blocks along with the state assessment; see Figure 3-2.
The NAEP blocks can be either physically or conceptually embedded in the state assessment. If they are physically embedded, they would presumably be administered first in order to minimize context effects. If
the NAEP blocks are not physically embedded, they would be administered within a short time of the state test.
The NAEP-blocks scenario illustrates one way of resolving the tradeoff between local control and consistency of national testing. In this scenario, local control is limited, and standardization of national testing among states is substantial. States cannot pick items individually; all items within the chosen blocks are used, and no items from other blocks are added.
Administration
In this scenario the state assessment is administered in its entirety. The administration of the embedded blocks mimics the administration of NAEP as much as possible in order to minimize distortions arising from administrative differences. For example, the date of administration would fall near the midpoint of NAEP's range of testing dates. Similarly, electronic calculators would be provided for the items in the selected NAEP mathematics blocks that require their use. NAEP guidelines for inclusion and use of accommodations would be followed. Students who are unfamiliar with the format of NAEP questions would be provided with a pretest orientation.
Administration of three embedded blocks requires approximately 45-75 minutes of testing time. Embedding more blocks would increase the accuracy of student scores and would improve the representation of NAEP content, but at the cost of creating an additional testing burden.
Scoring and Analysis
Students receive two sets of scores: the scores normally provided by the state assessment, and a designation of their NAEP performance level (possibly along with their NAEP proficiency score), together with an indication of the associated margin of error.
The state score is based on either the state test alone or the state test in conjunction with any NAEP items the state considers appropriate. The national score is based on the NAEP items alone. In theory, the NAEP items could be linked to state items to provide a more reliable estimate of student performance, but as noted in Uncommon Measures (National Research Council, 1999c), this approach faces major obstacles and is generally not practical.
The state assessment is analyzed and scored separately, using the same procedures that are normally used for that state assessment. The
NAEP items are scored separately.2 The item responses are used, with the NAEP item information, to estimate a NAEP proficiency score for each student, as well as the performance level of the student. The quality of the link between student performance on the embedded items and the NAEP scale will depend on the length of the embedded segment and on how well it represents the full national assessment. A more elaborate version of this plan could be used to link the state assessment with the NAEP scale. It would involve a very substantial investment in a unique statistical analysis, and it would be subject to the problems that exist for any link, if the NAEP assessment does not match the state assessment.
Evaluation
As noted, this scenario was chosen to illustrate a relatively high level of standardization across states and a relatively low level of local control.
Advantages
The requirement that states use the same fixed set of NAEP blocks would provide a consistent basis for comparisons among states. In addition, this scenario makes the embedded material more nearly representative of the NAEP assessment than it would be if items were chosen freely by states. In general, the increase in standardization—of content and administration—would increase the comparability of scores across states.
Disadvantages
In practice, the NAEP-blocks scenario faces substantial obstacles. Although states use a fixed set of NAEP blocks, the content of the embedded material would not be fully representative of NAEP. Individual NAEP blocks are not constructed to represent the entirety of the assessment, and even a set of three blocks is likely to provide an unbalanced or less than complete representation of the NAEP assessment. This lack of representativeness would likely be exacerbated if states are restricted to using publicly released NAEP blocks, which is likely, because allowing
widespread use of unreleased NAEP blocks would jeopardize NAEP's security and threaten the integrity of NAEP results.
Even if the content of the embedded blocks were fully representative of NAEP, it would be difficult to obtain scores comparable to the performance estimates provided by NAEP. NAEP uses an elaborate statistical process called "conditioning" (see, e.g., Beaton and Gonzales, 1995) to adjust for the fact that each student takes a different, small part of the full assessment. This process creates some intermediate computed quantities called "plausible values" for each student, based on both cognitive information (performance on test items) and noncognitive information (characteristics of students). When aggregated with similar values from other test takers, these quantities provide good estimates of the distribution of student performance on the NAEP scale. However, the plausible values are not scores. A different method of scoring, based only on a student's performance on the test items, would be needed for generating individual student scores. One consequence of the changed method of scoring is that the distribution of the resulting scores from embedding would differ from the reported NAEP distributions.
In addition, administration conditions, time of testing, and criteria for excluding students from participation because of disabilities or limited proficiency in English may not be the same for the NAEP items administered with the state test as for the same items administered as part of NAEP. Consequently, states might be required to administer their state assessments at a time that is more appropriate for the embedded test than for their own tests and to follow NAEP guidelines for inclusion and accommodation of students with disabilities and limited proficiency in English.
Motivational differences are another threat to the comparability of scores. Students face no consequences for their performance on NAEP as it is currently administered. In the current climate of accountability, however, they (or their teachers, or both) often face serious consequences for their scores on state tests. This difference could result in scores on the embedded material that are higher than those on NAEP itself for students with identical levels of proficiency.
Context effects could also make scores noncomparable (see Chapter 2). One could minimize these effects by administering the NAEP items separately or at the beginning of the state test. However, such tactics might not suffice to eliminate context effects entirely. For example, if NAEP items were presented first, performance might be affected by the overall directions given at the outset of the test or by the prospect of a
lengthier testing period than that of NAEP. Similarly, even with efforts to standardize administration, some unintended differences in administration might remain, and these could undermine the comparability of scores (see Hartka and McLaughlin, 1994).
Because the precision of scores is in part a function of the length of a test, embedding poses a tradeoff between accuracy and burden. The NAEP-blocks scenario would add 45-75 minutes of testing time per subject, yet it would provide very imprecise estimates of the performance of individual students—too imprecise for many purposes. If estimated scores are used to provide a performance-level classification, the classification would be prone to error.
A study using similar methodology was conducted by McLaughlin (1998), who reported that a 95 percent confidence interval spanned a range of 70 points for an estimated individual NAEP score on 8th-grade mathematics. Given this confidence interval width, approximately 14 percent of the students could be expected to be classified in a level below their true achievement level, and about 16 percent in a higher level, with about 70 percent assigned to the correct performance level.
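The relationship between a confidence interval of this width and the resulting classification error can be sketched in a short simulation. The scale mean, standard deviation, and achievement-level cut scores below are illustrative assumptions, not NAEP's published values; the measurement error is derived from the 70-point confidence interval cited above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed (hypothetical) scale: mean 275, SD 35, with three cut scores
# separating four achievement levels.  A 70-point 95 percent confidence
# interval implies a measurement SD of roughly 70 / (2 * 1.96) ~= 17.9.
TRUE_MEAN, TRUE_SD = 275.0, 35.0
CUTS = [262.0, 299.0, 333.0]
MEAS_SD = 70.0 / (2 * 1.96)

def level(score):
    """Map a scale score to an achievement level (0-3) via the cut scores."""
    return sum(score >= c for c in CUTS)

n = 100_000
true_scores = rng.normal(TRUE_MEAN, TRUE_SD, n)
observed = true_scores + rng.normal(0.0, MEAS_SD, n)  # add measurement error

true_levels = np.array([level(s) for s in true_scores])
obs_levels = np.array([level(s) for s in observed])

correct = np.mean(obs_levels == true_levels)
too_low = np.mean(obs_levels < true_levels)
too_high = np.mean(obs_levels > true_levels)
print(f"correct: {correct:.2f}, too low: {too_low:.2f}, too high: {too_high:.2f}")
```

Under these assumed parameters, roughly two-thirds to three-quarters of students land in the correct level, with the remainder split between adjacent levels, broadly consistent with the figures McLaughlin reports.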
Finally, embedding NAEP blocks could have undesirable consequences for NAEP. As noted, if secure blocks are used for embedding, the additional exposure of these blocks could undermine the comparability of NAEP scores. For example, if some teachers tailor instruction directly to secure NAEP items because they expect them to appear on state tests, the result could be distortions of comparisons that are based on NAEP scores. NAEP trends might appear more favorable than they really are, and some comparisons among states could be biased. Depending on the degree of similarity between released and secure blocks, embedding released blocks could also threaten NAEP scores, although probably less so.
EMBEDDING UNREPRESENTATIVE MATERIAL: THE ITEM-BANK SCENARIO
A counterpoint to the NAEP-blocks scenario is the item-bank scenario, which entails a great degree of local discretion and accordingly less standardization.
Design and Features
In the item-bank scenario, a set of test items is made available to a state testing agency. These items may come from a well-established
national or international assessment program, such as NAEP or the Third International Mathematics and Science Study (TIMSS), or any other respected source, such as an interstate item development and testing consortium or a commercial test publisher's existing item banks, or they can be created specifically for use in the item bank. The items are calibrated—that is, their difficulty is estimated—with information from a national calibration study. In some respects, the item bank is like a very long national test. State testing agencies choose items from the bank, individually or in sets, and include the selected items in their state tests.
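The calibration step mentioned above can be sketched in miniature. The sketch below simulates a national calibration sample under a Rasch (one-parameter logistic) model and recovers item difficulties; the sample size, item count, and the crude logit-of-proportion-incorrect estimator are illustrative assumptions, since operational programs use more elaborate estimation methods.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical calibration sample: 5,000 students, 10 items.
n_students, n_items = 5000, 10
ability = rng.normal(0.0, 1.0, n_students)      # simulated student abilities
difficulty = np.linspace(-1.5, 1.5, n_items)    # true item difficulties

# Rasch model: P(correct) = 1 / (1 + exp(-(ability - difficulty)))
p = 1.0 / (1.0 + np.exp(-(ability[:, None] - difficulty[None, :])))
responses = (rng.random((n_students, n_items)) < p).astype(int)

# Crude difficulty estimate: logit of the proportion answering incorrectly.
# This attenuates the scale but preserves the ordering of item difficulties.
p_correct = responses.mean(axis=0)
est_difficulty = np.log((1 - p_correct) / p_correct)

print(np.round(est_difficulty, 2))
```

Once every item in the bank carries such a difficulty estimate from a common calibration sample, scores based on different subsets of items can, in principle, be placed on the same scale.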
Although item banks can be used in various ways, in this scenario it is assumed that states choose items on the basis of a match with their curricula or other considerations, with no consideration given to maintaining comparability across states in the items selected. States can also choose varying numbers of items to embed.
Selected items are physically embedded in the state assessment and can be either freely interspersed or inserted as one or more discrete blocks. Timing and administrative conditions are determined by the individual state testing programs.
Administration
The selected national items are interspersed in the state test and the state test, including the embedded items, is administered as a single unit; see Figure 3-3.
Scoring and Analysis
Students receive two scores, a state score and a national score. The state score could be based either on the state items alone or on a combination of the state items and some or all of the items chosen from the item bank. Similarly, the national score could be based either on the national items alone or by linking them with some of the state items. In these respects, the item-bank approach is similar to the NAEP-blocks approach.
This design can theoretically produce individual national scores on any national test, including assessments such as NAEP or TIMSS. However, the content of NAEP and TIMSS is broader than that which would normally be covered by an individually administered assessment, let alone a small number of embedded items. Therefore, NAEP and TIMSS are unlikely candidates for the national score that this scenario could produce.
Evaluation
As noted, the item-bank scenario represents the greatest amount of local control and the least amount of standardization across jurisdictions. It is also the only scenario that involves embedding items that are not common to all states.
Advantages
For certain purposes, the item-bank scenario has advantages relative to the NAEP-blocks scenario. For example, embedding items from an item pool responds to the desire to maintain state standards by placing all
control for selecting items, administering a test, and constructing scores with the state testing agency.
For some purposes it is a convenient method for providing localities and states with well-constructed, field-tested items. In some situations it is also very efficient; it allows states to use items relevant to their purposes without expending state resources or testing time on items of little interest to them.
Disadvantages
The item-bank scenario is very poorly suited for the purpose of providing comparable national scores for individual students. For this purpose, it shares the problems noted for the NAEP-blocks design and faces numerous others, as well.
The element of choice entailed by the item-bank scenario undermines the comparability of ostensibly national scores. The subsets of items chosen by states would not necessarily be representative of the item pool itself and would not have to be similar across states. States could choose items on which their students are likely to do particularly well, given their curricula, and avoid those on which their students are likely to do poorly. Indeed, simply attempting to align the selected items with curricula would likely bias scores upward, relative to those that would be obtained if the entire item bank, or a representative sample from it, were used.
In other words, the process of choice would undermine the calibration of items provided by the national calibration sample. By allowing states to choose items freely, the system also allows them to reallocate instructional effort away from excluded items and toward included ones. Such an effect will make included items seem easier, and it would similarly make excluded ones seem harder.
The item-bank scenario also raises problems of item security, some of which are more serious than those raised by the NAEP-blocks scenario. For example, suppose that state A uses certain items in March testing, while state B uses some of the same items in late April. Information on the items used in state A might be obtained by teachers or students in state B, allowing inappropriate coaching that would inflate scores. If the embedded items come from a secure source, such as nonreleased NAEP blocks or commercial test publishers' item banks, embedding them repeatedly in state assessments undermines their security. If the national
item pool is developed from publicly released items, such as released NAEP blocks, issues related to familiarity with the items or inappropriate teaching to the test may undermine the comparability of the scores.
Because the item-bank scenario does not impose uniformity of scheduling or administration, differences in these factors could also undermine the comparability of scores across states. States might differ, for example, in terms of the dates chosen for testing, the placement of embedded items, the degree of time pressure imposed in testing, the inclusion of low-achieving groups, the use of accommodations, or many other aspects of administration. As noted in Chapter 2, each of these factors has the potential to affect scores substantially, thus undermining comparability.
The item-bank scenario places a considerable data-analysis burden on states and raises the possibility of untimely reporting of scores. Obtaining a common measure will involve time-consuming, burdensome analysis of empirical results. Data must come either from pretesting the entire assessment in another jurisdiction or from the current assessment. Pretesting must be done a year in advance, to avoid the time-of-year problem (see Chapter 2). Using data from the current assessment means that scores cannot be reported immediately but must wait several months for the analyses to be completed.
Finally, states would have to deal with the political difficulty of different test scores from the "same" test administration that rank students differently, since two distinct scores are reported.
EVALUATION OF THE SCENARIOS
The three scenarios differ along several dimensions: the representativeness of the embedded material versus added testing burden for students; the amount of standardization in administration versus the degree of local control; and the extent of the burden placed on states.
A major purpose of embedding is to provide two scores, a national and a state score, without significantly adding to the amount of time a student spends taking tests. The standard of comparison, two free-standing tests, creates the largest testing burden. Since all of the embedding scenarios involve abridging one or the other of the two tests, the testing burden is reduced relative to administering two tests in their entirety. The relative gains in efficiency, however, depend largely on the degree of abridgment. There is a tradeoff: the greater the degree of
abridgment, the greater the likelihood that the abridgment could lead to lower score accuracy (see Chapter 2).
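The tradeoff between abridgment and score accuracy can be illustrated with the Spearman-Brown prophecy formula, which projects the reliability of a shortened test from the reliability of the full-length version. The reliability value and length fraction below are illustrative assumptions, not figures from this report.

```python
def spearman_brown(rho_full: float, length_fraction: float) -> float:
    """Projected reliability of a test shortened (or lengthened) to
    length_fraction of its original length, given full-length
    reliability rho_full."""
    k = length_fraction
    return k * rho_full / (1 + (k - 1) * rho_full)

# Assumed example: a full national test with reliability .90,
# abridged to one-third of its items.
print(round(spearman_brown(0.90, 1 / 3), 2))  # -> 0.75
```

Even this rough projection shows why heavily abridged embedded material yields individual scores noticeably less precise than the full national test provides.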
All three scenarios require some change in the state test or testing program. Such changes may interrupt long-term trend information that is of value to the states. For example, some states have developed elaborate test-based accountability systems that rely on longitudinal analysis of test data, the results of which are used to support high stakes rewards and sanctions for schools and districts.
Some states construct their examination forms so that the difficulty level of the overall test is similar from year to year. If the items to be embedded have differing overall difficulty levels, one form of the test could become more difficult than another—particularly if the embedded material changes from year to year.
The validity of the state tests may be compromised if embedded items appear in the middle of an examination and represent material the students have not had the opportunity to learn. For many students, this situation could cause additional anxiety, resulting in a lower score. This is a particular problem when the national test is physically embedded in the state test, as is the case with the item-bank and the NAEP-blocks scenarios.
Other issues also arise. For example, if the national test that is selected has norms at only one grade and testing date (e.g., the proposed VNT or NAEP), the state agency must administer that test at the grade and testing date dictated by the national test, even if it is not an optimum time for the state's own test to be administered. If states want different items to appear in the state test in different years, for security or other purposes, the state education agency will have to construct tests that are parallel in content and psychometric characteristics from year to year and to perform appropriate equating analyses. This is a difficult and costly endeavor.
If national items are physically embedded in state tests, the various accommodations that are made available for the state tests would have to be available for the embedded items. Suppose each state makes all of its accommodations available for the national items (which at best may involve considerable work and expense and at worst may not be possible): then the national items will vary from state to state in how they are administered, which violates an essential condition for obtaining a common measure. This result effectively renders the results from the national items noncomparable across states—unless all of the states can be convinced to offer the same accommodations. When the national test is
administered separately, differences in accommodation practices among states will not affect national scores, but they will affect the type of information that is available for accommodated students. These students will earn only a partial state score. Partial data decreases the amount and type of achievement information that can be made available to them, their parents, and their teachers.
All of the embedding scenarios make considerable demands on local resources in terms of development, analysis of test items, or both. For example, in the NAEP-blocks scenario, although the National Assessment Governing Board (NAGB) selects and makes available a set of NAEP blocks, together with the procedures for scoring them, states will have to train staff to score the constructed-response items or else contract with the organization that scores NAEP items for NAGB. The demand on state resources is even greater for the item-bank scenario, since much technical work is involved at the item level and a linking study would be required. All three scenarios can also place additional financial burdens on states, especially if states are asked to bear any of the cost for development of new state tests, data analysis, additional scoring, or printing and distributing new test booklets.
All three scenarios also raise the problem of timeliness. Timely reporting of scores requires much advance preparation and extensive data analysis, which may strain local resources. In the NAEP-blocks and the item-bank scenarios, if local items are to be combined with the national items, the necessary analyses cannot readily be done before giving the assessment. But if the analyses must be done after the assessment has been given, the scores will not be available to students, parents, or teachers in a timely fashion. A student beginning the 9th grade might not benefit from learning that his or her 8th-grade mathematics performance was at a basic level. Students and their teachers want to know how they are doing now, not how they did 6 months ago. In the double-duty scenario, the national score can be provided very quickly, but the local data might be slow in coming, because student scores on the state items cannot be provided before national test results are made available.
Although embedding appears at first to be a practical answer to policy makers' goal of obtaining data on student achievement relative to national standards with little or no added test burden for students and minimum disruption of state testing programs, myriad problems, as illustrated by these three scenarios, make that goal elusive.