
## 2 Environment for Embedding: Technical Issues

This chapter describes a number of issues that arise when embedding is used to provide national scores for individual students. In keeping with Congress's charge, we focus our attention primarily on embedding as a means of obtaining individual scores on national measures of 4th-grade reading and 8th-grade mathematics. The issues discussed here would arise regardless of the grade level or subject area, although the particulars would vary.

### SAMPLING TO CONSTRUCT A TEST¹

To understand the likely effects of embedding, it is necessary to consider how tests are constructed to represent subject areas. For present purposes, a key element of this process is sampling. A national test represents a sample of possible tasks or questions drawn from a subject area, and the material to be embedded represents a sample of the national test.
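The two-stage sampling described above can be sketched in a few lines of code. The pool size and test lengths here are invented for illustration only; they are not taken from NAEP or any state program:

```python
import random

# Illustrative only: the pool size and test lengths below are invented,
# not taken from NAEP or any state program.
random.seed(0)  # fixed seed so the sketch is reproducible

domain = [f"task_{i}" for i in range(10_000)]      # the subject-area domain
national_test = random.sample(domain, 60)          # one sample from the domain
embedded_items = random.sample(national_test, 20)  # a sample of the national test

# By construction, every embedded item is a national-test item,
# but most national-test items never appear in the embedded material.
assert set(embedded_items) <= set(national_test)
print(len(domain), len(national_test), len(embedded_items))  # 10000 60 20
```

Each stage discards most of what the previous stage contained, which is why the representativeness questions discussed in this chapter compound when material is embedded.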

Tests are constructed to assess performance in a defined area of knowledge or skill, typically called a domain. In rare cases, a domain may be

¹ This material is a slight revision of a section of *Uncommon Measures* (National Research Council, 1999c:12-14).

The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001


FIGURE 2-1 Decision stages in test development.

in terms of the content to be included and the processes that students must master in dealing with the content. The NAEP 8th-grade mathematics framework represents choices about how to assess achievement in the content of 8th-grade mathematics. It identifies conceptual understanding, procedural knowledge, and problem solving as facets of proficiency, and it determines whether basic knowledge, simple manipulation, and understanding of relationships are to be tested separately or in some context.

Choices made at the next stage, test specification, outline how a test will be constructed to represent the content and skills areas defined by the framework. Test specifications, which are aptly called the test blueprint, specify the types and formats of the items to be used, such as the relative number of selected-response items and constructed-response items. Designers must also specify the number of tasks to be included for each part of the framework. Some commercial achievement tests, for example, place a much heavier emphasis on numerical operations than does NAEP. Another choice for a mathematics test is whether items can be included that are best answered with the use of a numerical calculator. NAEP includes such items, but the Third International Mathematics and Science Study (TIMSS), given in many countries around the globe, does not. The NAEP and TIMSS frameworks are very similar, yet the two assessments have different specifications about calculator use.

Following domain definition, framework definition, and test specification, the final stage of test construction is to obtain a set of items for the test that match the test specification. These can come from a large number of prepared items or they can be written specifically for the test that is being developed.
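A test blueprint of the kind described above can be thought of as a table of required item counts per content-by-format cell, against which a candidate item set is checked. The sketch below is purely hypothetical; the categories and counts are invented and are not NAEP's actual specifications:

```python
# Hypothetical blueprint: required item counts per content-by-format cell.
# The categories and counts are invented for illustration; they are not
# NAEP's actual specifications.
blueprint = {
    ("number sense", "selected-response"): 10,
    ("number sense", "constructed-response"): 2,
    ("geometry", "selected-response"): 6,
    ("geometry", "constructed-response"): 2,
}

def meets_blueprint(items, blueprint):
    """Return True if the candidate item set supplies exactly the counts
    the blueprint calls for in every content/format cell."""
    counts = {}
    for content, fmt in items:
        counts[(content, fmt)] = counts.get((content, fmt), 0) + 1
    return counts == blueprint

candidate = (
    [("number sense", "selected-response")] * 10
    + [("number sense", "constructed-response")] * 2
    + [("geometry", "selected-response")] * 6
    + [("geometry", "constructed-response")] * 2
)
print(meets_blueprint(candidate, blueprint))  # True
```

Dropping or adding even one item breaks the match, which is why abridging a test for embedding necessarily departs from the blueprint the full test was built to satisfy.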
Newly devised items are often tried out in some way, such as including them in an existing test to see how the items fare alongside seasoned items. Responses to the new trial items are not included in the score of the host test. Test constructors evaluate new items with various statistical indices of item performance, including item difficulty and the relationship of the new items to the accompanying items.

### COMMON MEASURES FROM A COMMON TEST

To clarify the distinction between common tests and common measures, and to establish a standard of comparison for embedding, we begin our discussion of methods for obtaining individual scores on a common measure with an approach that entails neither linking nor embedding,

FIGURE 2-2 Manipulatives allowed on 4th-grade reading and 8th-grade mathematics components; number of states. SOURCE: Adapted from Olson et al. (in press).

the staff read the instructions for completing the test to the examinees from a script, which is designed to ensure that all students receive the same instructions and the same amount of time for completing the test. If all of the test administrators adhere to the standardized procedures, there is little cause for concern. Some measurement specialists have expressed concern, however, that teachers may vary in how strictly they adhere to standardized testing procedures (see, e.g., Kimmel, 1997; Nolen et al., 1992; Ligon, 1985; Horne and Garty, 1981). If different states provide different directions for the national test, offer different opportunities to use calculators or manipulatives (see Figure 2-2), impose different time limits, or break the test into a different number of testing sessions, seemingly comparable scores from different states may imply different levels of actual proficiency.

#### Accommodations

One of the ways in which the standardized procedures for administration are deliberately violated is in the provision of special accommodations for students with special needs, such as students with disabilities or with limited proficiency in English. Accommodations are provided to offset biases caused by disabilities or other factors. For example, one cannot obtain a valid estimate of the mathematics proficiency of a blind student unless the test is offered either orally or in Braille. Other examples include extra time (a common accommodation), shorter testing periods with additional breaks, and use of a scribe for recording answers; see Table 2-1 for a list of accommodations that are used in state testing programs.

Two recent papers prepared by the American Institutes for Research (1998a, 1998b) for the National Assessment Governing Board summarize much of the research on inclusion and accommodation for limited-English-proficient students and for students with disabilities. However, information about the appropriate uses of accommodations for many types of students is unclear, and current guidelines for their use are highly inconsistent from state to state (see, e.g., National Research Council, 1997). Differences in the use of accommodations could alter the meaning of individual scores across states, and the lack of clear evidence about the effects of accommodations precludes taking them into account in comparing scores (see, e.g., Halla, 1988; Huesman, 1999; Rudman and Raudenbush, 1996; Whitney and Patience, 1981; Dulmage, 1993; Joseph, 1998; Williams, 1981).

#### Timing of Administration

The time of year at which an assessment is administered can have potentially large effects on the results (see Figure 2-3 for a comparison of state testing schedules).
Students' educational growth is uneven over the school year and differs across test areas (Beggs and Hieronymus, 1968). In most test areas, all of the growth occurs during the academic year, and in some areas students actually regress during the summer months (Cooper et al., 1996). The best source of data documenting student growth comes from the national standardizations of several widely used achievement batteries. These batteries place the performance of students at all grade levels

OCR for page 14
TABLE 2-1 Accommodations Used by States

| Type of Accommodation Allowed | Number of States |
| --- | --- |
| **Presentation format accommodations** | |
| Oral reading of questions | 35 |
| Braille editions | 40 |
| Use of magnifying equipment | 37 |
| Large-print editions | 41 |
| Oral reading of directions | 39 |
| Signing of directions | 36 |
| Audiotaped directions | 12 |
| Repeating of directions | 35 |
| Interpretation of directions | 24 |
| Visual field template | 12 |
| Short segment testing booklet | 5 |
| Other presentation format accommodations | 14 |
| **Response format accommodations** | |
| Mark response in booklet | 31 |
| Use of template for recording answers | 18 |
| Point to response | 32 |
| Sign language | 32 |
| Use of typewriter or computer | 37 |
| Use of Braille writer | 18 |
| Use of scribe | 36 |
| Answers recorded on audiotape | 11 |
| Other response format accommodations | 8 |
| **Test setting accommodations** | |
| Alone, in study carrel | 40 |
| Individual administration | 23 |
| With small groups | 39 |
| At home, with appropriate supervision | 17 |
| In special education class | 35 |
| Separate room | 23 |
| Other test setting accommodations | 10 |
| **Timing or scheduling accommodations** | |
| Extra testing time (same day) | 40 |
| More breaks | 40 |
| Extending sessions over multiple days | 29 |
| Altered time of day | 18 |
| Other timing-scheduling accommodations | 9 |
| **Other accommodations** | |
| Out-of-level testing | 9 |
| Use of word lists or dictionaries | 13 |
| Use of spell checkers | 7 |
| Other | 7 |

SOURCE: Adapted from Roeber et al. (1998).

FIGURE 2-3 Time of administration of 4th-grade reading and 8th-grade mathematics components during the 1997-1998 school year; number of states. NOTES: States were counted more than once when their assessment programs contained multiple reading or mathematics components that were administered at different times during the year. SOURCE: Adapted from Olson et al. (in press).

(K-12) on a common scale, making it possible to estimate the amount of growth occurring between successive grade levels.²

The effect that time of year for testing could have on the absolute level of student achievement is illustrated in Table 2-2. This table shows

² The most recent national standardizations of the Stanford Achievement Test (SAT), the Comprehensive Tests of Basic Skills (CTBS), and the Iowa Tests of Basic Skills (ITBS)/Iowa Tests of Educational Development (ITED) showed very similar within-

incorporate them, or nearly identical items, in their instruction. This problem of inappropriate coaching, or teaching to the test, is especially apparent if the stakes associated with test performance are high. To circumvent these problems, most commercial testing programs create several equivalent forms of the same test. The equivalent forms may be used on specified test dates or in different jurisdictions. However, creating equivalent versions of the same test is a complex and costly endeavor, and test publishers do not develop unlimited numbers of equivalent forms of the same test. Consequently, varying dates of test administration pose a security risk.

#### Stakes

Differences in the consequences or "stakes" attached to scores can also threaten the comparability of scores earned on the same free-standing test. The stakes associated with test results will affect student test scores by affecting teacher and student perceptions about the importance of the test, the level of student and teacher preparation for the test, and student motivation during the test (see, e.g., Kiplinger and Linn, 1996; O'Neil et al., 1992; Wolf et al., 1995; Frederiksen, 1984). The specific changes in student and teacher behavior spurred by high stakes will determine whether differences in stakes undermine the ability of a free-standing test to provide a common measure of student performance.

For example, suppose that state A imposes serious consequences for scores on a specific national test, while state B does not. This difference in stakes could raise scores in state A, relative to those in state B, in two ways. Students and teachers in state A might simply work harder to learn the material the test is designed to represent (the domain). In that case, higher scores in state A would be appropriate, and the common measure would not be undermined.
However, teachers in state A might find ways to take shortcuts, tailoring their instruction closely to the content of the test. In that case, gains in scores would be misleadingly large and would not generalize to other tests designed to measure the same domain. In other words, teachers might teach to the test in inappropriate ways that inflate test scores, thus undermining the common measure (see, e.g., Koretz et al., 1996a, 1996b).

States administer tests for a variety of purposes: student diagnosis, curriculum planning, program evaluation, instructional improvement, promotion/retention decisions, graduation certification, diploma endorsement, and teacher accountability, to name a few. Some of these purposes,

such as promotion/retention, graduation certification, diploma endorsement, and accountability, are high stakes for individuals or schools. Others, such as student diagnosis, curriculum planning, program evaluation, and instructional improvement, are not.

### ABRIDGMENT OF TEST CONTENT FOR EMBEDDING

In the previous section we outlined a variety of conditions that must be met to obtain a common measure and described how the policies and practices of state testing programs make those conditions difficult to achieve, even when embedding is not involved. Embedding, however, often makes it more difficult to meet these conditions and raises a number of additional issues as well.

#### Reliability

As long as the items in a test are reasonably similar to each other in terms of the constructs they measure, the reliability of scores will generally increase with the number of items in the test. Thus, when items are reasonably similar, the scores from a longer test will be more stable than those from a shorter test. The effect of chance differences among items, as well as the effect of any single item on the total score, is reduced as the total number of items increases.

Embedding an abridged national test in a state test, or abridging the state test and giving it with the national test, would provide efficiency compared with administration of the entire state and national tests, but it achieves that efficiency by using fewer items. Hence, scores earned on the abridged test would not be as reliable as scores earned on the unabridged national test. The short length of the abridged test would also increase the likelihood of misleading differences among jurisdictions. Test reliability is a necessary condition for valid inferences from scores.
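The relationship between test length and reliability is conventionally projected with the Spearman-Brown prophecy formula. The figures below are hypothetical, but they show the general pattern: cutting a test to one-third of its length can drop its projected reliability from 0.90 to about 0.75.

```python
def spearman_brown(full_reliability, length_ratio):
    """Projected reliability of a test shortened (or lengthened) to
    length_ratio times its original length (Spearman-Brown formula)."""
    k, r = length_ratio, full_reliability
    return k * r / (1 + (k - 1) * r)

# Hypothetical numbers: if a 60-item national test has reliability 0.90,
# an embedded subset of 20 items (one-third the length) is projected
# to have reliability of only about 0.75.
print(round(spearman_brown(0.90, 20 / 60), 2))  # 0.75
```

The formula assumes the retained items measure the same construct as the full test; it says nothing about whether the shorter set still represents the content of the blueprint, which is the separate problem taken up next.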
#### Content Representation

No set of embedded items, nor any complete test, can possibly tap all of the concepts and processes included in subject areas as complex and heterogeneous as 4th-grade reading and 8th-grade mathematics in the limited time that is usually available for testing. Any collection of items will tap only a limited sample of the skills and knowledge that make up

the domain. The items in a national test represent one sample of the domain, and the material selected for embedding represents only a sample of the national test. The smaller the number of items used in embedding, the less likely it is that the embedded material will provide a representative sample of the content and skills that are reflected in the national test in its entirety. How well the national test represents the domain, and how well the embedded material represents the national test, can be affected by both design and chance.

The potentially large effect of differences in sampling from a subject area is illustrated by data from the Iowa Tests of Basic Skills (ITBS) for the state of Iowa (see Figure 2-5). Between 1955 and 1977 the mathematics section of the ITBS consisted of math concepts and math problem-solving tests but did not include a separate math computation test. In 1978 a math computation test was added to the battery, but its results were not included in the total math score reported in the annual trend data. The trend data from Iowa for 1978-1998 for grade 8 illustrate clearly how quite different inferences might be made about overall trends in math achievement in Iowa, depending on whether or not math computation is included in the total math score. Without computation included, it appears that math achievement in Iowa increased steadily from 1978 to the early 1990s and has remained relatively stable since. However, when computation is included in the total math score, overall achievement in 8th-grade mathematics appears to have declined steadily from its 20-year peak in 1991.
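A toy calculation, with invented numbers rather than the actual ITBS data, shows how including or excluding one subtest can reverse the apparent direction of a composite trend:

```python
# Toy numbers (invented, not the actual ITBS data) showing how including
# or excluding one subtest can reverse the apparent trend in a composite.
years = [1991, 1994, 1998]
concepts = {1991: 260, 1994: 262, 1998: 262}         # hypothetical scale scores
problem_solving = {1991: 258, 1994: 259, 1998: 260}
computation = {1991: 255, 1994: 248, 1998: 241}      # hypothetical decline

for year in years:
    without_comp = (concepts[year] + problem_solving[year]) / 2
    with_comp = (concepts[year] + problem_solving[year] + computation[year]) / 3
    print(year, without_comp, round(with_comp, 1))
# The composite without computation drifts up (259.0 -> 261.0), while the
# composite with computation declines (257.7 -> 254.3).
```

The same data support "achievement is stable or rising" or "achievement is falling," depending solely on which content the composite samples.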
Similar differences would be expected in the math performance of individuals, individual school districts, or states, depending on whether computation is a major focus of the math curriculum and on how much computation is included in the math test being used to measure performance.

Abridgment of the national test can affect scores even in the absence of systematic decisions to exclude or deemphasize certain content. Even sets of items that are selected at random will differ from each other. Students with similar overall proficiency will often do better on some items than on others. This variation, called "student-by-task interaction," is a fundamental source of unreliability in scores (Gulliksen, 1950; Shavelson et al., 1993; Dunbar et al., 1991; Koretz et al., 1994). Therefore, particularly when the embedded material is short, some students may score considerably differently depending on which sample of items is embedded.

Abridgment could affect not only the scores of individual students, but also the score means of states or districts. As embedded material is

FIGURE 2-5 Iowa trends in achievement in mathematics: 1978-1998, grade 8. SOURCE: Adapted from Iowa Testing Programs (1999).
