
5

Changed NAEP: Use of a Short-Form Version of NAEP

As currently configured, NAEP employs a matrix sampling method for administering items to students (see Chapter 2 ) but does not include the option for administering a fixed test form to large numbers of individuals (Allen, Carlson, & Zelenak, 1998). Implementing such an option will require major changes to the way NAEP test forms are constructed and NAEP results are reported. Such changes are certainly within the realm of possibilities, however. NAGB has active working groups (National Assessment Governing Board, 1999c; National Assessment Governing Board, 2000a) looking into alternate delivery and reporting models for NAEP, and the short form and market-basket concepts originated from the activities of those groups.

This chapter deals explicitly with the short form and addresses two questions: (1) what role might a short form play in providing market-basket results, and (2) how might the short form be used? The chapter begins with a discussion of NAGB's policy and plans for the short form, followed by a description of the ways states and districts might use short forms, based on comments from participants in the committee's workshop on market-basket reporting. The chapter continues with a review of the pilot study of short forms and ends with a discussion of ways to construct short forms.



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.




STUDY APPROACH

During the course of the study, we reviewed policy statements addressing the short form (National Assessment Governing Board, 1996; National Assessment Governing Board, 1999a; National Assessment Governing Board, 1999b; Forsyth et al., 1996) and information on the ETS year 2000 pilot study on market-basket reporting (Mazzeo, 2000). We asked Patricia Kenney, co-director of the National Council of Teachers of Mathematics (NCTM) NAEP Interpretive Reports Project, to review the two short forms developed for the pilot study. We also focused specifically on the short form during our workshop on market-basket reporting and asked participants to discuss their interest in and potential uses for the short form of NAEP (see Chapter 4 for additional details on the workshop).

WHAT ARE NAEP SHORT FORMS AND HOW MIGHT THEY BE USED?

Policy for the NAEP Short Form

In the most recent redesign policy, the short form is cited as a mechanism for simplifying NAEP design, specifically (National Assessment Governing Board, 1999a:7):

Plans for the short-form of the National Assessment, using a single test booklet, are being implemented. The purpose of the short-form test is to enable faster, more understandable initial reporting of results and, possibly, for states to have access to test instruments allowing them to obtain NAEP assessment results in years in which NAEP assessments are not scheduled in particular subjects.

To guide policy and decision making on the measurement issues pertaining to the short forms, NAGB adopted the following principles (National Assessment Governing Board, 1999b):

Principle 1: The NAEP short form shall not violate the Congressional prohibition to produce, report, or maintain individual examinee scores.

Principle 2: The Board shall decide which grades and subjects shall be assessed using a short form.

Principle 3: Development costs, including item development, field testing, scoring, scaling, and linking shall be borne by the NAEP program. The costs associated with use, including administration, scoring, analysis, and reporting shall be borne by the user.

Principle 4: NAEP short forms intended for actual administration should
represent the content of corresponding NAEP assessment frameworks as fully as possible. Any departure from this principle must be approved by the Board.

Principle 5: Since it is desirable to report the results of the short form using the achievement levels, the content achievement level descriptions should be considered during the development of the short form.

Principle 6: All versions of the short form should be linked to the extent possible using technically sound statistical procedures.

The National Assessment Governing Board's Vision and Uses for the Short Form

At the committee's workshop on market-basket reporting, Roy Truby, executive director of NAGB, explained the concept of the NAEP short form, describing it as a short, administrable test representative of the content domain tested on NAEP (Truby, 2000). Results on the short form could be summarized using a percent correct metric. The short form could provide additional data collection opportunities that are not part of the standard NAEP schedule, such as testing in off years or in subjects not assessed at the state level. Truby described how some people envision using a short form:

If short forms were developed and kept secure, they could provide flexibility to states and to jurisdictions below the state level that were interested in using NAEP for surveying student achievement in subjects, grades, and times that were not part of the regular state-NAEP schedule. Once developed, such market-basket forms should be faster and less expensive to administer, score, and report than the standard NAEP, and could provide score distributions without the complex statistical methods on which NAEP now relies. This might help states and others link their own assessments to NAEP, which is another important objective of the Board's redesign policy.

Truby noted that the details associated with these components of the market-basket concept have not yet been thoroughly investigated. Based on the pilot study findings (see Mazzeo, 2000), NAGB might pursue similar studies in other content areas and grades.

Workshop Participants' Visions and Uses for the Short Form

Some school administrators and directors of assessment were attracted to the concept of the short form as a means for obtaining benchmarking data. They envisioned the short form as a test that could be administered to an entire cohort of students (e.g., all fourth grade students in a school or
in a district); short form results could be quickly derived and aggregated to the appropriate levels (i.e., the school or district level). Under this vision for the short form, summaries of short-form results could be compared to those for national NAEP to provide schools and districts with information on how their students' achievement compared with national results. Participants believed that this information would be uniquely useful in assessing students' strengths and weaknesses and in setting goals for improving student achievement.

Some school administrators and assessment directors also envisioned the short form as a set of questions that could be embedded in current assessments as a mechanism for “linking” results from current assessments to NAEP. Under this vision, the set of questions could be administered in conjunction with other state or local assessments. Short form results could be used to enable comparisons between state and local assessments and main NAEP. It is important to point out that the issues associated with establishing linkages between NAEP and state and local assessments were previously addressed by two other NRC committees (National Research Council, 1999a; National Research Council, 1999d), who cited numerous problems with such practices.

Curriculum specialists saw the short form as a way to gather additional information about what is tested on NAEP and how it compares to their instructional programs. The released short form could permit educators and policy makers to have first-hand access to the material included on the test. Their review of the released material could promote discussions about what is tested and how it compares with the skills and material covered by their own curriculum. The secure short form would yield data that could further these discussions. Educators could examine student data and evaluate performance in relation to their local practices. They could engage in discussions about their curricula, instructional practices, and sequencing of instructional material, and could contemplate changes that might be needed.

Participants also liked the idea of having a NAEP test to administer in “off-years” from regular NAEP administrations. Because NAEP does not currently administer every subject to every grade every year, workshop participants believed the short form could help fill the “gaps.” The short form could be given every year, thereby enabling the compilation of yearly trend data. These uses for the short forms are discussed in greater detail in the workshop summary (National Research Council, 2000).
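The benchmarking vision rests on simple aggregation: individual responses are scored, then rolled up so that only school- and district-level percent-correct summaries are ever reported, consistent with the prohibition on individual scores. A minimal sketch of that roll-up (all district names, school names, and scores here are invented for illustration; actual NAEP scoring and sampling weights are far more complex):

```python
from statistics import mean

# Hypothetical records: (district, school, number correct out of 30 items).
# All names and numbers are invented for illustration.
results = [
    ("district_a", "school_1", 21), ("district_a", "school_1", 24),
    ("district_a", "school_2", 18), ("district_a", "school_2", 27),
    ("district_b", "school_3", 15), ("district_b", "school_3", 22),
]
N_ITEMS = 30

def percent_correct(scores):
    """Mean percent-correct for a group of students."""
    return 100.0 * mean(scores) / N_ITEMS

# Aggregate to the school level so that no individual score is reported.
schools = {}
for district, school, score in results:
    schools.setdefault((district, school), []).append(score)

for (district, school), scores in sorted(schools.items()):
    print(district, school, round(percent_correct(scores), 1))
```

The same dictionary-of-groups pattern extends one level up to district summaries; the point is only that the reporting unit is the group, never the student.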

Workshop Participants' Concerns About the Short Forms

Some workshop speakers challenged the premises behind the various uses for the short form. Several questioned how comparable scores on the short forms would be to results from NAEP. As described in Chapter 2, NAEP uses complex procedures for deriving score estimates (including the conditioning and plausible values methodologies). If short form results were provided quickly and without the complex statistical methods, results from the short form would not be conditioned; hence short form results would not be comparable to the regular NAEP-scale results.

Comparisons between short form results and state or local assessment results also might not yield the type of information desired. State and local assessments are part of an overall program in which curricula, instruction, and assessments are aligned. However, alignment may not extend to the NAEP frameworks, and the short form might test areas not covered by the curriculum. While it might be enlightening to compare NAEP's coverage with local curricula, testing students on material they have not been taught presents problems for interpreting the results.

Student motivation would also factor into performance on the short form. State and local assessments tend to be higher-stakes exams that carry consequences. At present, NAEP is not a high-stakes test. Administration of the short form as part of a high-stakes assessment would change the context in ways that could affect the comparability between results on the short form and the regular NAEP results.

The prohibition against individual results was also cited as problematic. The short form could be administered in a manner closely resembling other testing—testing that results in individual score reports. Although individual results would be generated initially from the short form, they would need to be aggregated for reporting purposes. Participants felt that this prohibition would be difficult to explain. These concerns about the short form are discussed in detail in the workshop summary (National Research Council, 2000).

Review of the Pilot Short Forms

As explained in Chapter 4, ETS prepared two fourth grade mathematics short forms as part of the year 2000 pilot study. One of the two pilot short forms contains 31 items and the other 33. These items were intended to represent NAEP's existing fourth grade mathematics item pools. During
the workshop, Bock (2000) estimated that the reliability of the short form would be likely to fall in the low .80 range. While this might be considered acceptable, the more pertinent concern for the short form is not reliability but generalizability. That is, would performance on the short form support inferences about performance on the larger domain of mathematics?

For the workshop, the committee asked Patricia Kenney to consider the feasibility of creating short forms for fourth grade mathematics and the extent to which the developed short forms were representative of what NAEP tests. Kenney reported that the short forms appeared to represent the general content strands and the item types in the frameworks. However, she questioned whether the forms covered the full range of cognitive processes the framework describes, as well as all of the 56 topics and subtopics covered by the NAEP frameworks. Kenney questioned the extent to which approximately 30 items would be able to adequately represent the frameworks at the topic or subtopic level (Kenney, 2000).

Additionally, NAEP items can be administered at more than one grade level. Because NAEP results are not reported at the student level, there is no disadvantage to assessing students on topics that they may not have studied. The problem with these “grade overlap” items, however, is that they might be misinterpreted as NAEP grade-appropriate expectations. Considering the uses cited above for the short forms, Kenney was concerned about how these grade overlap items would be regarded (Kenney, 2000).

THE DESIRED CHARACTERISTICS OF A SHORT FORM

Given the alternative visions and uses described above, we can now consider options for constructing and implementing short form NAEP. NAGB policy (National Assessment Governing Board, 1999b) states that “NAEP short forms intended for actual administration should represent the content of corresponding NAEP assessment frameworks as fully as possible” (Principle 4). This statement implies that NAGB's intent is to produce short forms that are samples of the domain represented by the framework. While it does not seem to be the intent to represent the current NAEP item pool, or to create scales, the short form needs to be capable of providing estimates of the true score distribution that is the target for full NAEP. That distribution is needed to support policy Principle 5, “Since it is desirable to report the results of the short form using the achievement levels, the content achievement level descriptions should be considered during the development of the short form.” Reporting results according to the achievement levels requires an accurate estimation of the proficiency distribution. Estimation of the distribution requires the specification of a scale.

NAGB policy does not provide any further guidance about the desired technical characteristics for the short form. While the generality of policy statements is appropriate so that developers are not limited in the approaches they consider for putting policy into practice, the lack of detail allows a variety of interpretations. For example, the state and district test directors imagine a short form of 10 to 15 items that can be embedded in their tests as anchors to link their tests to the NAEP scale (O'Reilly, 2000). The ETS-produced pilot versions contain 31 and 33 items (Mazzeo, 2000)—twice the length imagined by the test directors. Since the ETS pilot short form was limited by other constraints, additional conceptions would also be feasible.

The NAGB materials (National Assessment Governing Board, 1999b) and the discussions at the workshop (National Research Council, 2000) imply the following specifications for the short form:

1. The short form should represent the NAEP framework.

2. The short form should be at least somewhat consistent with the achievement level descriptions.

3. It should be possible to aggregate the data from the short forms to provide good estimates of mean performance for subgroups of the student population.

4. It should be possible to estimate the proportion above the achievement level cutscores. This implies that the short form can support estimation of the distribution of scores on the NAEP scale.

5. It should be possible to compare results from alternate forms of short forms for a curriculum area, which implies that the short forms are to be put on a common score scale, perhaps through an equating process.

6. Some would like to use the short form as an anchor test for connecting other testing programs to the NAEP reporting scale. This use is not addressed by the policy for the short form.

These specifications present a challenging development task because the short form will necessarily have different psychometric characteristics than the full set of current NAEP items or any one of the NAEP booklets. Successful accomplishment of this development task depends on the degree
to which each requirement must be met. For example, if the level of accuracy of a mean estimate from the short form does not have to be as great as that for the full NAEP, then requirement 3 can probably be met. However, if the level of accuracy of the mean estimates must be the same as for full NAEP, then the design of the short form and its administration plan will be very challenging. To assist NAEP's sponsors with these difficult issues, we consider them next.

MEETING THE DESIRED SPECIFICATIONS FOR THE SHORT FORMS

Representing the Frameworks

The first requirement is that the short form should represent the NAEP framework. However, “represent” is open to multiple interpretations. One interpretation is a formal statistical sampling from a population. If every item in the domain had an equal chance of being sampled, the resulting sample would represent the entire population. The short form could then either represent the domain (i.e., the framework) or the current NAEP pool. These are not synonymous because the current NAEP pool may not “represent” the NAEP framework in any statistical sense; that is, the items in the NAEP pool are not a random sample from the domain. A short form could be constructed to “represent” the current NAEP pool in a statistical sense by randomly sampling items from that pool. Such a sample might not include items from every content specification category, but it would be an unbiased statistical sample and would therefore represent the larger set of items.

A more general interpretation of “represent” might be that the short form provides “examples” of the types of tasks required by NAEP. Under this interpretation, the NAEP item pool would be considered to represent the framework, and any set of items that assesses the skills listed in the framework would represent the framework by example. Because the framework is very broad, it would be impossible to present sample items for every type of skill and knowledge in the framework. Thus, for practical reasons, the short form's representation of the framework must be incomplete, and the short form would represent the framework less well than the full NAEP pool represents the framework.

An even looser interpretation of “represent” could be that the items on a short form provide selected examples of the kinds of items developed
from the framework. The items in the NAEP pool could be sorted according to the skills required by the achievement level descriptions to help meet requirement 2. While these sortings would not be perfectly reliable, they could support a loose definition of representation. Either a stratified sample of 30 items could be drawn from that pool, or a carefully reasoned sample could be selected to produce a descriptive example of the pool. If the current NAEP items cannot be used, new items could be produced that measure skills consistent with the frameworks document. All of these options meet a loose definition of “represent.”

Approaches to Constructing Short Forms

The Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999) presents guidelines to be followed in constructing a short form version of a longer test. Specifically, Standard 3.16 states:

If a short form of a test is prepared, for example, by reducing the number of items on the original test or organizing portions of a test into a separate form, the specifications of the short form should be as similar as possible to those of the original test. The procedure for reduction of items should be documented.

Given these guidelines, we describe two procedures that could be used to develop short forms.

Domain Sampling Approach

Given that the goal of NAEP is to assess the knowledge and skills of fourth, eighth, and twelfth grade students in the areas defined by the frameworks, the measurement model that seems most appropriate to this task is domain sampling (Nunnally, 1967:175). If a domain sampling approach were to be used, the NAEP framework would define the domain, and the goal of test development would be to produce an instrument that contains tasks that are an appropriate sample from that domain. Ideally, the framework would be translated into specifications that clearly delimit the types of items included in the domain. With this approach, developers would produce many items that represented the domain, and forms would be developed by sampling from the
set of items. For the purposes of the present discussion, the full set of all NAEP items included in all forms given to students during an operational NAEP administration will be considered the long form. It cannot be considered a long form in the usual sense, because no student would take all of the items. However, the “long form” would define the score scale for reporting NAEP results. The short form would simply be a test containing fewer items than the long form.

Under a domain sampling approach, a short form of NAEP could be developed by selecting a smaller sample of items than for the long form. This process for creating a short form would address Standard 3.16 because the specifications for the domain are the same for both the short and long forms. If formal statistical sampling procedures were used, both the long form and the short form would represent the full domain, but to different degrees of accuracy.

The NAEP item and form development process has not been as formal as the domain sampling model. A large pool of items has been produced to match the content and cognitive skills described in each framework document, but the items that have been produced were not intended to be a statistically representative sample from the domain (Allen, Carlson, & Zelenak, 1998). The framework documents do not define clear boundaries for the domain (Forsyth, 1991), and no criteria are given for determining whether or not an item is a part of the domain. At best, the items in a set of NAEP booklets for a content area can be considered a sample from the domain, but a sample with unknown statistical properties. Hence, construction of a short form becomes more challenging than merely taking a statistical sample from a well-defined pool of items.

Because the NAEP forms are an idiosyncratic sample from the domain, the best approach from a domain sampling perspective is to select a sample of items from the current set of items. The resulting sample would be representative of the items on a current set of NAEP forms, but would not necessarily be representative of the full domain. A stratified random sampling plan could be used to make sure that important content strands are proportionally represented.

Scale Construction Approach

An alternative procedure might be based on the trait estimation approach commonly used in psychology (McDonald, 1999), which defines a hypothetical construct and then selects test items estimated to be highly
correlated with the construct. While the resulting set of items defines a scale for the construct, there is no intention to define a domain of content or to sample from the domain. The test development process is considered effective as long as the set of items rank orders individuals on the scale for the hypothetical construct. Employing this approach with NAEP would imply that NAEP's purpose is to place students along one or more continua based on their responses to the test items. The items would be selected to define scales rather than to represent the domain. To be consistent with the requirements of Standard 3.16, the short form would have to define the same scales as the full NAEP.

Precision of Measurement

Either approach to developing a short form would result in a test with different measurement properties than a “long” form. For instance, scores from the short form will have less precision of measurement than a test consisting of the full set of current NAEP items. The comment to Standard 3.16 addresses the differences in measurement properties and calls for their documentation:

The extent to which the specifications of the short form differ from those of the original test, and the implications of such differences for interpreting the scores derived from the short form, should be documented.

One clear difference between the short form and the long form is that scores from the short form will have a different reliability and standard error structure[1] than those from the full NAEP pool, even though the short form and full NAEP provide samples from the same domain of content (National Research Council, 2000). If the domain sampling approach is used, the short form will result in greater sampling error than full NAEP because a smaller sample is taken from the content domain. Although both sets of items (test forms) would represent the domain, and both would measure the same constructs, the smaller sample would have larger estimation error.

[1] Standard error structure refers to the pattern of conditional standard errors of measurement at different points on the reporting score scale. Because of the different lengths of the two forms, the conditional standard errors will certainly not be the same at every point on the score scale.
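The interplay among test length, reliability, and measurement error can be illustrated with the classical Spearman-Brown prophecy formula and the standard error of measurement. In this sketch, the reliability value echoes Bock's rough estimate for a roughly 31-item form, while the score-scale standard deviation is an invented figure for illustration, not the actual NAEP scale:

```python
import math

def spearman_brown(rho, k):
    """Predicted reliability when test length changes by factor k."""
    return k * rho / (1.0 + (k - 1.0) * rho)

def sem(sd, rho):
    """Classical standard error of measurement: sd * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - rho)

RHO_SHORT = 0.80  # Bock's rough reliability estimate for a ~31-item form
SD = 35.0         # assumed score-scale standard deviation (illustrative)

print(f"SEM of short form:      {sem(SD, RHO_SHORT):.1f}")
print(f"reliability if doubled: {spearman_brown(RHO_SHORT, 2.0):.3f}")
print(f"reliability if halved:  {spearman_brown(RHO_SHORT, 0.5):.3f}")
```

Doubling a 0.80-reliability form raises the predicted reliability to about 0.89; halving it drops it to about 0.67. That is one reason the 10-to-15-item anchor imagined by the test directors would carry substantially more measurement error than the pilot forms.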

Under the scale construction approach, the content framework determines the number of scales that need to be considered. For example, NAEP Mathematics reports scores based on five scales (National Assessment Governing Board, 2000) that are combined using weights to form a composite used for reporting. When the test is shortened, the number of scales would remain the same, but fewer items would be used to define the scales. Because the scale of measurement for the short form would be defined with less fine gradations than the scale defined by the full set of items, scores would be estimated with less precision of measurement.

The relative standard errors of measurement for the short form and the full NAEP must be compared carefully. In the matrix sampling design used by NAEP, the standard error of measurement for a student is large for long form NAEP—possibly larger than the standard error of measurement for a hypothetical short form. However, estimates of population parameters, such as the population mean and standard deviation, are based on the full set of items and the full sample of students, and they use collateral background information to “condition” the estimation process (see Chapter 2). Consequently, the estimation of population parameters should be much more precise for full NAEP than for a short form, even though the short form might yield smaller measurement error for a student's score if individual scores were permitted to be generated for NAEP.

Technical Requirements for a Short Form

The technical requirements for a short form are very challenging. Requirements 3 and 4 suggest that the short form allow estimation of means and percentages of distributions on the NAEP scale. This implies that the short form would produce scores on the same composite of skills as the full NAEP pool. This is also required by Standard 3.16. Producing a short form that will result in scores that fulfill the statistical requirements will require careful matching of the content and statistical characteristics of the items on the short form to the NAEP item pool. This can best be done using multidimensional procedures to select items that create the desired composite score and a score distribution that is similar to that from the full NAEP sample. In theory, this could be accomplished using the full set of tools available from IRT and computerized test assembly methodologies. Even with those tools, however, the test assembly process will be difficult, and it will be necessary to confirm that the desired composite of abilities is assessed.
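Full computerized test assembly optimizes item selection against many content and statistical constraints at once. As a much-simplified stand-in, the stratified sampling idea discussed above can be sketched as drawing items from each content strand in proportion to that strand's share of the pool. The strand names below follow the NAEP mathematics framework, but the pool sizes and item identifiers are invented:

```python
import random

# Hypothetical pool: item ids grouped by content strand. The strand names
# come from the NAEP mathematics framework; the item counts are invented.
pool = {
    "number_sense":      [f"ns_{i}" for i in range(40)],
    "measurement":       [f"me_{i}" for i in range(25)],
    "geometry":          [f"ge_{i}" for i in range(25)],
    "data_analysis":     [f"da_{i}" for i in range(20)],
    "algebra_functions": [f"al_{i}" for i in range(20)],
}

def stratified_short_form(pool, n_items, rng):
    """Draw a short form whose strand proportions mirror the full pool.

    Proportional allocation: rounding can leave the total an item or
    two off the target length.
    """
    total = sum(len(items) for items in pool.values())
    form = []
    for strand, items in pool.items():
        n_strand = round(n_items * len(items) / total)
        form.extend(rng.sample(items, n_strand))
    return form

form = stratified_short_form(pool, 30, random.Random(0))
print(len(form), "items selected")
```

With this invented 130-item pool, a 30-item target yields a 31-item form (9, 6, 6, 5, and 5 items per strand) because of rounding, a small reminder that even the simplest assembly scheme will not hit every constraint exactly.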

CONCLUSIONS AND RECOMMENDATIONS

The committee's review of the materials on the short form concept indicates that NAGB and potential consumers of short form results have varying conceptions of the short form. Some (McConachie, 2000; O'Reilly, 2000) believe the short form should function as an anchor test that can be used to link various types of assessments to NAEP so the results can be reported on the NAEP score scale. Others (Mazzeo, 2000; Truby, 2000) view the short form as a mechanism for implementing market-basket reporting or as a way of facilitating district-level reporting and providing more responsive reporting of NAEP results (O'Reilly, 2000; Truby, 2000). These differing views about the short form make it difficult for the committee to make specific recommendations because so many details have yet to be decided. Nevertheless, the conception of many workshop participants that the short form could be used as an anchor to put state assessment results on the NAEP scale is not likely to be tenable. The difficulties associated with attempts to achieve such links among assessments have been documented in previous reports by other NRC committees (National Research Council, 1999a; National Research Council, 1999d).

CONCLUSION 5-1: Thus far, the NAEP short form has been defined by general NAGB policy, but it has not been developed in sufficient technical and practical detail that potential users can react to a firm proposal. Instead, users are projecting into the general idea their own desired characteristics for the short form, such as an anchor for linking scales. Some of their ideas and desires for the short form have already been determined to be problematic. It will not be possible for a short form design to support all uses described by workshop participants.

The most positive result that can be expected from attempts at short form construction is that the short form is shown to measure the same composite of skills and knowledge as the full NAEP pool and that the distribution of statistical item characteristics is such that the shape of the estimated score distribution will be similar, though not identical, to that for current NAEP. The distribution will probably not be exactly the same because of differences in the error distribution that result from using a shorter test. The practical result is that the mean scores estimated from the
short form will probably have larger standard errors than those from the full NAEP and that the estimates of proportions above the achievement level cutscores will also contain more error. The results from the short form will probably look different from those from full NAEP, even if exactly the same students took both types of tests. The differences in error will add “noise” to the results of the two types of tests in different ways. Comparisons of short form and full NAEP results will not be easy, even for technically sophisticated consumers.

The fact that the two sets of results are not directly comparable does not mean that the short form might not be useful. It does mean, however, that the differences in interpretation must be made clear to avoid confusion. One way would be to use different score scales and to report short form scores as estimates of the proportion of the full NAEP pool that students would get correct, rather than as scores on the NAEP score scale. In this case, the error in estimates could be indicated with error bars or other reporting methods. Use of different score scales would preclude making direct comparisons, but the short form may still have value as a more frequent monitor of student capabilities. However, it is worth restating here that, to many workshop participants, being able to make comparisons with main NAEP was one of the more appealing features of the short form.

CONCLUSION 5-2: The method selected for producing a short form will likely result in a test that has a different reliability (error structure) than the full NAEP, resulting in different estimates of the score distribution than the full NAEP. As a result, the short form will likely give different numerical results than the full NAEP, even if the samples of students were identical.

RECOMMENDATION 5-1: Before attempting to use a short form version of NAEP to estimate results on the current NAEP scale, the differences in the psychometric characteristics of the scores from the short form and current NAEP should be carefully investigated.

RECOMMENDATION 5-2: Before proceeding with the short form, it should be determined whether it is possible to obtain estimates of NAEP score distributions from the short form
that will provide estimates of proportions above achievement levels and means for subgroups of the examinee population that are of similar accuracy to those from current NAEP.

RECOMMENDATION 5-3: If the decision is made to proceed with the short form, methods should be developed for reporting performance on the short form in a way that is meaningful and not misleading, given the differences in quality of estimates for current NAEP and the short form.
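The accuracy question behind Recommendation 5-2 can be illustrated with a small simulation: when scores carry more measurement error, the estimated proportion above an upper-tail cutscore is systematically inflated, not merely noisier. Everything here is illustrative (a standard normal latent scale and arbitrary error levels, not the actual NAEP scale or achievement levels):

```python
import random
import statistics

rng = random.Random(12345)

CUT = 1.0          # hypothetical cutscore on a standard normal latent scale
N_STUDENTS = 1000  # students per simulated administration
N_REPS = 100       # simulated administrations per condition

def estimated_proportion_above(error_sd):
    """Mean estimated proportion above the cut when each observed score is
    true proficiency plus normal measurement error with the given SD."""
    props = []
    for _ in range(N_REPS):
        above = sum(
            1
            for _ in range(N_STUDENTS)
            if rng.gauss(0.0, 1.0) + rng.gauss(0.0, error_sd) > CUT
        )
        props.append(above / N_STUDENTS)
    return statistics.mean(props)

# True proportion above the cut is about 0.159. More measurement error
# (as on a shorter test) pushes the estimate further above that value.
print(f"small error (long test):  {estimated_proportion_above(0.3):.3f}")
print(f"large error (short form): {estimated_proportion_above(0.6):.3f}")
```

Analytically, the expected estimate is 1 − Φ(1/√(1 + error²)): about 0.169 for an error SD of 0.3 and about 0.196 for 0.6. The two tests therefore disagree even with identical examinees, which is the substance of Conclusion 5-2.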