The Perspectives of NAEP's Sponsors and Contractors
In preparation for the market-basket workshop, the Committee on NAEP Reporting Practices asked NAEP's sponsors and contractors to respond to a series of questions regarding components of the market-basket concept. Specifically,
What are the primary objectives for market-basket administration, market-basket reporting, and the short form?
Who are the proposed users of market-basket materials and the short form?
What types of inferences are expected to be supported by short-form and market-basket results?
What is the status of research and development work on the market basket and short form?
What are the Board's plans for pursuing work on the market basket/short form—with regard to the 2000 assessment and beyond?
Representatives from the sponsoring agencies (NAGB and NCES) and contracting agency (ETS) responded to these questions by preparing papers prior to the workshop. At the workshop, each representative made an oral presentation of the material covered in his paper. The committee asked workshop discussants to respond to papers as well as to the oral presentations. A summary of each paper and presentation is provided below to give
A MARKET BASKET FOR NAEP: POLICIES AND OBJECTIVES OF THE NATIONAL ASSESSMENT GOVERNING BOARD
During his presentation, Roy Truby, executive director of NAGB, explained the rationale for exploring the market-basket concept. Truby reminded the audience of the overall purpose of the NAGB redesign adopted in August 1996: to enable NAEP to assess more subjects more frequently, release reports more quickly, and provide information to the general public in a form that is readily understood. With these goals in mind, NAGB began considering alternatives including a NAEP market basket.
Under one alternative, students' results on a representative set of NAEP test items would be presented using percent correct scores (like scenario one in Figure 1). According to Truby, reporting NAEP results using a percent-correct metric would be more understandable for the general public and would allow for more timely reporting of NAEP results. Furthermore, the released items would be representative of the NAEP frameworks and would provide more clarity to the public about the content and skills tested by NAEP.
A second alternative involves the construction of a short, administrable NAEP test, the “short form,” that would be representative of the content domain tested on NAEP (like scenario two in Figure 1). Results on the short form could be summarized using a percent-correct metric. The short form would provide additional data collection opportunities to state-NAEP users that are not part of the standard NAEP schedule, such as testing in off years or in other subjects not assessed at the state level. Truby described how some people envision using a short form:
If short forms were developed and kept secure, they could provide flexibility to states and any jurisdiction below the state level that were interested in using NAEP for surveying student achievement in subjects, grades, and times that were not part of the regular state-NAEP schedule. Once developed, such market-basket forms should be faster and less expensive to administer, score, and report than the standard NAEP, and could provide score distributions without the complex statistical methods on which NAEP now relies. This might help states and others link their own assessments to NAEP, which is another important objective of the Board's redesign policy.
Truby noted that NAGB has approved a policy for “market-basket reporting” and has approved a pilot for a “market-basket short form,” but added that the details associated with these components of the market-basket concept have not yet been thoroughly investigated.
Truby concluded by explaining that ETS is currently investigating the market-basket concept by conducting a pilot study in grade four mathematics as part of NAEP 2000. This study involves preparation of NAEP short forms (scenario two in Figure 1). Details of the study are described below (see section entitled “NAEP's Year 2000 Market Basket Study: What Do We Expect to Learn?”). Based on the findings from the pilot study, NAGB might pursue similar studies in other content areas and grades.
SIMPLIFYING THE INTERPRETATION OF NAEP RESULTS WITH MARKET BASKETS AND SHORTENED FORMS OF NAEP
Andrew Kolstad, senior technical advisor for the assessment division of the National Center for Education Statistics, traced the history of NAEP's reporting methods. During the 1970's, NAEP reported its results in terms of the percentage of students who correctly answered each test item (item p-values) as well as the average percent of items answered correctly (average percents correct). Since many items were released along with information on the percentages of students who answered them correctly, reporting item p-values offered specific and concrete information to data users. While this procedure gave data users a good sense of what was covered on the test and how students performed, it had at least two drawbacks: first, it required the development of a substantial number of new items in each assessment cycle in order to replace those released; and second, data users had a hard time understanding the overall picture of student performance on collections of items.
Reporting the average percent correct over a set of items helped to overcome the second problem because this gave an overall picture of student performance. Nevertheless, several drawbacks remained. First, this method provided only one piece of information about the performance distribution, namely, the mean percent correct, and did not provide any information about the rest of the performance distribution. Second, the percent-correct summary statistic also suffered from the limitation that if the set of items changed, then the average percent correct would also change. In other words, if the sample of items was relatively easy, the average percent correct would be higher than if the sample of items was relatively hard. This created interpretation problems, particularly with interpretations of trends in performance. The composition of items changed from one assessment to the next as items were dropped from the assessment pool (because they had been released). Third, making generalizations about students' performance on a fixed collection of administered items to their expected performance on other non-administered items, albeit items from the same frameworks, was problematic. The idea that test questions were sampled from a pool of potential items was not yet formalized.
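The dependence of the average percent correct on the particular items chosen can be illustrated with a small sketch. The response matrix below is hypothetical (it is not NAEP data), and the function names are introduced here only for illustration.

```python
def item_p_values(responses):
    """Percent of students answering each item correctly (one value per item)."""
    n_students = len(responses)
    n_items = len(responses[0])
    return [100.0 * sum(row[j] for row in responses) / n_students
            for j in range(n_items)]


def average_percent_correct(responses, items=None):
    """Average of the item p-values over a chosen subset of items."""
    p_values = item_p_values(responses)
    chosen = list(range(len(p_values))) if items is None else list(items)
    return sum(p_values[j] for j in chosen) / len(chosen)


# Hypothetical scored responses (1 = correct): 4 students by 4 items.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
]

p_values = item_p_values(responses)                    # [100.0, 75.0, 50.0, 25.0]
all_items = average_percent_correct(responses)         # 62.5
easy_items = average_percent_correct(responses, [0, 1])  # 87.5
hard_items = average_percent_correct(responses, [2, 3])  # 37.5
```

The same four students appear to perform very differently depending on which items are summarized, which is the interpretation problem described above.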
In the early 1980's, ETS became the test development contractor for NAEP. ETS began using item response theory (IRT)1 scaling, which alleviated many problems of interpretation deriving from the practice of reporting percent-correct scores for subsets of items. Item response theory scales items according to the probability of a correct answer, given the proficiency level of the examinee and the item's discriminating power, difficulty, and susceptibility to examinee guessing. It relies on assumptions that, if met, result in proficiency estimates that, theoretically, are not dependent on the particular subset of items administered and that yield item parameter estimates that are relatively independent of the group of students taking the items.
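The three item characteristics named above (discriminating power, difficulty, and susceptibility to guessing) correspond to the parameters of what is commonly called the three-parameter logistic model. The workshop summary does not give a formula, so the sketch below is an assumption about the usual form of that model; the parameter names a, b, and c and the scaling constant D = 1.7 are conventions, not values taken from NAEP.

```python
import math


def prob_correct(theta, a, b, c, D=1.7):
    """Probability of a correct answer under a three-parameter logistic model.

    theta: examinee proficiency
    a: item discrimination, b: item difficulty, c: guessing (lower asymptote)
    D: conventional scaling constant (assumed here, not from the source)
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


# Hypothetical item: when proficiency equals the item difficulty, the
# probability is halfway between the guessing level and 1.
p_at_difficulty = prob_correct(theta=0.0, a=1.0, b=0.0, c=0.2)  # 0.6
```

Because the probability depends on proficiency and item parameters separately, a proficiency estimate does not, in principle, depend on which items happened to be administered.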
With the introduction of IRT scaling, average IRT-based scale scores replaced average percent correct scores for NAEP reporting. However, many data users regard IRT-based scale scores as substantially less interpretable than percent correct scores. While NAEP still releases a few items as illustrative of the assessment, a substantially smaller proportion of items are released (reducing development costs). Also, item p-values have been replaced by IRT-based item mapping. Item mapping provides an interpretation of the relative difficulty of test items, as well as of the performance of examinees relative to items of differing difficulty. However, the item mapping procedure has been subject to controversy because it requires somewhat arbitrary decisions about the probability thresholds used.
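The role of the probability threshold in item mapping can be sketched as follows. This assumes a three-parameter logistic item model with conventional parameter names and scaling constant; the threshold values and item parameters are hypothetical and are not the ones NAEP uses.

```python
import math


def prob_correct(theta, a, b, c, D=1.7):
    """Three-parameter logistic probability of a correct answer (assumed model)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def mapped_location(a, b, c, threshold, D=1.7):
    """Proficiency at which the probability of a correct answer equals threshold.

    Obtained by solving prob_correct(theta) = threshold for theta.
    Requires c < threshold < 1.
    """
    if not c < threshold < 1.0:
        raise ValueError("threshold must lie between the guessing level and 1")
    return b + math.log((threshold - c) / (1.0 - threshold)) / (D * a)


# Hypothetical item mapped under two different thresholds.
item = {"a": 1.0, "b": 0.0, "c": 0.2}
location_low = mapped_location(threshold=0.6, **item)
location_high = mapped_location(threshold=0.8, **item)
```

The same item lands at a higher point on the scale under the stricter threshold, which is why the choice of threshold affects how items are described.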
Kolstad believes that the use of market-basket reporting and percent correct scores in conjunction with IRT-based scaling, as supported by NAGB and suggested in an NRC report, Grading the Nation's Report Card (National Research Council, 1999b), could improve understanding of NAEP reporting. Kolstad's conception of market-basket reporting is one in which IRT-based scaling would be used to project the expected percent correct on a market basket of items (scenario one in Figure 1), an approach that does not require the actual administration of those items (provided that the IRT-based item parameters are known).
1 Item response theory is a statistical model that calculates the probability each student will get a particular item correct as a function of underlying ability; for further discussion of IRT modeling, see Lord (1980).
Kolstad pointed out that the proposed market-basket reporting of expected percent correct scores on a market-basket collection of items is better than the average percent correct used in the early days of NAEP for several reasons. One is that the IRT-based approach would include publication of the market-basket set of items that constitute the pool of questions, which could improve understanding of item content. Because the items need not be administered during each assessment cycle in order to be used for this kind of reporting, developmental costs would be minimized. Furthermore, IRT-based projections can differentiate between performance on easy and hard test questions. If the difficulty composition of the items in the market-basket set changes, the results can be appropriately adjusted through the use of IRT-based projections. Unlike the use of average percents correct in the early days of NAEP, the use of IRT-based projections of expected percent correct on a market basket of items enables prediction of performance on other items from the same framework that did not happen to be included in NAEP's assessment instrument.
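A minimal sketch of an IRT-based projection of expected percent correct follows. It assumes a three-parameter logistic item model and uses hypothetical item parameters for two made-up baskets; it is not the procedure ETS or NCES would actually apply.

```python
import math


def prob_correct(theta, a, b, c, D=1.7):
    """Three-parameter logistic probability of a correct answer (assumed model)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def expected_percent_correct(theta, basket):
    """Expected percent correct on a basket for an examinee at proficiency theta.

    basket: list of (a, b, c) item parameter tuples, assumed already known.
    """
    probs = [prob_correct(theta, a, b, c) for (a, b, c) in basket]
    return 100.0 * sum(probs) / len(probs)


# Hypothetical baskets with different difficulty compositions.
easy_basket = [(1.0, -1.5, 0.2), (0.8, -1.0, 0.2), (1.2, -0.5, 0.2)]
hard_basket = [(1.0, 0.5, 0.2), (0.8, 1.0, 0.2), (1.2, 1.5, 0.2)]

theta = 0.0  # the same proficiency is projected onto each basket
on_easy = expected_percent_correct(theta, easy_basket)
on_hard = expected_percent_correct(theta, hard_basket)
```

The projected percent correct differs between baskets while the underlying proficiency is unchanged. That is the sense in which results can be adjusted when the difficulty composition of the basket changes, and no student needs to have taken the basket items for the projection to be computed.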
Kolstad believes that focus groups and empirical studies should be conducted to verify that the market-basket metric—expected percent correct—is indeed simpler for consumers to understand. Kolstad also cautioned that invalid inferences about achievement-level performance would be drawn from empirical average percent correct scores, unless they are based on IRT projections, and suggested careful consideration of potential misinterpretations.
EVIDENTIARY RELATIONSHIPS AMONG DATA-GATHERING METHODS AND REPORTING SCALES IN SURVEYS OF EDUCATIONAL ACHIEVEMENT
During his presentation, Robert Mislevy, distinguished research scholar with ETS, laid the conceptual groundwork for the technical and measurement issues involved in market-basket reporting. Mislevy distinguished between data collection methods and data reporting methods. To Mislevy, the term data collection methods refers to the means of gathering performance data, including information that bears on the test questions comprising the market basket. The term data reporting methods refers to the mechanisms used for translating performance data into a reporting metric, including performance on a market basket of items.
Mislevy described five approaches for collecting data: (1) a single test form, (2) parallel test forms, (3) tau-equivalent test forms, (4) congeneric test forms, and (5) arbitrary test forms. The first two—a single form and parallel forms—are the formats typically associated with testing programs. Under a single test form approach, one form of the test is developed, and all students take the same form. Under a parallel test forms approach, multiple equivalent forms are developed. The forms contain different items but are sufficiently similar to be considered interchangeable. They contain the same kinds and numbers of items, tap the same mix of underlying skills, are used for the same purposes, and are administered under similar conditions. Forms that are considered parallel have equal raw score means, equal standard deviations, and equal correlations with other measures for any given population.
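Two of the criteria for parallel forms, equal raw score means and equal standard deviations, can be screened descriptively. The sketch below is only an informal check with an arbitrary tolerance and hypothetical scores. It omits the third criterion (equal correlations with other measures) and is not a formal statistical test of parallelism.

```python
from statistics import mean, stdev


def compare_forms(scores_a, scores_b):
    """Differences in raw score mean and standard deviation between two forms."""
    return {
        "mean_diff": mean(scores_a) - mean(scores_b),
        "sd_diff": stdev(scores_a) - stdev(scores_b),
    }


def roughly_parallel(scores_a, scores_b, tolerance=0.5):
    """True if means and SDs agree within an (arbitrary) tolerance in score points."""
    diffs = compare_forms(scores_a, scores_b)
    return (abs(diffs["mean_diff"]) <= tolerance
            and abs(diffs["sd_diff"]) <= tolerance)


# Hypothetical raw scores from comparable groups of students.
form_a = [18, 22, 25, 20, 15]
form_b = [17, 23, 24, 21, 15]   # similar mean and spread
form_c = [12, 16, 19, 14, 9]    # same spread, but a harder form

a_vs_b = roughly_parallel(form_a, form_b)  # True
a_vs_c = roughly_parallel(form_a, form_c)  # False
```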
Test forms that are either tau-equivalent or congeneric measure similar constructs but do not meet the stringent criteria of parallel forms. Tau-equivalent forms are closely related but not strictly parallel. For example, they may have the same mix of items but may differ with regard to the numbers of items. Congeneric forms are less closely related and, for example, may include the same essential mix of knowledge and skills but may differ in terms of the number, difficulty, and sensitivities of the items included.
Arbitrary forms are only generally related to the same content domain, and, for example, may differ considerably as to the mix, number, format, or content of items. While arbitrary forms may be similar with respect to timing, balance, or other characteristics, they have not been constructed to be parallel, tau-equivalent, or congeneric. For instance, one arbitrary form may focus on multiple-choice items while another may primarily use constructed response items.
Mislevy drew distinctions among three reporting metrics: the observed score metric, the true score metric, and the latent trait metric. The observed score metric is based on a simple tally of the number of right answers or the number of points received. Observed scores can quickly be converted to a percent correct scale by dividing the number correct score by the total number of questions or points. However, observed scores have the problem, mentioned by Kolstad, of being tied to the composition and difficulty of the particular test form.
Reporting on a true score metric involves making a transformation from the observed score to the expected or predicted distribution of an individual's true score (it is a predicted score, since an individual's true score is never known). There are a number of advantages to reporting on an IRT-based true-score scale since such scores can be placed on a percent correct scale. However, given that reporting on a true score metric means working with predictive distributions of individuals' true scores, the transformation is much more complex. In particular, there is no one-to-one mapping between an observed score and an expected score.
Finally, the latent trait metric refers to the IRT-based proficiency estimates. Using this metric requires estimation of the latent trait distribution. While this process involves a complicated transformation from observed scores, it has the advantage that, when IRT assumptions are met, the distributions are not content specific. Further, the latent trait distributions could be transformed to an expected percent correct metric. NAEP currently estimates latent trait distributions that are converted to scaled score distributions for reporting. Current procedures for NAEP do not, however, transform latent trait distributions to an expected percent correct metric.
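One way to picture the transformation from a latent trait distribution to an expected percent correct metric is to average item probabilities over the distribution. The sketch below assumes a normal latent trait distribution, a three-parameter logistic item model, and a simple grid approximation. All of these, along with the item parameters, are illustrative assumptions rather than NAEP's estimation procedures.

```python
import math


def prob_correct(theta, a, b, c, D=1.7):
    """Three-parameter logistic probability of a correct answer (assumed model)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def population_expected_percent_correct(basket, mean, sd, n_points=201):
    """Expected percent correct on a basket for a normal(mean, sd) population.

    Approximates the average over the latent trait distribution with
    normalized normal-density weights on a grid spanning mean +/- 5 sd.
    """
    thetas = [mean + sd * (-5.0 + 10.0 * i / (n_points - 1))
              for i in range(n_points)]
    weights = [math.exp(-0.5 * ((t - mean) / sd) ** 2) for t in thetas]
    total_weight = sum(weights)
    expected = 0.0
    for theta, weight in zip(thetas, weights):
        basket_mean = sum(prob_correct(theta, a, b, c)
                          for (a, b, c) in basket) / len(basket)
        expected += (weight / total_weight) * basket_mean
    return 100.0 * expected


# Hypothetical basket and two hypothetical populations.
basket = [(1.0, -0.5, 0.2), (0.9, 0.0, 0.2), (1.1, 0.5, 0.2)]
lower_group = population_expected_percent_correct(basket, mean=-0.5, sd=1.0)
higher_group = population_expected_percent_correct(basket, mean=0.5, sd=1.0)
```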
Market-basket scores could be based on intact, administrable forms (like scenario two in Figure 1), like the proposed short forms. To support inferences about performance on one version of the short form to performance on another version of the short form, the short forms would need to be strictly parallel in the full technical sense. Creating parallel short forms would not be a sufficient condition to support inferences from scores on the short forms to the main NAEP scale, however. Much more complex statistical procedures would be needed to enable generalizations about performance on main NAEP based on performance on the short form.
Alternatively, market-basket scores could be based on synthetic forms (scenario one in Figure 1). A synthetic form is a form proportionately representative of the content and skill domain but too long to administer in its entirety to a single student. The concept of a synthetic form is similar to the concept of a market basket as it is used in other settings; i.e., a sampling of items intended to be representative of some larger whole (e.g., the content and skills tested by NAEP). Summarizing performance on synthetic forms using a percent-correct or observed-score reporting metric would be quite complex, as no one student would take the entire test. This approach to market-basket reporting would have to be based on hypothetical observed scores for the synthetic form. And results would be modeled projections from data on some other forms.
Mislevy then proceeded to develop a framework for analyzing the complexities involved in collecting and reporting data. He identified various ways of collecting data and of reporting it, then described the kinds of inferences supported by various reporting procedures and their appropriateness for the different collection methods. Throughout the paper, Mislevy emphasized that all combinations of collection and reporting procedures involve tradeoffs. Some methods are simpler and quicker than others but do not support the desired inferences. Other methods yield generalizable results but at the expense of simplicity. A key issue for the NAEP market-basket concept is the desire to have market-basket results that are comparable to main NAEP results. The goal is to be able to make inferences about performance on the market-basket collection of items compared to a national benchmark (main NAEP). This goal becomes particularly challenging under scenario two (see Figure 1), where a short form is released to states and districts for their use and scores are to be derived quickly and are intended to be comparable to main NAEP.
In his paper, Mislevy systematically laid out the issues that need to be resolved before decisions are made on data gathering and data reporting models for the market basket. Through his analysis, Mislevy explored the competing goals of simplicity of methods versus generalizability of results. The simplest methods would use parallel, intact forms for data collection and observed scores for reporting. Questions remain as to how generalizable the forms and scores would be to the content domain, if based on this approach. The most generalizable results would be based on a system of arbitrary forms, with performance reported as the latent trait distribution, as is currently done with NAEP. However, this is also one of the most complex of the possibilities.
NAEP'S YEAR 2000 MARKET-BASKET STUDY: WHAT DO WE EXPECT TO LEARN?
John Mazzeo, executive director of ETS's School and College Services, told workshop participants that the ETS year-2000 study on the market basket was designed with three goals in mind: (1) to produce and evaluate a market-basket report of NAEP results; (2) to gain experience with constructing market-basket short forms; and (3) to conduct research on the methodological and technical issues associated with implementing a market-basket reporting system. The study involves the construction of two test forms (also referred to as administrable or short forms) for grade four mathematics. Although these forms were designed to be parallel, some of the research will evaluate the extent to which the forms meet the necessary assumptions to be considered parallel.
According to Mazzeo, the test developers hope that the study will serve as a learning experience regarding the construction of alternate short forms. Whereas creating intact test forms is a standard part of most testing programs, this is not the case with NAEP. Due to its many content areas and the need to limit the length of the testing time, NAEP uses a matrix sampling design to obtain a representative sample of students taking each subject-area assessment. Under this design, blocks of items within each content domain are administered to groups of students, making it possible to administer a large number and range of items during a relatively brief testing period. Consequently, each student takes only a few items in a given content area—too few to serve as a basis for individual scores.
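The idea behind matrix sampling can be sketched by pairing blocks of items into booklets so that every block is used but no student sees more than a fraction of the pool. The block names and the two-blocks-per-booklet layout below are hypothetical, and NAEP's actual booklet design is more elaborate than this illustration.

```python
from collections import Counter
from itertools import combinations


def build_booklets(blocks, blocks_per_booklet=2):
    """Form one booklet for every combination of blocks of the given size."""
    return [list(combo) for combo in combinations(blocks, blocks_per_booklet)]


# Hypothetical item blocks within one content area.
blocks = ["M1", "M2", "M3", "M4"]
booklets = build_booklets(blocks)

# How often each block appears across booklets (balanced by construction).
block_usage = Counter(block for booklet in booklets for block in booklet)

# Share of the item pool that any one student sees, assuming equal-sized blocks.
share_seen = len(booklets[0]) / len(blocks)  # 0.5
```

With four blocks taken two at a time there are six booklets, and each block appears in three of them, so all items are administered to some students even though each student takes only part of the pool.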
Because NAEP's current system for developing and field testing items was set up to support the construction of a system of “arbitrary” test forms in an efficient manner, it does not yet have guidelines for constructing market baskets or intact tests. That is why study of the creation of such forms is under way.
The short forms were constructed by a NAEP test development committee that had been instructed to try to identify a set of secure NAEP items that were high quality exemplars of the pool; that matched the pool with respect to content, process, format, and statistical specifications; and that could be administered within a 45-minute time period. The committee constructed two forms with approximately 30 items organized into three distinct blocks, each to be given during separately timed 15-minute test sessions. One of the short forms contains previously administered secure items; the other contains new items. Both forms will be given to a random sample of 8,000 students during the NAEP 2000 administration. These forms will be spiraled 2 with previously administered NAEP materials to enable linking to NAEP.
Mazzeo said that the year-2000 study is expected to result in three products: (1) one or more secure short forms; (2) a research report intended for technical audiences that examines test development and data analytic issues associated with the implementation of market-basket reporting; and (3) a report intended for general audiences.
2 Spiraling is an approach to form distribution in which one copy of each different form is handed out before spiraling down to a second copy of each form and then a third and so forth. The goals of this approach are to achieve essentially random assignment of students to forms while ensuring that an essentially equal number of students complete each form.
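The spiraling approach to form distribution defined in the footnote amounts to handing out forms in a repeating cycle. A minimal sketch follows, with hypothetical student identifiers and form names.

```python
from collections import Counter


def spiral_assign(student_ids, forms):
    """Assign forms in rotation: one copy of each form, then a second, and so on."""
    return {student: forms[i % len(forms)]
            for i, student in enumerate(student_ids)}


# Hypothetical seating order and form labels.
students = [f"S{n:02d}" for n in range(1, 11)]      # S01 ... S10
forms = ["short form A", "short form B", "main NAEP booklet"]

assignment = spiral_assign(students, forms)
form_counts = Counter(assignment.values())
```

With ten students and three forms, the counts per form differ by at most one, which reflects the goal of having an essentially equal number of students complete each form.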
At the time of the workshop, ETS's plans for the market-basket reports had not been formalized. According to Mazzeo, some of the features being considered include
National and state-level NAEP results (average scores and achievement level percentages) expressed in a market-basket metric (e.g., percent correct). The reporting of such results could be confined to “total-group” scores or it could be extended to include national and state results by gender, race/ethnicity, parental education, and other standard NAEP reporting groups.
Release of all, or a sample, of the items that make up the short form as well as performance data. Mazzeo noted that the text of the items, scoring rubrics, and sample student responses might also be provided.
A format and writing style appropriate for a general public audience.
The research study will investigate market-basket reporting under two configurations, one in which the short form would be made available, and one in which it would not. ETS researchers will continue to study alternative analytic and data collection methods. One of the studies planned involves conducting separate analyses of the data using methods appropriate for arbitrary forms, methods appropriate for congeneric forms, and methods appropriate for parallel forms. Each of these sets of analyses will produce results in an observed score metric as well as a true score metric.
The study calls for comparing results from the arbitrary forms with results from other approaches to obtain empirical evidence about which data gathering options are most viable for the market-basket concept. These comparisons will focus on the degree of similarity among the sets of results. If the congeneric and parallel forms models (which are based on strong assumptions but involve less complex analytic procedures) produce the same results as the arbitrary forms model (which makes the weakest assumptions but involves the most complicated analysis), then the simpler data collection and analytic procedures may be acceptable. Comparisons of observed score and true score results for each of the approaches will inform decisions about which type of reporting scale should be used.
The study will also provide data that can be used to evaluate context effects. The administration design will yield multiple estimates of item parameters for some of the market-basket items. Comparisons of the parameter estimates will enable investigation of the magnitude of context effects.
The year-2000 study will entail evaluation of the potential benefit of using longer market baskets. According to Mazzeo, the 31-item short forms were chosen out of consideration for school and student burden, increasing difficulties in obtaining school participation in NAEP, and the conviction that, “to be effective, a publicly released market basket of items should be of modest size.” Other choices about market-basket length are also possible; one is Darrell Bock's domain score reporting approach. Under this approach, the entire item pool is released, and the reporting scale is defined in terms of scores on the full item pool. Mazzeo reminded participants that a longer collection of items would permit more adequate domain coverage and produce more reliable results.
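The link between the length of a collection of items and the reliability of its scores is often illustrated with the Spearman-Brown formula from classical test theory. The workshop summary does not cite this formula; it is offered here only as a familiar approximation, it assumes that the added items behave like the existing ones, and the starting reliability is hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when a test is lengthened by length_factor.

    Assumes the added items are comparable to the existing ones.
    """
    return (length_factor * reliability
            / (1.0 + (length_factor - 1.0) * reliability))


# Hypothetical reliability for a short form, then for a form twice as long.
short_form_reliability = 0.80
doubled = spearman_brown(short_form_reliability, length_factor=2.0)
```

Under these assumptions, doubling the length raises the projected reliability from 0.80 to roughly 0.89, with smaller gains from each further increase in length.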