Summing Up: Issues to Consider and Address
One of the objectives for convening the market-basket workshop was to hear individuals representing a variety of perspectives respond to the plans for the NAEP market basket. The earlier chapters of this report summarize the remarks, organizing speakers' comments into four broad categories: (1) the parallels between the NAEP market basket and the CPI; (2) the use of a representative set of items as a means for facilitating public understanding of the material tested by NAEP; (3) the use of percent-correct scores as the summary indicator to communicate performance on the set of released items; and (4) the use of a short form to enable comparisons of performance on state and local tests with performance on NAEP. From the speakers' reactions and the discussion among workshop participants, the committee identified a number of recurring themes. While no attempt was made to establish consensus on these themes, they are discussed in this chapter to further assist NAEP's sponsors in their decision making regarding implementation of the market basket.
LIMITATIONS OF THE CPI METAPHOR
During his market-basket workshop presentation, Richard Colvin of the Los Angeles Times questioned the analogies being drawn between the CPI and the NAEP market basket. He maintained that metaphors can be a very effective means for conveying the meaning of complex phenomena.
The image of a shopper filling up a shopping cart works very well for the CPI. Since the CPI has to do with how much products and services cost, the market basket is an appropriate metaphor. Colvin wondered, however, if it was the right metaphor to associate with student achievement:
NAEP already holds claim to the best metaphor—“The Nation's Report Card.” What would the report based on a representative set of items be called? This is not a trivial issue. The name must relate to something the public already understands. And, it must also be seen as complementing, not conflicting with, the NAEP as a whole. We in the press are going to call it something. We might call it a “Quiz,” a “Final,” or a sample of “What Students Need to Know.” You need to choose what you think it ought to be called. And, if you don't, we will.
Colvin concluded by suggesting that a reference point is needed for the body of knowledge and skills that is to be part of the market basket. “If the market basket is analogous to what is covered by the entire NAEP, then what does NAEP represent?” he asked.
Although the concept of a market basket of goods and services is readily grasped, the construction and measurement of the CPI market basket is clearly a very subtle and complex undertaking. In addition, the current conception of the NAEP market basket differs from the CPI operation in several potentially important ways. First, the CPI market basket was conceived as a purely descriptive measure and is built using extensive data on actual consumer purchases. In contrast, the NAEP market basket would not passively measure what is actually being learned by school students. Rather, construction of the NAEP market basket would require normative judgments about the content of the domain and selection of representative items.
Second, regional and area differences in purchasing habits are reflected in the CPI local area indexes and limit comparability across geographical areas. For NAEP, the same collection of items will be used to summarize performance across geographical areas; indeed, one of the stated goals for the market basket is to facilitate reporting of regional-level results and comparisons with national results. However, the NAEP market basket will make no attempt to reflect regional differences in curriculum. Thus, students may be tested on concepts and skills that have not been covered by their instructional programs.
Finally, production of the CPI involves the development and execution of sample surveys designed specifically for pricing the CPI market basket. Computation of the CPI is not accomplished by simply embedding
market-basket questions in an existing consumer survey. The CPI experience suggests that subtle and difficult measurement issues may await efforts to incorporate the market-basket concept into the existing structure of NAEP.
RELEASING A LARGE REPRESENTATIVE SET OF ITEMS
Many workshop participants commented on the utility of public release of a large representative set of NAEP items. They thought that such a release could potentially impact education reform by allowing teachers, administrators, curriculum specialists, and assessment directors to use the items in discussions about their instructional practices, curricular changes, and state and local assessments. Representatives from the First in the World Consortium offered examples of the ways in which released material might be used to further education reform efforts.
Discussants also suggested that such a release would be useful for increasing public awareness about the content and skills tested on NAEP. While this is certainly a well-intended objective, consideration should be given to the extent to which the public would take advantage of a large release of items and the inferences they might make. Would parents and others be willing to spend time reviewing large numbers of items? What would they think about the material they were seeing? Would NAEP's sponsors offer guidance to help them understand the content and skills the items are intended to assess? Simply placing a large number of items in the hands of the public would not necessarily enhance understanding. It will be important to consider the mechanisms that will be used to communicate with the public about the content and skills covered by the items. It may be enlightening to consider other testing programs' experiences with disclosing test forms and providing practice tests.
PERCENT CORRECT: NOT AS SIMPLE AS IT SOUNDS
A clear message from the workshop discussants was the deceptive complexity of the percent correct metric. One factor contributing to its complexity is the denominator of the percent-correct ratio; that is, percent correct of what? Total questions on the test? Total points on the test? Total content in the domain?
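The difference the choice of denominator makes can be seen in a small hypothetical calculation (all figures below are invented for illustration, not drawn from NAEP data):

```python
# Hypothetical illustration: the same response record yields different
# "percent correct" values depending on which denominator is chosen.

# Assumed toy data: a 40-question form worth 50 score points in all,
# drawn from a framework domain judged to span 120 items' worth of content.
questions_answered_correctly = 30
points_earned = 35.5

total_questions_on_test = 40
total_points_on_test = 50
total_items_in_domain = 120

pct_of_questions = 100 * questions_answered_correctly / total_questions_on_test
pct_of_points = 100 * points_earned / total_points_on_test
pct_of_domain = 100 * questions_answered_correctly / total_items_in_domain

print(f"percent of questions on the test: {pct_of_questions:.1f}%")  # 75.0%
print(f"percent of points on the test:    {pct_of_points:.1f}%")     # 71.0%
print(f"percent of content in the domain: {pct_of_domain:.1f}%")     # 25.0%
```

The same student looks quite different under each reading, which is precisely why the denominator must be fixed and communicated before the metric can mean anything to the public.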
A second concern voiced by participants was the meaning attached to the percent correct score. Speakers cited a disconnect between the public
perception of what constitutes a passing score and the actual percent correct scores that would be associated with the basic, proficient, and advanced achievement levels. They argued that the public is accustomed to seeing a letter grade attached to percent correct, and the temptation for the media and the general public to translate percent-correct scores to grades would be overwhelming.
A third issue raised was the comparability of percent-correct scores with NAEP scores. NAEP currently reports results as scaled scores derived from IRT-based latent trait estimates. While the statistical machinery exists to transform the latent trait scale to a percent-correct scale, the procedures are complex and time consuming. Given that complexity, would the move toward percent-correct scores be worth the effort, particularly if it does not yield the desired improvements in public understanding of NAEP results?
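As a rough sketch of the kind of machinery involved: one standard way to map a latent-trait score onto a percent-correct metric is to compute the model-expected score on a fixed (for example, released) item set by averaging the model-implied probabilities of a correct response. The 3PL item parameters below are invented for illustration and are not actual NAEP values.

```python
import math

def p_correct_3pl(theta, a, b, c):
    """Probability of a correct response under the three-parameter
    logistic (3PL) IRT model, with the conventional 1.7 scaling."""
    return c + (1 - c) / (1 + math.exp(-1.7 * a * (theta - b)))

# (discrimination a, difficulty b, guessing c) for a toy five-item set;
# these parameters are hypothetical.
items = [(1.0, -1.0, 0.20), (1.2, -0.5, 0.20), (0.8, 0.0, 0.25),
         (1.1, 0.5, 0.20), (0.9, 1.2, 0.20)]

def expected_percent_correct(theta):
    """Model-expected percent correct on the fixed item set at ability theta."""
    probs = [p_correct_3pl(theta, a, b, c) for a, b, c in items]
    return 100 * sum(probs) / len(probs)

for theta in (-1.0, 0.0, 1.0):
    print(f"theta = {theta:+.1f} -> expected {expected_percent_correct(theta):.1f}% correct")
```

Even this toy version shows why the transformation is delicate: the resulting percent-correct figure depends entirely on which items are in the reference set, so changing the released set changes every reported number.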
LINKING SHORT FORM RESULTS TO NAEP AND TO STATE AND LOCAL ASSESSMENTS
The most common use cited for the short form was embedding it into state and local assessments, thus providing states and localities with a “link” to NAEP. A previous NRC committee studied this subject in depth by examining several scenarios for embedding, including the embedding of representative blocks of NAEP material (National Research Council, 1999a). The committee maintained that while using representative blocks of material would help increase the comparability of scores across states, many issues would remain unresolved. They identified a number of factors that would bear on the comparability of scores, including NAEP's use of conditioning to estimate performance, likely misalignment of local curricula with NAEP, the contextual circumstances of testing within a given state or district, students' and administrators' motivation levels, administrative conditions, time of testing, and differing criteria for excluding students from participation (e.g., disabilities or limited English proficiency). These factors led the NRC's Committee on Embedding Common Test Items in State and District Assessments to conclude that:
Embedding part of a national assessment in state assessments will not provide valid, reliable, and comparable national scores for individual students as long as there are: (1) substantial differences in content, format, or administration between the embedded material and the national test that it represents; or (2) substantial differences in context or administration between the state and
national testing programs that change the ways in which students respond to the embedded items (National Research Council, 1999a:3).
This finding closely parallels an earlier conclusion reached by another NRC committee, the Committee on Equivalency and Linkage of Education Tests, which stated:
Under limited conditions it may be possible to calculate a linkage between two tests, but multiple factors affect the validity of inferences drawn from the linked scores. These factors include the content, format, and margins of error of the tests; the intended and actual uses of the tests; and the consequences attached to the results of the tests. When tests differ on any of these factors, some limited interpretations of the linked results may be defensible while others would not be (National Research Council, 1999b:5).
As thinking about the design and intended uses of the short form proceeds, it is important to keep in mind the findings from these two committees.
OTHER ISSUES TO CONSIDER AND RESOLVE
Workshop participants brought up a number of other issues related to the development of the NAEP market basket. These issues bear on practical matters related to developing the market basket as well as unintended consequences that may be associated with its implementation.
One of the stated goals for NAEP's short forms is to make assessments available in subjects and grades not assessed every year. However, it is possible that NAEP could end up being a victim of its own success. If plans for the short form were successful, states and districts would have a test to administer in NAEP off-cycle years and could easily derive scores comparable to the NAEP scale. Why, then, would they need to participate in NAEP? If they could do this in off-cycle years, why not do it every year?
NAEP has invested a considerable amount of time and money in the development of two short forms. Future work will be needed to score the short forms, to devise a mechanism for comparing percent correct scores with main NAEP scores, and to develop reporting procedures. If the short forms proceed to the stage of operational use, continued development of
additional forms will be needed. But to what extent will this process result in more useful, more understandable results? To what extent will the market basket produce the desired outcomes? At a time when only limited funding is being made available for educational purposes, is this the best use of funds? The costs and benefits of the market basket should be carefully considered.
Retrofitting the Design
Originally, NAEP was developed as a survey of what American school children know and can do. The frameworks cover broad content areas. The content areas are combined with other item characteristics (such as item type, item difficulty, and cognitive process) to form a test blueprint matrix. For mathematics, this matrix has some 60 cells. Currently, no one student takes sufficient items to represent the matrix fully. Instead, a matrix sampling procedure is used to assign items to blocks, blocks to forms, and forms to students. A single student takes three blocks of items.
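The assignment scheme described above can be sketched as follows. This is a deliberately simplified toy design, with invented sizes; the actual NAEP design uses balanced incomplete block spiraling across many more items, blocks, and blueprint cells.

```python
import random

# Simplified matrix-sampling sketch: items -> blocks -> forms -> students.
# All sizes are hypothetical.

random.seed(0)

NUM_ITEMS = 90
NUM_BLOCKS = 9
BLOCKS_PER_FORM = 3  # each student sees three blocks, as described above

# Step 1: assign items to blocks (10 items per block here).
items = list(range(NUM_ITEMS))
blocks = [items[i::NUM_BLOCKS] for i in range(NUM_BLOCKS)]

# Step 2: assemble forms as combinations of blocks. A real balanced design
# pairs every block with every other block equally often; this sketch
# simply rotates through adjacent blocks for brevity.
forms = [tuple((b + k) % NUM_BLOCKS for k in range(BLOCKS_PER_FORM))
         for b in range(NUM_BLOCKS)]

# Step 3: each student receives one form, and so answers only a
# fraction of the full item pool.
students = [random.choice(forms) for _ in range(1000)]

items_seen = len(blocks[0]) * BLOCKS_PER_FORM
print(f"each student answers {items_seen} of {NUM_ITEMS} items")  # 30 of 90
```

The point of the sketch is the one made in the text: no single student's responses cover the blueprint matrix, so the pool as a whole, not any one form, represents the framework.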
This sort of test assembly is very different from that typically used in tests developed for educational purposes, where the end result is a single test form with the proper mix of content and item types to represent the test specifications, and a given student takes the entire test. Construction of the short form would require this second type of development, but NAEP's current frameworks and existing item pools were not created with it in mind. Limitations on the amount of time schools have for testing place restrictions on the number of items that can be administered, and the frameworks and item pools, which are very broad, may not be representable in a test that can be administered in a 45-minute session. In fact, as noted by John Mazzeo, test specifications had to be generated in order to assemble the short forms, and these specifications were based on an examination of the characteristics of the item pool and what it would support.
One key issue to emerge from the workshop is the need for explicit consideration of the ramifications of building a new system by manipulating the features of a pre-existing system. During their presentations, both John Mazzeo and Patricia Kenney expressed concern about the difficulties associated with trying to retrofit a pre-existing reporting and data collection system to new purposes and needs, particularly when the pre-existing system was not originally designed for such purposes and needs. It is important to keep these cautionary words in mind.
Changing NAEP's Purpose
During his presentation at the workshop, David Thissen of the University of North Carolina-Chapel Hill used Holland's characterization of “testing as measurement” versus “testing as a contest” to describe different purposes for testing. When thinking of testing as measurement, the goal is to make the appropriate inferences; that is, to measure performance as accurately as possible. When thinking of testing as a contest, the goal is to get the highest scores possible. According to Thissen, NAEP's current procedures treat testing as measurement, seeking to obtain the most accurate estimates of student performance. Implementation of procedures that involve the short form would move NAEP into the category of testing as a contest.
Testing as a contest is high-stakes testing. NAEP traditionally has been a low-stakes test, since decisions about schools, teachers, and individuals have not been based on test results. As reporting moves to smaller units, the stakes increase, as do the pressure and motivation to do well. Motivation will, undoubtedly, affect performance. Thus, it is not clear how comparable data from national NAEP (taken under low-stakes conditions) will be with local results based on the short form (taken under high-stakes conditions). This undermines one of the main goals articulated for the short form—to facilitate comparisons with national benchmarks. Again, NAEP's sponsors should consider these potential consequences as policy and decision making about the market basket proceeds.