Simplifying NAEP's Technical Design: The Role of the Short Form
A third component of the NAEP market basket is the concept of the NAEP short form. Under this notion, the short form would be the vehicle for releasing items, providing a set of questions that could serve as an administrable test form. One or more versions of the short form would be released for public use, while other versions of the short form would be kept secure for use in conjunction with national NAEP and, perhaps, with state and local assessment programs.
To guide policy and decision making on the measurement issues pertaining to the short form, NAGB adopted the following principles (National Assessment Governing Board, 1999a):
Principle 1: The NAEP short form shall not violate the Congressional prohibition to produce, report, or maintain individual examinee scores.
Principle 2: The Board shall decide which grades and subjects shall be assessed using a short form.
Principle 3: Development costs, including item development, field testing, scoring, scaling, and linking shall be borne by the NAEP program. The costs associated with use, including administration, scoring, analysis, and reporting shall be borne by the user.
Principle 4: NAEP short forms intended for actual administration should represent the content of corresponding NAEP assessment frameworks as fully as possible. Any departure from this principle must be approved by the Board.
Principle 5: Since it is desirable to report the results of the short form using the achievement levels, the content achievement level descriptions should be considered during the development of the short form.
Principle 6: All versions of the short form should be linked to the extent possible using technically sound statistical procedures.
The proposed short form was the topic of considerable discussion during the workshop. The text below attempts to capture the discussions, highlighting the issues that seemed most important to participants. Addressed first are speakers' comments regarding potential uses of the short form and the data gathered from it. Addressed later are problems associated with the short form.
POTENTIAL USES OF THE SHORT FORM
Benchmarking and Other Comparisons
In September 1999, the Committee on NAEP Reporting Practices held a workshop on reporting district-level NAEP results. One of the clearest messages from participants in the workshop was that states and local jurisdictions want to be able to make comparisons of their achievement test results—comparisons with other jurisdictions and comparisons against national benchmarks. At present, state assessment programs enable within-state comparisons among schools and districts, but they do not allow for comparisons across state boundaries. State NAEP enables comparisons of achievement results from state to state but does not allow for comparisons among districts and schools, since results are not reported at the district and school levels.
District-level workshop participants indicated that comparisons would serve a number of important purposes. For example, comparisons among districts that share common social, economic, and demographic characteristics would help policymakers set reasonable expectations for student achievement. They also would allow districts to identify other districts like them that are performing better, thereby stimulating discussions about education practices that work well (National Research Council, 1999d).
Workshop participants were also interested in having an external barometer against which to validate results from state and local assessments. Local jurisdictions were attracted to the prospect of being able to compare their students' performance to national benchmarks. They felt that having such information would open up discussions about local standards, curricula, and assessment programs (National Research Council, 1999d).
Participants in the committee's market-basket workshop voiced similar interests and concerns. They liked the idea of having school-level or district-level results that could be compared to NAEP. Many had heard requests from their state legislators for national data to be used as benchmarks in setting goals for improving student achievement. According to Marlene Hartzman, an evaluation specialist with the public schools in Montgomery County, Maryland, “We have more data than we need, but we don't have what we need—a national benchmark.” Participants considered the short form to be the mechanism for obtaining benchmarking data, assuming the short form would yield school-level and district-level results.
Speakers commented that benchmarking data would help school administrators assess students' strengths and weaknesses and would enable them to target areas for improvement. Open discussion about weak areas could serve to identify education practices that work. Participants also pointed out that the short form might be a means for encouraging schools to participate in NAEP because it could be used to give schools and districts feedback on their students' NAEP performance, something NAEP does not currently provide.
Embedding the Short Form in Existing Assessments
Prior to the market-basket workshop, discussants were asked to consider the ways they would use the short form, if it were available. Many said that they would want to “embed” the short form in their state or district assessments to obtain results that could be compared both with the local assessments and with NAEP. Marilyn McConachie expanded on this idea for using the short forms, saying, “If these forms could be embedded into state tests, this would help us considerably in two ways: linking to NAEP and providing a strong sample from our state [by supplementing the sample selected for NAEP participation]. Linking to NAEP would help meet the state accountability policy's requirement for national benchmarking.”
In preparation for the workshop, Joseph O'Reilly, past president of the National Association of Test Directors, conducted an informal survey of some test directors. Highlighting his findings, O'Reilly stated that:
Overall, the test directors . . . were almost unanimous in support of a short form or market-basket form of NAEP if it could be incorporated into the state assessment system. I think that test directors are assuming that the proposed short form would be 10-15 items that could be used to scale the rest of the items on a NAEP scale, just as one embeds items on different levels of a test so that you can obtain a common scale across forms or grades.
O'Reilly reported that respondents saw great value in obtaining normative data on NAEP-like tests but were adamant about incorporating items from a short form into existing tests. He found that they wanted the information a short form would provide but would not support additional testing (or additional time for testing) to obtain it. Workshop participants expressed similar viewpoints, saying they would consider administering the short form as a separate, common test given to all students but adding that they would have to replace one of their regular assessments to do so.
Comparing Local Curricula and Assessments with NAEP
Some discussants noted that, although their schools had participated in NAEP, the results had been of little value because the relationship of NAEP to instructional programs has not yet been established. They emphasized the importance of alignment between curricula and assessment, pointing out that assessment results are of little use if based on material not covered in instructional programs. Ronald Costello described the role of assessment in school reform efforts:
[A]ssessment is only one aspect of the three parts of what states and school districts are using testing to accomplish. The other two are standards and accountability, and we are only beginning to justify . . . the time taken away from instruction [to] serve those ends. Unless it can be connected to state and local curriculum and instructional practices, there will be little value [in continued participation] in NAEP. It doesn't matter how good the assessments . . . [are if] they can't be connected to standards and accountability in our states and school districts. For the market basket to have value at the state and local level, it must add value to what we do in schools to improve student learning. . . . [We must] be able to use the information in the change process.
Participants thought that the short form would provide relevant information for school systems considering changes. The released short form would permit educators and policymakers to see firsthand the material included on the test. Their review of the released material would promote discussions about what is tested and how it compares with the skills and material covered by their own curriculum. The secure short form would yield data that could further these discussions. Educators could examine student data (even if it were aggregated to the school or district level) and evaluate performance in relation to their local practices. They could engage in discussions about their curricula, instructional practices, and sequencing of instructional material, and could contemplate changes that might be needed.
Stimulating Discussions with Teachers
As described earlier in this report (see Introduction and Chapter 4), members of the First in the World Consortium participated in TIMSS and used the results to learn how the consortium schools fared against world-class standards and to examine and revise curriculum, instructional strategies, and assessment practices. According to First in the World representatives Paul Kimmelman and Dave Kroeze, a key component of the consortium's efforts was the establishment of learning networks that allowed teachers and administrators to participate in the reform discussions and improvement efforts (Hawkes et al., 1997).
The consortium established a research agenda covering four broad areas: student performance; curriculum and instruction; instructional practices; and teacher characteristics. An essential aspect of the research agenda was the involvement of teachers and administrators. Teachers and administrators reviewed their coverage of relevant content, amounts of instruction in specific areas, and the depth of understanding expected of students. They also studied teachers' attitudes and beliefs about instruction, the amount and type of homework assigned, and the extent to which teachers were using computers and calculators.
Teams of teachers from the participating schools examined students' performance on the topics covered on the TIMSS assessment. While they did not have access to the actual test items, they did have data on performance in specific topic and content areas. They used the results to evaluate their students' performance on the topics compared to students in other nations and considered when and how the topics were covered in their curricula. Kimmelman and Kroeze felt that discussions with and among teachers represented some of the most valuable outcomes of the consortium's participation in TIMSS.
Out-of-Cycle NAEP Testing
While the initial plan is to pilot test a short form for fourth grade mathematics, if the pilot is successful, NAGB may extend use of short forms to other grades and other subject areas. Because NAEP does not currently administer every subject to every grade every year, workshop participants suggested that short forms would help fill in the gaps; that is, they could be used to survey students in grades, subjects, and times that are not part of the regular NAEP schedule. At present, for example, the fourth grade mathematics assessment occurs every four years as part of state NAEP. If available, a short form in fourth grade mathematics could be given in the “off-years,” thereby enabling compilation of yearly trend data. If short forms were produced for subjects not tested as part of state and local assessments, then states and districts could use the short forms to expand their assessment programs.
PROBLEMS WITH THE SHORT FORMS
Scoring: Faster, Easier, Better?
One advantage cited for the short forms was that they could be faster and less expensive to score than traditional NAEP assessments, providing score distributions without NAEP's usual complex statistical methodologies. Although it is not clear what NAGB's policy would eventually be with regard to how scores on the short form would be derived, workshop participants discussed scoring advantages—and disadvantages—at length. The text below highlights their comments.
NAEP uses a complex statistical process for scoring to compensate for the fact that no one student takes the full assessment. Since no one student responds to a sufficient number of items in a given content area to produce reliable estimates of performance, ability estimates are not computed for individuals. Instead, a conditioning process is used to generate the likely ability distributions for each individual based (or “conditioned”) on background characteristics and responses to cognitive items. Five ability estimates (or “plausible values”) are drawn from the distributions as estimates of the individual's proficiency level. These “draws” are aggregated over individuals to produce estimates of group-level distributions of performance.
There is a fundamental difference between NAEP's process and the process used by tests that are designed to produce individual scores: tests that produce individual scores do not condition. In fact, test users would most likely reject conditioning were it used to derive individual scores. As a result of conditioning, an individual's performance is adjusted according to the way others like him or her perform, others who share common characteristics such as gender, race, ethnicity, and socioeconomic status. This process is justifiable when the purpose of an assessment is to estimate group-level performance, but it is not typically used for tests that generate individual scores. In addition, test results based on conditioning are not comparable to unconditioned results.
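The conditioning and plausible-value steps described above can be illustrated with a deliberately simplified sketch. It is not NAEP's operational procedure, which relies on item response models and a latent regression on many background variables. Here a simple normal-normal model stands in for that machinery, and every number and name (`error_var`, `group_mean`, `group_var`, the scores) is a hypothetical assumption chosen only for illustration.

```python
import random
from statistics import mean

N_DRAWS = 5  # the text describes five plausible values per student


def conditioned_posterior(observed, error_var, group_mean, group_var):
    """Toy normal-normal posterior for one student's proficiency.

    observed   : noisy score implied by the student's few item responses
    error_var  : measurement-error variance of that score (hypothetical)
    group_mean : mean for students with the same background profile
    group_var  : proficiency variance within that background group
    """
    precision = 1.0 / error_var + 1.0 / group_var
    post_mean = (observed / error_var + group_mean / group_var) / precision
    return post_mean, 1.0 / precision


def plausible_values(observed, error_var, group_mean, group_var, rng):
    """Draw N_DRAWS values from the student's conditioned distribution."""
    post_mean, post_var = conditioned_posterior(
        observed, error_var, group_mean, group_var
    )
    return [rng.gauss(post_mean, post_var ** 0.5) for _ in range(N_DRAWS)]


def group_mean_estimate(observed_scores, error_var, group_mean, group_var, rng):
    """Aggregate the draws over students to estimate the group mean."""
    draws = [
        plausible_values(x, error_var, group_mean, group_var, rng)
        for x in observed_scores
    ]
    # Average each set of draws across students, then average the five results.
    per_draw_means = [mean(student[k] for student in draws) for k in range(N_DRAWS)]
    return mean(per_draw_means)


rng = random.Random(0)  # fixed seed so the illustration is repeatable
post_mean, post_var = conditioned_posterior(250.0, 100.0, 220.0, 100.0)
estimate = group_mean_estimate([200.0, 220.0, 240.0], 100.0, 220.0, 100.0, rng)
```

With these illustrative numbers, a student whose responses imply a score of 250 but who belongs to a group averaging 220 receives a distribution centered near 235. That pull toward the group is the adjustment the text describes, and it is the reason conditioned results and unconditioned individual scores are not directly comparable.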
One option would be to make the short form available to states, districts, and schools to administer as they see fit. Scores on the short form would be generated for individuals, then aggregated to provide group-level data in the percent-correct metric. While this would circumvent NAEP's complex statistical procedures, it also means that short-form scores would not be conditioned. Hence, short-form results would not be comparable to the regular NAEP-scale results.
During their opening presentations, John Mazzeo and Robert Mislevy described methods that could be used to achieve NAEP-comparable results from the short form. These methods would use complex and lengthy statistical procedures. Given some of the complexities involved in producing scores that would be directly comparable to NAEP, several speakers questioned the extent to which it is critical to place the market-basket results and NAEP on exactly the same scale. In the words of one speaker, “Would it not be sufficient to provide results that are only somewhat NAEP-like?”
Reliability and Generalizability
The items selected for the short form are intended to represent NAEP's fourth grade mathematics frameworks. One might expect, therefore, that the scores on the short form would be generalizable not only to the set of questions at hand but also to the content domain from which the items were drawn. At the workshop, Darrell Bock of the University of Chicago expressed concern about the reliability and generalizability of scores based on the short form. One of the two pilot test versions of the short form contains 31 items and the other 33. Bock estimated that the reliability of a professionally developed 31-item test would likely fall in the low .80s, a value judged to be too low when tests are being used to make decisions about individuals. He stressed that the more germane concern is not reliability but generalizability. Can 31 items adequately represent the content domain? He reminded participants that fourth grade mathematics crosses content strand with process category with item type. The result is a matrix with about 60 cells. While it is possible to represent these cells under the current matrix sampling approach for NAEP, how well, he asked, could a 30-some item test represent these cells?
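The scale of both concerns can be gauged with some rough arithmetic. The sketch below uses the Spearman-Brown formula, a standard classical test theory result that the workshop text does not itself invoke, to ask how much longer a test would need to be to reach a higher reliability. The starting reliability of .80 and the target of .90 are illustrative assumptions, as is the treatment of the framework matrix as exactly 60 cells.

```python
import math


def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is lengthened by length_factor."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


def length_factor_for(reliability, target):
    """Lengthening factor needed to move from reliability to target."""
    return target * (1 - reliability) / (reliability * (1 - target))


items = 31                  # length of one pilot short form (from the text)
assumed_reliability = 0.80  # assumption: a value in the "low .80s"
target_reliability = 0.90   # assumption: an illustrative higher target
framework_cells = 60        # "about 60 cells" in the text, taken as exact

k = length_factor_for(assumed_reliability, target_reliability)
items_needed = math.ceil(items * k)

# With one item per cell at most, 31 items can touch at most 31 of 60 cells.
max_cell_coverage = min(items, framework_cells) / framework_cells

print(f"length factor: {k:.2f}, items needed: {items_needed}")
print(f"best-case share of cells covered: {max_cell_coverage:.0%}")
```

Under these assumptions the form would need to be roughly two and a quarter times as long to reach .90, and even in the best case a 31-item form could place an item in only about half of the framework cells, which is the generalizability problem Bock raised.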
In preparation for the workshop, Patricia Kenney, co-director of the NCTM NAEP Interpretive Reports Project, considered the feasibility of creating market-basket forms that matched the grade four NAEP mathematics assessment on the basis of content strand coverage, ability category, and item type. Based on material in John Mazzeo's paper, Kenney reported that the market-basket forms appear to represent the frameworks in terms of the content strand. But she called attention to the fact that the grade four mathematics framework covers 56 topics and subtopics. Like Bock, Kenney questioned the extent to which 30 items would be able to represent the frameworks at the topic or subtopic level.
Kenney pointed out an additional potential problem. In NAEP, items can be administered at more than one grade level. For example, an algebra item might be given at grades four and eight to facilitate measuring growth between grade levels. Because NAEP results are not reported at the student level, students are not disadvantaged by including topics that they may not have studied. The problem with these “grade overlap” items, however, is that they might be misinterpreted as NAEP recommendations. For instance, suppose an algebra item appeared in the fourth grade mathematics form. Would this imply that schools should teach algebra to fourth graders?
Kenney's most overarching concern was with “retrofitting,” that is, manipulating the features of an existing system to adjust for new purposes and uses. In the case of mathematics, neither the existing NAEP framework nor the item pool was developed with market-basket reporting in mind. Therefore, Kenney wondered if the existing materials would support such procedures. She reminded participants that the mathematics test development committee “was not able to assemble a collection of items that they felt were all exemplary of the framework and met all statistical, content, and process, and format specifications” (Mazzeo, 2000:11). Kenney repeated Mazzeo's concerns about retrofitting, saying, “In my own view, it is often more difficult and risky to attempt to retrofit a reporting and data collection system to an existing assessment that was not designed for such purposes than it is to build such an assessment system from scratch” (Mazzeo, 2000:27).
Level of Disaggregation
Workshop participants discussed the utility of disaggregated data, and questions arose as to the level of disaggregation that would be permitted. Carroll Thomas, superintendent of schools for Beaumont, Texas, emphasized the importance of having data for various population groups (e.g., data separated by racial/ethnic, gender, or other background characteristics). He pointed out that the conclusions one draws based on seeing total-group results could be very different from the conclusions one might draw based on results for various population groups. Thomas said he believes that decisions about changes in educational practices should be based on examining disaggregations of group-level data. Some participants asked if results would be reported by background characteristics. Others inquired whether results would be made available by school or only by district. Generally, participants believed that disaggregation would enhance the utility of results.
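Thomas's point that total-group and subgroup results can lead to different conclusions can be made concrete with a small constructed example. The districts, groups, enrollments, and percent-correct means below are entirely hypothetical and were chosen only to show how the effect can arise; they are not workshop or NAEP data.

```python
def weighted_mean(groups):
    """groups: list of (number_of_students, mean_percent_correct) pairs."""
    total_students = sum(n for n, _ in groups)
    return sum(n * m for n, m in groups) / total_students


# Hypothetical districts with the same two population groups
# but very different enrollment mixes.
district_a = {"group_1": (80, 70.0), "group_2": (20, 50.0)}
district_b = {"group_1": (20, 72.0), "group_2": (80, 52.0)}

overall_a = weighted_mean(list(district_a.values()))
overall_b = weighted_mean(list(district_b.values()))

print(f"overall: A = {overall_a}, B = {overall_b}")
for group in district_a:
    print(f"{group}: A = {district_a[group][1]}, B = {district_b[group][1]}")
```

In this constructed case District A has the higher overall mean (66 versus 56), yet District B has the higher mean in each population group. A reader who saw only the totals would draw the opposite conclusion from one who saw the disaggregated results.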
Other participants spoke about having individual results. NAGB's approved policy with regard to the short form explicitly prohibits reporting individual scores. While individual results would be generated initially, they would need to be aggregated for reporting purposes. The prohibition against producing individual results based on the short form stimulated considerable discussion. The short form could be administered to all children in a specific grade in a manner closely resembling other testing in schools—testing that results in individual score reports. How would one account for not having individual scores? Assessment directors and policymakers at the workshop maintained that the situation would indeed be difficult to explain to interested parties, particularly parents. Many felt the temptation to generate individual scores would be great, and difficult to resist, despite NAGB's prohibition. Further, in a scenario in which the short form was embedded into an existing assessment, participants wondered if the items would contribute to the individual scores generated by the existing assessment.
Under the current NAGB plan, two short forms would be produced, one for public release and the other kept secure and retained for use by states and districts. While school administrators might be able to control the generation of individual results from the secure form, the released form would be publicly available for any and all uses. In commenting on this, Richard Colvin of the Los Angeles Times described probable uses of the released test and suggested ways to handle derivation of individual scores:
No matter what caveats you offer, people will take the test and will calculate a percent correct score for their performance. I can guarantee that the [LA] Times would post such a test on its web site. . . . Americans are used to taking tests in magazines and comparing their performance to a scale. Despite your caveats, schools will have their students take it, and they will calculate a percent correct score. You won't be able to stop that so you need to figure out what to say about it. It would be better if there were a way to have a conversion scale of some sort. Another idea would be to set up a web site of your own where people could take the tests on-line. Then, perhaps there'd be a way to actually produce a score, based on which questions were answered correctly.
Embedding the Short Form in State and Local Assessments
The most common projected use of the short form cited by policymakers and directors of assessment was embedding it in state and local assessments. This potential use prompted considerable discussion. Scott Trimble of Kentucky's Department of Education pointed out that a given state's curriculum might not be completely congruent with the NAEP frameworks. It might be the case, for instance, that the state's curriculum includes areas not tested by NAEP, in which case the state assessment would have to cover the areas not covered on the short form. Or it might be that the NAEP form tests areas not covered by the curriculum, in which case, students would not have been taught the skills and knowledge being tested. Testing students on material they have not been taught results in less-than-useful measures of achievement.
Some workshop discussants raised the question of how students' motivation to do well might factor into performance on the short form. State and local assessments tend to be high-stakes exams that carry consequences for those who do not perform well; thus, motivation to do well is high. At present, NAEP is not a high-stakes test. Administering a NAEP short form as part of a high-stakes assessment program would change the context in important and relevant ways. While there are still unanswered questions about the effects of motivation on assessment results, the introduction of higher stakes could render results from the short form incomparable to national or state NAEP results and call into question the types of inferences that might be made.
Testing burden was also a concern to participants. Many judged it unlikely that they could introduce additional assessments into their states and districts or sacrifice more instructional time for testing. Thus, NAEP items, in essence, would need to do “double duty.” That is, to prevent test administration from taking any more time than it already does, NAEP items would need to count toward the score on the short form and also replace items currently on state and local tests that measure similar skills and content.
Such uses of the NAEP items bring to the forefront issues about linking state and local assessments to NAEP. Several discussants referenced reports from two earlier NRC committees, the Committee on the Equivalency and Linkage of Educational Tests and the Committee on Embedding Common Test Items in State and District Assessments (National Research Council, 1999d; 1999a). Both committees studied issues associated with linking state and local assessments to NAEP and, after in-depth exploration, concluded that many problems surround attempts to link assessments not initially designed for the purposes of linking.