1 The Proposed Voluntary National Tests and Their Evaluation

Standards-based reform is the centerpiece of recent efforts to improve primary and secondary education in the United States. The basic idea is that creating and aligning new, high standards for curriculum, instruction, and assessment for all students at every grade level will raise academic performance. Provisions in the 1994 reauthorization of Title I of the Elementary and Secondary Education Act, the Goals 2000 legislation, and the new Individuals with Disabilities Education Act (IDEA) all support and prompt standards-based education reform. Most states have developed or are developing both challenging standards for student performance and assessments that measure student performance against those standards.

Most state assessments provide information about the performance of schools and school districts, as well as individual students. However, there is no common scale that permits comparisons of student (or school) performance in different states with nationwide standards, like those of the National Assessment of Educational Progress (NAEP), or with the performance of students in other countries as indicated by the Third International Mathematics and Science Study (TIMSS). A new study from the National Research Council (1999c) concludes that it would not be feasible to develop such a common scale or to link individual score reports from existing tests to NAEP. In addition, a recent report by the U.S. General Accounting Office (GAO, 1998) explored reasons for discrepancies among states in the percentage of students who show satisfactory levels of achievement and the role that the Voluntary National Tests (VNT) might play in reducing these discrepancies.
Proposed Tests

In his 1997 State of the Union address, President Clinton announced a federal initiative to develop tests of 4th-grade reading and 8th-grade mathematics that could be administered on a voluntary basis by states and school districts beginning in spring 1999. The call for the VNT echoed a similar proposal for “America's Test,” which the Bush administration offered in 1990. The principal purpose of the VNT,
as articulated by the Secretary of the U.S. Department of Education (see, e.g., Riley, 1997), is to provide parents and teachers with systematic and reliable information about the key verbal and quantitative skills that students have achieved at two key points in their educational careers. The U.S. Department of Education anticipates that this information will serve as a catalyst for continued school improvement, by focusing parental and community-wide attention on achievement and by providing an additional tool to hold school systems accountable for their students' performance in relation to nationwide standards.

The proposed VNT has evolved in many ways since January 1997, but the major features were clear in the initial plan. Achievement tests in English reading at the 4th-grade level and in mathematics at the 8th-grade level would be offered to states, school districts, and localities for administration in the spring of each school year. Other features include:

- The tests would be voluntary: the federal government would prepare but not require them, and no data on any individual, school, or group would be reported to the federal government.
- The tests would be distributed and scored through licensed commercial firms.
- A major effort would be made to include and accommodate English-language learners and students with disabilities in the testing program.
- The tests, each administered in two 45-minute sessions in a single day, would not be long or detailed enough to provide diagnostic information about individual learning problems. However, they would provide reliable information so that all students—and their parents and teachers—would know where they stand in relation to high national standards and, in mathematics, in comparison with levels of achievement in other countries.
- The tests would be designed to facilitate linkage with the National Assessment of Educational Progress (NAEP) and the reporting of individual test performance in terms of the NAEP achievement levels: basic, proficient, and advanced. Standards for the 4th-grade reading test would be set by the achievement levels of the corresponding NAEP reading tests, and standards for the 8th-grade mathematics test by the 8th-grade NAEP mathematics tests. Comparisons with achievement in other countries would be provided by linking to results from the Third International Mathematics and Science Study.
- In order to provide maximum preparation and feedback to students, parents, and teachers, sample tests would be circulated in advance, and copies of the original tests would be returned with the students' original and correct answers noted.
- A major effort would be made to communicate test results clearly to students, parents, and teachers, and all test items would be published on the Internet just after the administration of each test.

The VNT proposal does not suggest any direct use of VNT scores to make high-stakes decisions about individual students, that is, decisions about tracking, promotion, or graduation. Representatives of the U.S. Department of Education have stated that the VNT is not intended for use in making such decisions, and the test is not being developed to support such uses. Nonetheless, some civil rights organizations and other groups have expressed concern that test users would inappropriately use VNT scores for these purposes. Indeed, under the plan, test users (states, school districts, or schools) would be free to use the tests as they wish, just as test users are now free to use commercial tests for purposes other than those recommended by their developers and publishers.
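Operationally, reporting an individual score in terms of NAEP achievement levels amounts to locating a linked scale score among ordered cut points. A minimal sketch of that mapping follows; the cut scores here are hypothetical placeholders, not actual NAEP values:

```python
# Sketch: map a linked scale score to a NAEP-style achievement level.
# CUT_SCORES are hypothetical illustration values, not real NAEP cuts.
from bisect import bisect_right

CUT_SCORES = [208, 238, 268]  # hypothetical basic/proficient/advanced cuts
LEVELS = ["below basic", "basic", "proficient", "advanced"]

def achievement_level(scale_score: float) -> str:
    """Return the achievement-level label for a linked scale score."""
    # bisect_right places a score equal to a cut in the higher category
    return LEVELS[bisect_right(CUT_SCORES, scale_score)]
```

Under this convention, a score exactly at a cut point falls in the higher category, which is how the ordered levels partition the scale.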
A new National Research Council report (1999b:Ch.12) concludes: “The VNT should not be used for decisions about the tracking, promotion, or graduation of individual students.” The VNT plan also does not preclude the possibility
that the VNT would be used for aggregate accountability purposes at the level of schools, school districts, or states.

Evaluation Plan

A testing program of the scale and magnitude of the VNT initiative raises many important technical questions and calls for quality control throughout the various stages of development and implementation. Public debate over the merits of the program began even before any evaluation of the technical adequacy of the test design, content, or administration would have been possible (see, e.g., Applebome, 1997). There are strong differences of opinion, for example, over such issues as:

- the appropriate roles for federal, state, and local authorities in developing and governing such a program;
- probable and possible consequences of the tests on teaching and learning;
- efficacy for minority students, disadvantaged students, students with disabilities, and English-language learners;
- the quality of the information that the tests will provide to the public;
- the relationship of the tests to other state, local, national, and even international assessment programs; and
- the general concept of using standardized tests as a major tool for educational accountability.

Policy debates are at times difficult to disentangle from arguments over the technical properties of testing programs. The overall purpose of this evaluation is to focus on the technical adequacy and quality of the development, administration, scoring, reporting, and uses of the VNT. Through procedures designed to assure rigorous and impartial scientific evaluation of the available data, we have attempted to provide information about VNT development that will aid test developers and policy makers at the federal, state, and local levels. This phase 1 report focuses on: (1) specifications for the 4th-grade reading and 8th-grade mathematics tests, (2) the development and review of items for the tests, and (3) plans for subsequent test-development activities.
The last includes plans for the pilot and field tests, for the inclusion and accommodation of students with disabilities and English-language learners, and for scoring and reporting the tests. Note that we interpret our mandate as a request for technical review only; we take no position on the overall merits of the VNT.

Initial plans for the evaluation of the VNT, developed at the request of the Department of Education in late summer 1997, followed the department's initial schedule for the design, validation, and implementation of the tests. Following President Clinton's January 1997 State of the Union address, the schedule called for development of test specifications for the 4th-grade reading and 8th-grade mathematics tests by fall 1997, pilot testing of test items later that year, and field testing of test forms early in 1998. The first test administration was slated for spring 1999.

Subsequent negotiations between the administration and Congress, which culminated in passage of the fiscal 1998 appropriations bill (P.L. 105-78), led to a suspension of test item development (a stop-work order) late in September 1997 and transferred to the National Assessment Governing Board (NAGB, the governing body for NAEP) exclusive authority to oversee the policies, direction, and guidelines for developing the VNT. The law gave NAGB 90 days in which to review the development plan and the contract with a private consortium, led by the American Institutes for Research (AIR), for the development work. Congress further instructed NAGB to make four determinations about the VNT: (1) the extent to which test items selected for use on the tests are free from racial, cultural, or gender bias; (2) whether the test development process and test items adequately assess student reading and
mathematics comprehension in the form most likely to yield accurate information regarding student achievement in reading and mathematics; (3) whether the test development process and test items take into account the needs of disadvantaged students, limited-English-proficient students, and students with disabilities; and (4) whether the test development process takes into account how parents, guardians, and students will appropriately be informed about testing content, purpose, and uses.

NAGB negotiated a revised schedule and work plan with AIR. It calls for test development over a 3-year period, with pilot testing in March 1999, field testing in March 2000, and operational test administration in March 2001. In addition, the work plan specifies a major decision point in fall 1998, which depends on congressional action, and it permits limited test-development activities to proceed through the remainder of the fiscal year, to September 30, 1998.

When Congress assigned NAGB responsibility for the VNT, it also called on the National Research Council (NRC) to evaluate the technical adequacy of test materials. Specifically, it asked the NRC to evaluate: (1) the technical quality of any test items for 4th-grade reading and 8th-grade mathematics; (2) the validity, reliability, and adequacy of developed test items; (3) the validity of any developed design which links test results to student performance levels; (4) the degree to which any developed test items provide valid and useful information to the public; (5) whether the test items are free from racial, cultural, or gender bias; (6) whether the test items address the needs of disadvantaged, limited-English-proficient, and disabled students; and (7) whether the test items can be used for tracking, graduation, or promotion of students. To carry out this mandate (specified in P.L. 105-78 [Sections 305–311], November 1997), the NRC appointed us as co-principal investigators.
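The freedom-from-bias question in the mandate is commonly examined with differential item functioning (DIF) analyses. As an illustration only (not the contractors' actual procedure, and with an invented data layout), the Mantel-Haenszel common odds ratio compares pass rates on one item for a reference and a focal group after matching examinees on total score; a value near 1.0 suggests the item functions similarly for both groups:

```python
# Hedged sketch: Mantel-Haenszel DIF screen for a single test item.
# Within each matched total-score level we form a 2x2 table
# (group x correct/incorrect) and pool the odds across levels.
# The record format and group labels are assumptions for illustration.
from collections import defaultdict

def mh_odds_ratio(records):
    """records: iterable of (total_score, group, correct) tuples, where
    group is 'reference' or 'focal' and correct is 0 or 1."""
    # per score level: [A, B, C, D] = [ref right, ref wrong, focal right, focal wrong]
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for score, group, correct in records:
        cell = (0 if group == "reference" else 2) + (0 if correct else 1)
        tables[score][cell] += 1
    num = den = 0.0
    for a, b, c, d in tables.values():
        n = a + b + c + d
        if n:
            num += a * d / n  # ref-right x focal-wrong
            den += b * c / n  # ref-wrong x focal-right
    return num / den if den else float("inf")
```

In operational screening the ratio is usually transformed to the ETS delta scale and flagged against conventional thresholds; the pooled ratio above is the core quantity.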
We have worked with NRC staff under the auspices and oversight of the NRC's Board on Testing and Assessment (BOTA) and have solicited input from a wide range of outside experts. The congressional charges to NAGB and to the NRC were constrained by P.L. 105-78 requirements that “no funds … may be used to field test, pilot test, administer or distribute in any way, any national tests” and that the NRC report be delivered by September 1, 1998.

The plan for pilot testing in March 1999 required that a large pool of potential VNT items be developed, reviewed, and approved by late fall of 1998, to provide time for the construction, publication, and distribution of multiple draft test forms for the pilot test. Given the March 1998 start-up date, NAGB, its prime contractor AIR, and the subcontractors for reading and mathematics test development (Riverside Publishing and Harcourt-Brace Educational Measurement) have faced a daunting and compressed schedule for test design and development, and we have been able to observe only a part of the item development process (and not its final products). Moreover, because preliminary testing and statistical analyses of test data are essential steps in the development and evaluation of new tests, our evaluation is necessarily preliminary and incomplete.

We offer substantial, but preliminary, evidence about three of the seven issues in the congressional mandate to the NRC: (1) technical quality, (2) validity, and (5) bias. These correspond to the first and second determinations to be made by NAGB: freedom from bias and accuracy of information. We have also sought evidence about the other important issues raised in the congressional mandate. Our
discussion of them, except the last—“high-stakes” use of test items, that is, for tracking, graduation, or promotion of students—is even more preliminary and is, in large part, limited to an evaluation of the NAGB and AIR plans for the subsequent stages of test development. As noted above, another congressionally mandated study of the VNT concludes that the VNT should not be used for decisions about tracking, promotion, or graduation (National Research Council, 1999b), and we concur in that recommendation. We cannot address the remaining issues definitively unless test development continues into the pilot phase and empirical data on item performance become available for analysis. We note that our charge did not include cost issues, and we did not endeavor to examine probable costs for the VNT. Issues of cost are addressed by a recent report on the VNT by the U.S. General Accounting Office (1998).

Evaluation Activities

In this phase of the VNT evaluation, we have focused largely on three aspects of the test development process and products: issues surrounding the test specifications and the NAEP frameworks; plans for the development and implementation of the pilot study, scheduled for spring 1999; and preliminary evidence of the quality of possible test items. To date, we have observed laboratory-based talk-aloud item tryout sessions with students, reviewed design and development plans and reports, examined draft test items and scoring materials, and conducted three workshops at which additional experts with a wide range of skills and experience have contributed unique and invaluable input. (See Appendices A–C for lists of the expert participants in each workshop.)

Our December 1997 workshop reviewed test specifications and linking plans. This meeting occurred prior to the revision of the NAGB and AIR work plan.
The workshop enabled us and the NRC staff to review and assess the previously developed test specifications in relation to the NAEP mathematics and reading frameworks. The workshop also reviewed issues in test equating and linkage that would be relevant to the evaluation of linkages of the VNT in reading and mathematics with the corresponding NAEP instruments, as well as to the mission of the Committee on Equivalency and Linkage of Educational Tests (see National Research Council, 1999c). We believe that the workshop helped to inform the subsequent modification of development plans by NAGB.

Our second workshop, in April 1998, reviewed pilot and field test plans. At that time, the revised development work plan was in place, and item development had resumed. The goal of this workshop was to review the contractors' plans for collecting and analyzing empirical data about the items and subsequent test forms after initial development was complete. The proceedings of that meeting, along with our review of the NAGB and AIR plans and revisions, have been very helpful in our evaluation.

For our June 1998 workshop, we led an expert review of item quality, based on a sample of reading and mathematics items that had been developed as of that date. We planned the workshop to take place late enough in the development process that we could base this report on tangible evidence of item quality but early enough for us to produce the report in a timely fashion. In fact, as described below, the workshop yielded evidence that the item development schedule did not allow enough time for item review and revision and that there might be inadequate numbers of certain types of items. We issued an interim letter report on July 16, 1998 (National Research Council, 1998), which recommended changes in the item review and revision schedule and possible development of additional items within the overall constraint of approval of the item pool by NAGB in late fall 1998.
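The equating and linkage questions reviewed at the December workshop are, at bottom, questions about mapping one score distribution onto another. One classical textbook technique is equipercentile linking, sketched below under the assumption that score samples from both tests are available; this is a generic illustration, not the specific linking design proposed for the VNT:

```python
# Hedged sketch of equipercentile linking: a score on test X is mapped to
# the test-Y score that holds the same percentile rank. Purely illustrative.
import numpy as np

def equipercentile_link(x_scores, y_scores, x_new):
    """Map a single score from test X onto the scale of test Y."""
    x_sorted = np.sort(np.asarray(x_scores, dtype=float))
    y_sorted = np.sort(np.asarray(y_scores, dtype=float))
    # percentile rank of x_new within the X distribution
    pct = np.searchsorted(x_sorted, x_new, side="right") / len(x_sorted)
    # the Y score at the same percentile rank
    return float(np.quantile(y_sorted, min(max(pct, 0.0), 1.0)))
```

Operational linking designs add smoothing of the score distributions and common examinees or common items; the percentile-matching step shown here is the conceptual core.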
We summarize the findings and recommendations from that report below, along with subsequent modifications of the development plan and schedule by NAGB and the AIR consortium. The evaluation of draft items by AIR in one-on-one think-aloud sessions with 4th- and 8th-graders
during May and June 1998—called cognitive laboratories—was a significant and innovative item development activity. The sessions were carried out as a complement to, and simultaneously with, other review processes, including professional content reviews by AIR and its subcontractors and bias and sensitivity reviews led by the subcontractors. The cognitive labs are a potentially valuable tool for test development, providing direct feedback to the developers about student understanding of items. For that reason we looked closely at the design and conduct of the labs and, in a more limited way, at the use of information from the labs in item review and revision. We also observed the bias and sensitivity reviews.

Report Overview and Themes

Ideally, if the test development process were fully informed by the desired properties of the tests and of test results and if there were no other time or resource constraints, one would expect a process whose steps were geared precisely to standards for the targeted outcomes. For example, VNT results are to be reported primarily in terms of NAEP achievement levels, so those levels should inform the specification and development of the item pool. For the same reason, the length of the test should be sufficient to ensure a range of accuracy in reports of test results that would be understandable and informative to students, parents, and teachers. The goals of inclusion and accommodation of English-language learners and students with disabilities suggest the desirability of developing a test from the beginning with tasks that are accessible to all students and whose results are comparable among all students—with or without accommodation. Finally, ideal pilot and field test plans would yield exactly the data needed to address the problems of test construction, reliable estimation, freedom from bias, and comparability among all major populations of students.
The VNT development process does not entirely meet these standards, but—relative to current professional and scientific standards of test construction—it has been satisfactory. That is, NAGB and its consortium of contractors and subcontractors have made a great deal of progress toward the goal of developing a VNT item pool of adequate size and of known, high quality. While we cannot determine in advance whether that goal will be met, we do find that the procedures and plans for item development and evaluation are sound. If the development process continues through pilot and field testing, we expect there will be clear answers about the size and quality of the item pool; the reliability, fairness, and validity of items; the accessibility and comparability of the tests for students with low English proficiency or disabilities; and the linkages among alternative test forms and of those test forms with NAEP and TIMSS. The outcomes at this stage of test development might have been much less satisfactory, and progress to date is no small achievement.

There are, however, understandable reasons why the test development process has not met the highest possible standards. First, as we have already stated, the schedule has been somewhat compressed, relative to the usual time allowed for item development. There is some compensation for the compressed schedule in the fact that the test specifications are very similar to those of NAEP and that test development has been informed by the experience of NAEP item development. Second, the test design presented some novel problems for which there are no ready solutions. For example, the compressed schedule did not permit the fundamental development work that would be required to assure both inclusion and comparable validity of test scores for students with disabilities or those with limited English proficiency (see National Research Council, 1999b).
Consequently, the developers chose to leave issues of inclusion and accommodation mainly untouched until completion of the pilot test. Furthermore, while NAEP provides some good experience in reporting specific achievement levels, the fact that test results are not reported for individual students in NAEP makes
that task much more complex for the VNT. In the case of the VNT, students, parents, and teachers will be able to compare actual test items with nominal achievement levels in the context of individual student performance, and this creates demands for face validity that do not exist in NAEP. Again, the developers appear to have used standard procedures for item development, leaving the match between potential items and achievement-level descriptions for a later phase of the development process.

Third, the design of the tests and of their results has continued to evolve during the development process. For example, while the goal of reporting in terms of achievement levels has remained constant, there has so far been no decision about the possibility of also reporting scaled scores or ranges of scores. There has also been no decision about whether and how information will be provided for students below the basic achievement level.

Fourth, the test development team is new, and its division of labor between NAGB and the prime contractor and between the prime contractor and the subcontractors is complex. Under these circumstances—essentially a trial run—one might well expect (and we have observed) less than perfect coordination of procedures, activities, and materials among development activities.

Finally, some features of the test design appear to have been determined administratively, without regard to possible implications for the validity or reliability of the test. For example, test length appears to have been determined in advance by beliefs about the potential test-taking burden. Experiments with existing subsets of NAEP items could be informative about test length, relative to desired accuracy, in various types of reports of student performance. Again, if the development process continues and the later stages go as planned, these problems may not jeopardize the quality of the final products of this round of VNT development.
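The trade-off between test length and score accuracy can be made concrete with the classical Spearman-Brown formula, which projects how reliability changes when a test is lengthened or shortened with parallel items. A small sketch follows; the 0.80 baseline reliability used in the example is an invented figure, not a VNT estimate:

```python
# Sketch: Spearman-Brown projection of test reliability under a change in
# test length. length_factor = 2.0 means doubling the number of parallel
# items; 0.5 means halving it. Classical test theory approximation only.
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Projected reliability of a test length_factor times as long."""
    return (length_factor * reliability) / (
        1.0 + (length_factor - 1.0) * reliability
    )
```

For instance, doubling a test whose reliability is 0.80 projects to about 0.89, while halving it drops the projection to about 0.67; experiments of this kind with NAEP item subsets could ground the test-length decision empirically.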
If the VNT program continues beyond the first round of development and testing, these issues should be addressed in later cycles of test development and evaluation. When the goals and products of the program are more clearly defined and the several partners of the development consortium gain experience in working together, the test development process will probably become better aligned with the program's planned outcomes.

Since item development for the VNT is still at an early stage, part of our review focused on plans for future activities, including: (1) further item review; (2) the pilot test; (3) the field test, with plans for equating and linking; (4) test administration, including accommodations for English-language learners and students with disabilities; and (5) reporting plans. In each area, we found several issues that will need attention if the program is to succeed. Without timely attention to these concerns, future progress is unlikely to be satisfactory.

The remainder of the report is in five chapters, each of which pertains to a major phase of test development: test specifications; item development, review, and revision; pilot and field test plans; inclusion and accommodation; and reporting. Of these development activities, only the first is now complete. We have considerable evidence about item development and review activities, but the schedule precludes a complete evaluation of the item pool or of the draft test forms developed for the pilot test. For the other three phases of development, we have assessed plans as of mid-July 1998. Our findings, conclusions, and recommendations are based both on input from the experts and other participants at our three workshops and on our analysis and assessment of the published materials cited throughout this report and other documents made available to us by NAGB and AIR (see Appendix D).