Evaluation of “Redesigning the National Assessment of Educational Progress”
BACKGROUND
The origins and evolution of the National Assessment of Educational Progress provide the necessary backdrop to our analysis of NAEP's mission and measurement objectives, its design, and its governance and management structure.
NAEP's Origin and Evolution
In 1963 Francis Keppel, then U.S. Commissioner of Education, appointed a committee to explore options for assessing the condition and progress of American education. The committee's chair, Ralph Tyler (1966:95), described the need for a base of information to help public officials make decisions about education:
. . . dependable information about the progress of education is essential. . . . Yet we do not have the necessary comprehensive and dependable data; instead, personal views, distorted reports, and journalistic impressions are the sources of public opinion. This situation will be corrected only by a careful, consistent effort to obtain data to provide sound evidence about the progress of American Education [italics added].
In 1966 the Keppel committee recommended that a battery of tests be developed to the highest psychometric standards and with the consensus of those who would use it. NAEP was conceived to provide that information base, to monitor the progress of American education (National Center for Education Statistics, 1974; U.S. Congress, 1992).
The design of the original battery reflected the political and social realities of the time (National Assessment Governing Board, no date). Prominent among these was the resistance of state and local policy makers to a national curriculum; local leaders feared federal erosion of their autonomy and voiced concern about pressure for accountability. NAEP's designers responded by defining testing objectives for NAEP that were too expansive to be incorporated in any single curriculum. They specified that results be reported for specific test exercises, not in relation to broad knowledge and skill domains. Tests were developed for and administered to 9-, 13-, and 17-year-olds rather than to individuals at specific grade levels. These features, combined with matrix sampling—which distributed large numbers of items broadly across school buildings, districts, and states, but limited the number of items given to individual examinees—thwarted perceptions of NAEP as a federal testing program addressing a nationally prescribed curriculum. Indeed, NAEP's design provided nationally and regionally representative data on the educational condition of American youth while avoiding any implicit federal standards or state, district, and school comparisons. NAEP was described as the nation's education barometer.
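As a rough illustration of the matrix sampling idea described above, the following Python sketch rotates blocks of items across examinees so that each examinee answers only a small part of the item pool while every block is still administered to a comparable number of examinees. The function names, the block counts, and the simple rotation scheme are assumptions made for illustration; they are not taken from NAEP's actual sampling procedures.

```python
import random
from collections import Counter


def assign_item_blocks(num_examinees, num_blocks, blocks_per_booklet, seed=0):
    """Give each examinee a short booklet drawn from a larger pool of item blocks.

    Hypothetical rotation scheme: booklets start at successive blocks and wrap
    around, so no examinee sees the whole pool but every block gets used.
    """
    if blocks_per_booklet > num_blocks:
        raise ValueError("a booklet cannot hold more blocks than the pool contains")
    rng = random.Random(seed)
    start = rng.randrange(num_blocks)  # random starting point for the rotation
    booklets = []
    for i in range(num_examinees):
        first = (start + i) % num_blocks
        booklets.append([(first + k) % num_blocks for k in range(blocks_per_booklet)])
    return booklets


def block_coverage(booklets):
    """Count how many examinees were administered each item block."""
    return Counter(block for booklet in booklets for block in booklet)


if __name__ == "__main__":
    booklets = assign_item_blocks(num_examinees=70, num_blocks=7, blocks_per_booklet=2)
    # Each examinee answers 2 of the 7 blocks; each block reaches 20 examinees.
    print(sorted(block_coverage(booklets).items()))
```

Because no examinee answers more than a fraction of the pool, a design of this kind supports group-level estimates but not individual scores, which is consistent with the reporting limits described above.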
Over the following decade, the educational landscape changed. Schools across the United States developed new programs to respond to various federally sponsored education initiatives. The Elementary and Secondary Education Act of 1965 established mechanisms through which schools could address the learning needs of economically disadvantaged students. In the ensuing years, federal support expanded to provide additional resources for, among others, students with limited English proficiency and students with disabilities. As federal initiatives expanded educational opportunities at the local level, however, they fostered an administrative imperative for assessment data to help gauge the effect of these opportunities on the nation's education system.
NAEP's original design could not accommodate the increasing demands for data about federal education innovations. Its reporting scheme, for example, allowed for the measurement of change on individual exercises, but not on the broad content domains that were evolving. Furthermore, age-level (rather than grade-level) testing made it difficult to link NAEP results to state and local education policies and school practices. Increasingly, NAEP was asked to provide more detailed information so that government and education officials would have a stronger basis for judgments about school effectiveness; NAEP's constituents were seeking information that, in many respects, conflicted with the basic design of the program.
Redesign of the Original Plan
The first major redesign of NAEP was implemented in 1984, when its development and administration moved from the Education Commission of the States to the Educational Testing Service. The design for NAEP's second generation (Messick et al., 1983), with its changes in sampling, objective-setting, exercise development, data collection, and analysis, reflected the growing federal role in American education. The introduction of balanced incomplete block designs for matrix sampling, model-based approaches to item scaling within content domains and across age and grade cohorts, and statistical adjustments based on collateral information about examinees afforded NAEP much greater flexibility in responding to policy demands as they evolved.
Almost concurrently, A Nation at Risk (National Commission on Excellence in Education, 1983) warned that America's schools and its students were performing below expectation. The report's publication spawned a wave of state-level education reform. As states invested more and more in their education systems, they sought information about the effectiveness of their efforts.
In the face of rising costs and multiple demands, policy makers looked to NAEP for guidance on the effectiveness of alternative practices. The National Governors' Association then called for state-comparable achievement data, and a new report, The Nation's Report Card (Alexander and James, 1987), recommended that NAEP be expanded to provide state-level results. This recommendation was a dramatic departure from the original NAEP model.
Soon thereafter, participants in the 1989 Education Summit in Charlottesville, Virginia, challenged the prevailing assumptions about national expectations for achievement in America's schools. President Bush and the nation's governors established six national goals for education (America 2000, 1991). Goal three specified the subjects and grades in which progress should be measured with respect to national and international frames of reference. By design, these subjects and grades paralleled NAEP's structure. The governors called on educators to hold students to “world-class” knowledge and skill standards. The governors' commitment to high academic standards included a call for the articulation of NAEP results by achievement levels and performance standards. In addition to describing what students know and can do, NAEP was being asked for judgments about the adequacy of observed performance. In the governors' terms, NAEP was asked to test not only what students currently know and can do, but also what young people should know and be able to do.
Current Design and Governance
NAEP surveys the achievement of students at ages 9, 13, and 17 and in grades 4, 8, and 12. The current program calls for assessment in geography, reading, writing, mathematics, science, U.S. history, world history, the arts, civics, and other academic subjects. Three subjects are tested at each biennial administration. As many as 26 different nonparallel test booklets are used at each age and grade level.
During the 1990s, two subjects will have been tested twice in the main assessment, six subjects once, and two subjects not at all. At each administration, three sets of batteries are given: main NAEP, trend NAEP, and state NAEP. Between 150 and 170 distinct subsamples are drawn for each NAEP administration. The characteristics of score distributions are estimated with complex statistical methods, such as conditioning and multiple imputation of plausible values, which are based on sophisticated scaling models. Results are reported in terms of scaled scores, percentiles, anchor points with exemplar items, and NAGB achievement levels with exemplar responses. The 1992 NAEP mathematics report included seven volumes and over 1,800 pages. Given this complexity, it is perhaps not surprising that anomalies have arisen in recent assessments (U.S. General Accounting Office, 1992) and that controversy has plagued the development and reporting of results using performance standards (National Academy of Education, 1993; U.S. General Accounting Office, 1993).
NAEP's multiplicity of purpose has resulted not only in its complicated design, but also in an increasingly complex governance structure. Amendments to the authorizing statute for NAEP in 1988 established the present structure. Under the structure, the Commissioner of Education Statistics, who heads the National Center for Education Statistics (NCES) in the U.S. Department of Education, retains responsibility for NAEP operations and technical quality control. NCES procures test development and administration services from cooperating private companies; currently, they are the Educational Testing Service and WESTAT. The program is governed by the National Assessment Governing Board (NAGB or Governing Board), appointed by the Secretary of Education but independent of the department. The Governing Board, which is authorized to set policy for NAEP, is designed to be broadly representative of NAEP's varied audiences. It selects the subject areas to be assessed and ensures that content is planned through a national consensus process; the Governing Board currently contracts with the Council of Chief State School Officers for national consensus development. In addition, the Governing Board identifies achievement standards for each subject and grade tested, in conjunction with its contractor, the American College Testing Program; it also develops guidelines for reporting. Previously, many of these functions were carried out by advisers to NCES's cooperative test development agencies. NAGB's authority to oversee NAEP and give direction to NCES and the cooperative agencies parallels that of the Commissioner of Education Statistics to direct and execute the program. The U.S. Department of Education recently commissioned a review of NAEP's management and methodological procedures.
That review concluded that confusion over the management structure of NAEP has complicated the program, slowed operations, and increased assessment costs (KPMG Peat Marwick LLP and Mathtech, Inc., 1996). Tension between NAGB and NCES and consensus-based decision making also were said to contribute to these problems.
PROPOSED REDESIGN
NAEP has chronicled educational performance for over a quarter of a century. It has been an unparalleled source of information about the academic proficiency of U.S. students, providing among the best available trend data on the academic achievement of elementary, middle, and secondary students in core subject areas. In addition, NAEP has distinguished itself in setting an innovative and rigorous agenda for conventional and performance-based testing. Because NAEP has been a leader in American testing, it is imperative that its redesign honor this tradition of excellence.
In its redesign proposal, the Governing Board concludes that “in its current form, the National Assessment provides too little information, too infrequently and too late.” Our committee agrees with this conclusion. We believe three problems drive these difficulties: NAEP's unattainably broad measurement agenda, the complicated design that results from it, and confusion over management and oversight responsibility. The committee notes that these problems have been described in various commentaries by other professional groups concerned about the redesign of NAEP (e.g., Forgione, 1996; Glaser et al., 1996; Johnson, 1996; KPMG Peat Marwick LLP and Mathtech, Inc., 1996; Porter and Kilgore, 1996).
The guiding principles for the NAEP redesign listed in the May 1996 Governing Board draft proposal state that the new assessment should: test annually according to a publicly released schedule; provide state-level results in reading, writing, math, and science at grade 4 and grade 8 according to a predictable schedule; use performance standards for reporting whether student achievement is “good enough”; use international comparisons where feasible; help states and others link their assessments with the National Assessment; vary the amount of detail in testing and reporting; simplify the National Assessment test design; keep test frameworks and specifications stable for at least 10 years; simplify how student achievement trends are reported; emphasize grade-based reporting over age-based reporting; make use of innovations in testing and reporting; and use an appropriate mix of multiple choice and performance test questions. (See the appendix for the full draft of the NAGB proposal; a slightly modified version was adopted by NAGB on August 2, 1996.)
EVALUATION OF THE PROPOSED REDESIGN
We commend the National Assessment Governing Board for reviewing and seeking to improve the current program. The committee supports a number of elements in the Governing Board's redesign proposal. We agree with the desire to accelerate the reporting schedule after testing and to provide more comprehensible results to policy makers and the public. We also agree that the availability of public and predictable schedules for the main, trend, and state assessments is important for planning in numerous policy arenas. We applaud the intention to strengthen the high school data collections. And, like NAGB, we see merit in exploiting new technologies for NAEP to increase the efficiency and accuracy of the assessment.
The program described by the Governing Board's redesign proposal is laudable in many respects. Overall, however, the document is an amalgam of disparate needs and elements, and it places an inordinate faith in the undefined concept of simplification. Although it recognizes that NAEP has been asked to do “more and more beyond its central purposes,” it refrains from serious discussion of the hard political and technical choices that are needed. Our concern is less with any specific element than with the assemblage of elements, less with any given goal than with the lack of clear priorities and the lack of detail about how the goals might be achieved. This finding is the basis for the committee's recommendation that redesign measures undertaken now be considered interim solutions: steps along the way to a fundamental rethinking of NAEP.
Multiple and Varied Purposes
In a minority statement to the Alexander and James report (1987), Linda Darling-Hammond presaged our reaction to the 1996 redesign proposal:
The effort to make NAEP data useful for a greater range of purposes will undermine the assessment's capacity to perform its basic mission effectively.
There is a delicate balance between developing a first-rate assessment of what the nation's students know and can do and attempting to negotiate a multi-purpose testing and data collection effort that may satisfy many objectives superficially but none of them well (p. 31).
It is something of a truism to say that testing has become the victim of its own success. Federal, state, and local policy makers, education administrators, curriculum specialists, educators, researchers, the business community, media, parents, students, and the general public all have legitimate interests in the status of U.S. education. Student achievement data have become the indicator of choice: to gauge the impact of federal and state investment in education, to make judgments about teacher effectiveness or school quality, to review the effectiveness of programs, to evaluate educational innovations, as accountability measures, for classroom feedback, for individual credentialing, and for international comparisons. Congress, the Department of Education, the National Center for Education Statistics, and the National Assessment Governing Board have all succumbed to the growing desire for more and more information about student achievement, and Darling-Hammond's cautionary advice notwithstanding, they are asking NAEP to provide it all. The committee concludes that this underlying desire cannot be met by NAEP: the universe of possible interests cannot be served simultaneously and well by the same assessment.
In the most recent reauthorization of the National Assessment (Improving America's Schools Act 1994, P.L. 103-382), Congress mandated that it should:
. . . provide a fair and accurate presentation of educational achievement in reading, writing, and other subjects included in the third National Education Goal, regarding student achievement and citizenship.
To implement this charge, the Governing Board adopted three objectives for NAEP: to measure national and state progress toward the third National Education Goal and provide timely, fair and accurate data about student achievement at the national level, among states, and in comparison with other nations; to develop, through a national consensus, sound assessments to measure what students know and can do as well as what they should know and be able to do; and to help states and others link their assessments to the National Assessment and use National Assessment data to improve education performance.
This agenda has been constructed over the last 8 years. It is the crux of the problem. While each of these objectives is in itself a worthy goal, collectively they have produced a testing program that everyone admits is overburdened and excessively complex. The failure to adopt a workable set of priorities, together with the addition of ambitious plans for annual administrations and additional subjects, suggests that the redesign has the potential to continue and perhaps exacerbate the problems it is seeking to solve.
The tension between assessment for national and state education goals epitomizes the committee's concern about the many and diffuse purposes of the national assessment. Without question, there is great public interest in the progress of states toward self-determined education goals and standards, and this interest has, over the years, moved NAEP in a direction that better serves the states in their pursuit of information to evaluate progress. At the same time, the addition of the trial state assessments to NAEP necessitated significant accommodations in the design, scheduling, and reporting of assessments and their results. It is the relative cost and benefit of accommodations of this sort that the committee believes must receive more careful scrutiny in the redesign of NAEP.
For example, the sampling framework to support inferences about state-level performance differs from that required for inferences about the nation as a whole, resulting in separate samples being drawn for national and state NAEP. A natural response to this apparent duplication is discussed in the May draft; it proposed developing new sampling methods so that both kinds of inferences can be supported by a single sample. Several alternatives for such combined sampling have been reviewed in the interim by the Design and Feasibility Team commissioned by NAGB. Its evaluation of alternatives accentuates technical difficulties that would arise in the areas of equating, content sampling, and participation for national versus state NAEP (Forsyth et al., 1996). In the final policy statement, NAGB steps away from combined sampling as a viable solution to the problem of trying to reconcile two very different, demanding objectives. This does not resolve the dilemma, however, of trying to serve both national and state needs adequately.
Other conflicts between needs for national and state assessments are less easy to anticipate, but important to consider. To the extent that NAEP moves toward greater focus on the states, one might expect increased interest in the degree to which the curriculum frameworks developed at the national level accurately characterize state curricula, education goals, and standards. Although the procedures in place for developing curriculum frameworks yield broadly representative specifications for test content, they are not designed for alignment with particular curricula and standards. Whether states will want them to become so aligned is a matter of state education policy, but that alignment is a critical component of the validity of inferences based on NAEP.
The juxtaposition of competing values for national and state assessment focuses attention on the need for informed discussion of the central purposes for national assessment. It also makes manifest the complexity of further extensions of NAEP to achieve, for example, valid international benchmarks for performance or of including in the assessment some populations (e.g., students with various disabilities) that may require selection of exercises and other alterations to accommodate their special situations.
Insufficient Detail
The very general nature of the Governing Board's redesign document makes it difficult to evaluate its feasibility. The redesign proposal lacks specificity and detail with respect to the new assessment's design, administration, analysis, and reporting schemes. Fruitful debate about the redesign objectives will require more information about the feasibility and likely psychometric characteristics of the assessment envisioned.
The proposal does not specify how the redesign objectives will be achieved or at what cost, in terms of validity, reliability, and timeliness as well as funding. To give but one example, detail is lacking on the means by which trend data would be collected through the main assessment. The redesign proposal states that it may be impractical and unnecessary to operate two separate testing programs and that a plan should be developed to allow the main assessment to become the primary way to measure trends in reading, writing, mathematics, and science. It does not explore in any depth options for combining the main and trend data collections, nor does it describe a process for analyzing and deciding among alternatives. It does, however, state the Governing Board's intention to do away with the trend assessment after “. . . a carefully planned transition . . .” (p. 7).
Collapsing the main and trend data collections would be very difficult for a number of reasons. The most obvious obstacle is that the content frameworks for the two assessments are different. The likelihood of even minor changes in frameworks and items raises questions about the validity of trend lines that would be based on the main assessment. The proposal to combine the trend and main assessments would thus jeopardize NAEP's continuity over time and thereby undermine what is, in the committee's view, one of NAEP's unique and most valuable features. Once broken, the chain of evidence is irretrievable. The committee recommends that a separate collection of NAEP trend data be continued. This recommendation comports with our recommendation that the contemplated redesign be viewed as a set of limited, interim solutions, including no elements that could compromise a more ambitious reconceptualization in the future.
Both the Department of Education and the National Assessment Governing Board will be aided in thinking about how the redesign proposal might be operationalized by the recent report of the Design and Feasibility Team (Forsyth et al., 1996). This group of leading measurement experts was asked by the Governing Board to suggest operational alternatives for the redesign objectives. It is important to note that the focus of the study was not on the feasibility and impact of specific changes (or, more importantly, the constellation of changes) considered by the Design and Feasibility Team. These are as yet largely unknown. The research undertaken by NAGB and NCES in coming months will need to address these questions.
Insufficient Empirical Base
NAEP's managers seek to simplify the program in ways that would release funds for more frequent assessment, additional subject testing, accelerated reporting, and other enhancements. However, streamlining NAEP's measurement and administration designs in accordance with the proposal will be exceedingly difficult, both conceptually and technically. A number of issues work against parsimony. First, the policy framework for NAEP is unclear; parsimony rests, at least in part, on clarity of purpose. Second, NAEP's measurement and design properties are very complex, so that many of NAEP's findings derive from sequential calculations.
Third, the expansion of NAEP's client base to state testing offices intensifies, rather than simplifies, the burdens of NAEP for sampling, administration, analysis, and reporting.
Answers to some of these outstanding issues may come during the next 6 months when several commissioned papers on critical features of the present NAEP will be issued. However, the current schedule for letting contracts to carry out elements of the plan may preclude the possibility for the redesign to be influenced by three such studies: Quality and Utility: The 1994 Trial State Assessment in Reading (1996) and Capstone Report of the National Academy of Education Panel on the Evaluation of NAEP (in press), both coming from the National Academy of Education, and NAEP Validation Studies white papers (National Center for Education Statistics, no date). Another important document, the KPMG Peat Marwick LLP and Mathtech, Inc. (1996) Review of the National Assessment of Educational Progress: Management and Methodological Procedures, was only recently released: it addresses feasibility issues with direct bearing on the redesign proposal, but was not available when NAGB's proposal was being written. Among its important conclusions was the finding that several of NAGB's widely discussed revisions are unlikely to provide cost savings. Given the importance of assumed cost reductions through simplification to the expansive elements in the redesign proposal (e.g., more frequent testing), this conclusion is troubling.
Decisions about simplification should represent an informed balance of estimated costs and proposed benefits, and they should be based on understanding of the implications for the technical integrity of the test instruments in light of the purposes currently espoused. The committee recommends that the U.S. Department of Education evaluate the cost implications of specific aspects of the redesign proposal.
Despite the impressive work of the Design and Feasibility Team, KPMG Peat Marwick and Mathtech analysts, and others, several basic premises of the proposed redesign are as yet unsupported by empirical data. While the committee recognizes that NAEP cannot lie dormant while all relevant research is conducted, the feasibility of major changes should be explored prior to their approval. Such specific issues as sample size, test content, length, item format mix, and scoring procedures merit empirical investigation. Provisional estimates of the effects of various proposed design changes can be obtained by analyses of existing NAEP data. Among the questions that would reward empirical analysis are the following:
How would recommended changes affect the reliability and validity of cross-sectional data?
How would changes in the numbers and types of items affect the reliability and validity of trend information?
Would combining trend and cross-sectional assessments compromise the measurement of long-term trends?
Are results for the developmental achievement levels sufficiently accurate and informative that they should be operationally adopted?
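One of the questions above concerns how changes in the number of items would affect reliability. Under classical test theory, and assuming that added or removed items are parallel to the existing ones, the Spearman-Brown formula gives a first approximation. The Python sketch below is illustrative only: NAEP's matrix-sampled, model-based scaling does not satisfy these assumptions directly, and the function name and example values are hypothetical rather than drawn from NAEP data.

```python
def spearman_brown(reliability, length_factor):
    """Predict reliability after a test's length is multiplied by length_factor.

    Assumes classical test theory with parallel items (an assumption made for
    illustration; it is not a description of NAEP's scaling methods).
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return (length_factor * reliability) / (1.0 + (length_factor - 1.0) * reliability)


if __name__ == "__main__":
    current = 0.80  # hypothetical reliability of an existing item set
    for factor in (0.5, 1.0, 2.0):
        print(f"length x{factor}: predicted reliability {spearman_brown(current, factor):.3f}")
```

A calculation of this kind can only bound expectations; the empirical analyses of existing NAEP data that the committee calls for would still be needed to answer the questions listed.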
In addition, there are a number of other questions for which research has long been needed. Alternatives for enhancing student motivation on a low-stakes exam warrant study, as do strategies for increasing the value of NAEP information to various users. As accountability and assessment become more integrally linked in state testing programs, questions about possible changes in test preparation and test performance also become important. Extending assessment to students with special learning needs and those with limited English proficiency presents challenges in test development, analyses, and reporting. Answers to these questions will be critical to a useful redesign of NAEP.
The committee understands that NAGB intends to move ahead with strategies for revision of NAEP while planning to contract for research on the questions for which empirical data are needed. This approach raises concern that fundamental policy issues about national and state assessment would be decided in negotiations with a variety of contractors. Although such an approach might yield creative solutions, it also affords little protection against conflicts of interest and does not allow for a broad view and high-level policy attention to the cohesiveness of the program and the integrity of NAEP.
FUNDAMENTAL RECONCEPTUALIZATION OF NAEP
We recommend that the National Assessment Governing Board and the U.S. Department of Education consider the NAGB redesign proposal as a range of possible interim measures to alleviate some of the immediate pressures on NAEP while undertaking a more fundamental rethinking of NAEP's goals and character. The advantages of thinking in terms of interim solutions are several. First, this approach defines some bounds for the redesign in terms of time, expenditures, and expectations.
A deliberate decision to work on interim solutions would suggest avoiding changes that could inadvertently damage or constrain future options when a comprehensive redesign is undertaken. Such goals as shorter reporting times, a known, regular schedule of administrations, and simpler, more comprehensible reports would seem to satisfy this criterion. Second, viewing the redesign as an interim measure helps clarify what is workable at this stage. The redesign proposal does not make any major adjustments to the avowed purposes of NAEP; as a consequence, any simplifications in test design, sampling, or administration should be limited to those that will not compromise the quality of the data that federal and state policy makers assume is present. And finally, accepting such limited goals provides time in which to engage in a fundamental rethinking of NAEP and its purposes.
We urge a modest approach in the current design, but we are also strongly convinced of the need for the National Assessment Governing Board, the National Center for Education Statistics, and the Congress to embark on a process of rethinking the National Assessment of Educational Progress from the ground up. Paramount among issues that require close examination are the assessment's purpose and measurement objectives. NAEP has been called on to inform policy debates about the academic achievement of U.S. students, equality of educational opportunity, human resource issues, school effectiveness, and the attainment of forward-looking performance standards. There is general agreement that this is a mandate no single testing program can fulfill. Officials in Congress, the Executive Branch, the states, and members of the Governing Board need to make choices. These political choices need to be informed by a careful weighing of technical options and information needs.
For example, for policy makers to decide whether a focus on state testing should be a top priority for NAEP, questions about linkage to state frameworks, state assessments (both conventional and performance-based), and state reporting requirements would be pressing. The means by which such links could be made are unknown. More extensive research on linking or comparability would be needed, in combination with deliberation about the types and levels of service NAEP could support under various funding assumptions. Moreover, problems associated with participation rates and desired inferences to smaller sampling units would need to be specifically addressed.
The use of technology to streamline certain aspects of the national assessment should be considered in expanding the redesign initiative. For example, NAEP has successfully implemented new scanning technologies to create image databases that increase the efficiency of scoring open-ended exercises. Other such uses of technology to streamline operations have obvious appeal with regard to costs and timely reporting. Less obvious, however, is the gain to be achieved by computerized test delivery and adaptive testing, the former because of the comparatively low cost of paper-and-pencil tests and the latter because adaptive methods are not well suited for summarizing aggregate performance across domains. Just how technology can be
used to cut costs and improve measurement in NAEP requires careful evaluation and planning.
Decisions about the long-term future of NAEP also need to take account of likely new directions in testing and assessment. For example, understanding of knowledge structures and the way individuals acquire and represent knowledge is very different from what it was when NAEP and other large-scale testing programs began (Glaser et al., 1992; Gifford and O'Connor, 1992; Wittrock and Baker, 1991). Scientific information about learning and cognition is only beginning to be applied to test design in the United States. But there is every reason to think that the burgeoning sciences of learning and cognition will have a significant impact on education and, therefore, on assessment as well. In keeping with its role as a leader in assessment, NAEP in the twenty-first century should grow out of the science of learning.
CONCLUSION
A common perception of those who have watched NAEP over the years is of an ever-expanding black box with contents that are thoroughly understood by an ever-shrinking number of specialists. The Governing Board's proposal adjusts the dimensions of the box, shakes its contents, and smooths some of its rougher edges. However, it stops short of examining what belongs inside. The committee believes that NAGB's proposal to redesign NAEP should represent an initial step in reconsideration of the fundamental purposes and practices of the national assessment. NAEP's prominence in American education, its allure as a palliative in addressing education's ills, and the diverse interests of legitimate stakeholders all operate to make its mission diffuse. NAEP's complex design and cumbersome management and governance structure mirror its many purposes. The committee believes that NAEP's reconceptualization should directly address the fundamental tensions among measurement purposes.