2

Streamlining the Design of NAEP

Summary Conclusion 2. Many of NAEP's current sampling and design features provide important, innovative models for large-scale assessments. However, the proliferation of multiple independent data collections—national NAEP, state NAEP, and trend NAEP—is confusing, burdensome, and inefficient, and it sometimes produces conflicting results.

Summary Recommendation 2. NAEP should reduce the number of independent large-scale data collections while maintaining trend lines, periodically updating frameworks, and providing accurate national and state-level estimates of academic achievement.

INTRODUCTION

NAEP provides important information about the academic achievement of America's youth, and the assessment has many strong design features. For example, NAEP's sampling, scaling, and analysis procedures serve as important models for the measurement community. The frameworks and innovative assessment materials serve as guides for state and local standards and assessment programs, and state NAEP results provide a useful backdrop for state and local assessment data.

In this chapter, we describe and evaluate NAEP's current sampling, data collection, analysis, and reporting methods.




As background, we review the current NAEP assessments, the sampling designs and analysis methods used, and the reports generated. We then briefly review the findings of previous evaluations and provide our own evaluation of the strengths and weaknesses of the current design. Our conclusions lead us to recommend strengthening NAEP's design and increasing its usefulness. We argue for reducing the number of independent large-scale data collections currently carried out. We discuss and provide proposals for:

•  Combining the trend NAEP and main NAEP designs in core subjects to preserve measurement of trends and allow updating of frameworks;
•  Using more efficient sampling procedures for national NAEP and state NAEP in order to reduce the burden on states and schools, decrease costs, and potentially improve participation rates;
•  Using multiple assessment methods to assess subject areas for which testing frequency generally prohibits the establishment of trend lines;
•  Exploring alternatives to the current assessment of twelfth graders by NAEP, with the goals of minimizing bias associated with differential dropout rates and the differing course-taking patterns of older students, encouraging student effort, and expanding assessment domains to include problem solving and other complex skills critical to the transition to higher education, the workplace, and the military; and
•  Improving NAEP reports by providing (1) descriptive information about student achievement, (2) evaluative information to support judgments about the adequacy of student performance, and (3) contextual, interpretive information to help users understand students' strengths and weaknesses and better investigate the policy implications of NAEP results.

OVERVIEW OF NAEP'S CURRENT SAMPLING, DATA COLLECTION, ANALYSIS, AND REPORTING PROCEDURES

The National Assessment of Educational Progress is mandated by Congress to survey the academic accomplishments of U.S. students and to monitor changes in those accomplishments over time. Originally, NAEP surveyed academic achievement and progress with a single assessment; it has evolved into a collection of assessments that now includes the trend NAEP and main NAEP assessments. Main NAEP has both national and state components. National NAEP includes the large-scale survey assessments and a series of special studies that are not necessarily survey-based. Special studies generally focus on specific portions of NAEP's subject domains and on the associated teaching and learning data. Current NAEP is described in the Introduction; Figure I-1 shows the components of the current program.

Components of Current NAEP

The primary objective of trend NAEP is to provide trend lines of educational achievement for the U.S. population and major population subgroups over extended time periods. To avoid disruptions in trend lines caused by differences in NAEP administration or content, administration procedures and assessment items for trend NAEP are held as constant as possible over time. Main NAEP is a larger assessment program than trend NAEP; it provides more precise estimates of educational achievement in population subgroups, includes more contextual variables, and is based on frameworks that are updated on a regular basis to reflect changes in curriculum and pedagogical thought. Again, main NAEP includes both national and state components. The state data collections are structured to provide estimates with adequate degrees of precision for individual states.

Tables 2-1 through 2-3 summarize the administrations of current NAEP since 1984, with assessments based on the same frameworks indicated by the same symbol and joined by lines to indicate whether trend estimation is feasible. Note that, in addition to the trend lines established using trend NAEP, short-term trend lines for main NAEP have been established in reading in national NAEP (grades 4, 8, and 12) and state NAEP (grade 4) from 1992 to 1998. Short-term trend lines from 1990 to 1996 have also been established in mathematics in national NAEP (grades 4, 8, and 12) and state NAEP (grade 8; and, for 1992-1996, grade 4). However, as noted previously, the short-term trend lines of national NAEP and state NAEP reflect different assessment materials and student samples than does trend NAEP.

NAEP's multiple assessment programs evolved to preserve trend lines while allowing NAEP frameworks to be updated and state-level estimates to be obtained in main NAEP. The distinct programs allow the objectives of each component to be achieved without compromising the aims of the others. However, it may be unnecessary to have separate assessment programs with such similar objectives. Later in this chapter, we consider whether there is a compelling need for distinct assessment programs or whether these activities could be merged.

Sampling Designs for Current NAEP

The NAEP program differs fundamentally from other testing programs in that its objective is to obtain accurate measures of academic achievement for populations of students rather than for individuals. This goal is achieved using innovative sampling, scaling, and analysis procedures, which we discuss next. Note that their description and evaluation rely on technical terminology that is difficult to translate into nontechnical terms; technical language is used in this chapter in a way that is atypical of the remainder of this report.

TABLE 2-1 NAEP Frameworks, Designs, and Samples by Discipline and Year Tested: Reading and Writing

TABLE 2-2 NAEP Frameworks, Designs, and Samples by Discipline and Year Tested: Science and Mathematics

TABLE 2-3 NAEP Frameworks, Designs, and Samples by Discipline and Year Tested: Geography, History, and Civics

NAEP tests a relatively small proportion of the student population of interest using probability sampling methods. Constraining the number of students tested allows resources to be devoted to ensuring the quality of the test itself and its administration, resulting in considerably better estimates than would be obtained if all students were tested under less controlled conditions. The use of sampling also greatly reduces the burden placed on students, states, and localities in comparison with a national testing program that tests a substantial fraction of the nation's children.

The national samples for main NAEP are selected using stratified multistage sampling designs with three stages of selection. The samples since 1986 include 96 primary sampling units consisting of metropolitan statistical areas (MSAs), a single non-MSA county, or a group of contiguous non-MSA counties. About a third of the primary sampling units are sampled with certainty; the remainder are stratified, and one unit is selected from each stratum with probability proportional to size. The second stage of selection consists of public and nonpublic schools within the selected primary sampling units. For the elementary, middle, and secondary samples, independent samples of schools are selected with probability proportional to measures of size. In the final stage, 25 to 30 eligible students are sampled systematically, with probabilities designed to make the overall selection probabilities approximately constant, except that more students are selected from small subpopulations, such as private schools and schools with high proportions of black or Hispanic students, to allow estimates with acceptable precision for these subgroups. In 1996 nearly 150,000 students were tested in just over 2,000 participating schools (Allen et al., 1998a).

The sampling design for state NAEP has only two stages of selection—schools and students within schools—since clustering of schools within states is not necessary for economic efficiency (Allen et al., 1998b). In 1996, approximately 2,000 students in 100 schools were assessed in each state for each grade. Special procedures were used in states with many small schools for reasons of logistical feasibility.

The national and state designs limit students to one hour of testing time, since longer test times are thought to impose an excessive burden on students and schools. This understandable constraint limits the ability to ask enough questions in the NAEP subject areas to yield accurate assessments of ability for individual students or for subareas of a discipline. Time limits and NAEP's expansive subject-area frameworks have led to students receiving different but overlapping sets of NAEP items, using a form of matrix subsampling known as balanced incomplete block spiraling. The data matrix of students by test questions formed by this design is incomplete, which complicates the analysis. The analysis is currently accomplished by assuming an item response theory model for the items and drawing multiple plausible values of the ability parameters for sampled students from their predictive distribution given the observed data (Allen et al., 1998a).
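To make the sampling and spiraling machinery described above more concrete, the sketch below illustrates, in Python, systematic probability-proportional-to-size selection of schools and a simple rotation of booklets across sampled students. The school frame, enrollment figures, booklet labels, and sample sizes are hypothetical, not NAEP's operational specifications, and certainty selections and stratification are omitted for brevity.

    # Illustrative sketch only: systematic probability-proportional-to-size (PPS)
    # selection of schools, followed by a simple "spiral" assignment of booklets
    # (block combinations) to students within each sampled school. All names and
    # numbers are invented; certainty units and stratification are omitted.
    import random

    def pps_systematic_sample(units, sizes, n):
        """Select n units with probability proportional to a measure of size,
        using systematic sampling from a random start."""
        total = float(sum(sizes))
        interval = total / n
        point = random.random() * interval        # random start in [0, interval)
        chosen = []
        pairs = iter(zip(units, sizes))
        unit, size = next(pairs)
        upper = size                              # running cumulative size
        for k in range(n):
            target = point + k * interval
            # advance to the unit whose cumulative size interval covers the target
            while target >= upper:
                unit, size = next(pairs)
                upper += size
            # a unit larger than the interval can repeat here; in practice such
            # units would be removed and taken with certainty beforehand
            chosen.append(unit)
        return chosen

    def spiral_assign(num_students, booklets):
        """Rotate booklets across students so each booklet is used about equally."""
        return [booklets[i % len(booklets)] for i in range(num_students)]

    # Hypothetical frame: 500 schools with enrollment as the measure of size.
    schools = [f"school_{i:03d}" for i in range(500)]
    enrollments = [random.randint(50, 1200) for _ in schools]

    sampled_schools = pps_systematic_sample(schools, enrollments, n=100)
    booklets = ["B1", "B2", "B3", "B4", "B5", "B6"]
    assignments = {s: spiral_assign(25, booklets) for s in sampled_schools}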

The school and student sampling plan for trend NAEP is similar to the design for national NAEP. Schools are selected on the basis of a stratified, three-stage sampling plan, with counties or groups of contiguous counties defined by region and community type and selected with probabilities proportional to size. Public and nonpublic schools are then selected. In the third stage, students within schools are randomly selected for participation. Within schools, students are randomly assigned to either mathematics/science or reading/writing assessment sessions, with item blocks assigned using a balanced incomplete design. In 1996, between 3,500 and 5,500 students were tested in mathematics and science and between 4,500 and 5,500 in reading and writing (Campbell et al., 1997).

Analysis Methods for Current NAEP

Standard educational tests generally involve a large enough set of items that an individual student's proficiency on a tested topic can be captured with minor error by a simple summary, such as a total or average test score. Since everyone takes the same test (or, if different versions are used, the alternatives are carefully designed to be parallel), scores from different students can be compared directly and distributions of ability estimated. These simple approaches to analysis do not work well for the NAEP assessments, however, because the tests are short and contain relatively heterogeneous items; only in combination do multiple test forms capture NAEP subject areas adequately. As a result, simple summary scores for NAEP have sizable measurement error, and scores from different students can vary substantially because of differences in the items appearing on individual test forms.

The analysis for main NAEP and trend NAEP therefore needs a glue to patch together results from heterogeneous forms assigned to heterogeneous students into clear pictures of educational proficiency. The glue of current NAEP analysis is supplied by item response theory (IRT) modeling, which captures heterogeneity in items through item parameters and heterogeneity between students through individual student proficiency parameters. The basic forms of IRT used are the three-parameter logistic model (Mislevy et al., 1992) for multiple-choice and other right/wrong items and the generalized partial credit model of Muraki (1992) for items for which more than one score point is possible. Parameters are estimated for sets of homogeneous items by maximum likelihood using the NAEP BILOG/PARSCALE program, which accommodates data in the form of the matrix samples collected (Allen et al., 1998a). A variety of diagnostic checks of these models are carried out, including checks of the homogeneity of the items (unidimensionality), of the goodness of fit of the models to individual items, and of cultural bias suggested by residual subgroup differences for students with similar estimated proficiencies.
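As a rough illustration of the two response models named above, the sketch below writes out the three-parameter logistic and generalized partial credit item response functions in Python. The parameter names and example values are generic IRT notation chosen for illustration, one common parameterization of the generalized partial credit model is assumed, and none of this is NAEP's operational code.

    # Illustrative sketch only: item response functions for the two IRT models
    # used in NAEP scaling, under one common parameterization. Parameter values
    # below are arbitrary examples.
    import math

    def p_3pl(theta, a, b, c, D=1.7):
        """Three-parameter logistic model: probability of a correct response to a
        right/wrong item with discrimination a, difficulty b, and guessing c."""
        return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

    def p_gpcm(theta, a, b, steps, k, D=1.7):
        """Generalized partial credit model: probability of scoring in category k
        (k = 0, ..., m) on an item with m step parameters."""
        def kernel(cat):
            # empty sum for category 0, so exp(0) = 1
            return math.exp(sum(D * a * (theta - b + d) for d in steps[:cat]))
        denom = sum(kernel(cat) for cat in range(len(steps) + 1))
        return kernel(k) / denom

    # A student one standard deviation above the mean of the proficiency scale:
    print(p_3pl(theta=1.0, a=1.2, b=0.3, c=0.20))                   # dichotomous item
    print(p_gpcm(theta=1.0, a=0.9, b=0.0, steps=[-0.5, 0.4], k=2))  # 3-category item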

The IRT models relate main NAEP and trend NAEP items to a set of K scales of unobserved proficiencies (Allen et al., 1998a). Each sampled individual j is assumed to have a latent (K × 1) vector of unobserved proficiencies Θj, the values of which determine the deterministic component of responses to items related to each scale. Given the estimates of the item parameters, the predictive distribution of each student's Θj can be estimated from the observed performance on the items. This predictive distribution is multivariate and is conditioned on the values of fixed background variables characterizing the student. For each student j, five sets of plausible values (Θj1, …, Θj5) are drawn from this predictive distribution. Five sets are drawn so that the uncertainty about the latent proficiencies, given the limited set of test questions, is reflected in the analysis. This step is an application of Rubin's (1987) multiple imputation method for handling missing data and is called the plausible values methodology in the NAEP context (Mislevy, 1985). Once plausible values are imputed for each student on a common scale, inferences can be drawn about the distribution of proficiencies, and proficiencies can be compared between subgroups and over time. For main NAEP, cutscores along the proficiency scales can also be set to reflect levels of performance judged to represent basic, proficient, and advanced achievement.

Statistics of interest, such as proficiency distributions for the current NAEP samples and for subgroups defined by demographic characteristics, can be regarded as functions of aggregates of predicted latent proficiencies and student characteristics g(Θj, yj) for each student j. As in the analysis of many probability surveys, sampled individuals who contribute to the aggregate statistics are weighted to allow for differential inclusion probabilities arising from sample selection, unit nonresponse adjustments, and poststratification. The sampling variance of estimates, initially ignoring uncertainty in the Θj, is computed by jackknife repeated replication, an established method for computing sampling errors from surveys that takes into account the stratification, clustering, and weighting of the complex sample design (Kish and Frankel, 1974). The uncertainty in the Θj is then incorporated by adding a component of imputation variance, based on the variability of the estimates computed from the five sets of plausible values {Θjk: k = 1, …, 5}, to the average jackknife sampling variance of the statistic across those sets. This computation is an application of Rubin's (1987) multiple imputation method.
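A minimal numerical sketch of the variance combination just described is given below, assuming five plausible-value sets and invented estimates and jackknife variances; it simply applies Rubin's (1987) combining rules rather than reproducing NAEP's production software.

    # Illustrative sketch only: combining estimates computed from each set of
    # plausible values using Rubin's (1987) multiple imputation rules. The
    # numbers below are invented, not NAEP results.
    def combine_plausible_values(estimates, sampling_variances):
        """estimates: the statistic computed once per plausible-value set.
        sampling_variances: jackknife sampling variance for each set."""
        m = len(estimates)
        point = sum(estimates) / m                        # final point estimate
        within = sum(sampling_variances) / m              # average sampling variance
        between = sum((e - point) ** 2 for e in estimates) / (m - 1)
        total = within + (1.0 + 1.0 / m) * between        # total variance
        return point, total

    # Five subgroup mean proficiencies (one per plausible-value set) and their
    # jackknife sampling variances.
    means = [276.1, 275.4, 276.8, 275.9, 276.3]
    jackknife_vars = [1.10, 1.05, 1.12, 1.08, 1.07]
    estimate, variance = combine_plausible_values(means, jackknife_vars)
    print(f"mean = {estimate:.1f}, standard error = {variance ** 0.5:.2f}")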

NAEP Reporting

From the program's inception, NAEP has had the goal of reporting results in formats that are accessible to potential users, promote valid interpretations, and are useful to NAEP's varied constituencies. The NAEP program currently produces an impressive array of reports, including:

•  Report Cards. These are the primary reports of the results of main NAEP. Results are presented for the nation, for states (if applicable), for major demographic groups, and in relation to key context variables (e.g., for public and private schools).
•  State Reports. These report results from main NAEP, with a report tailored specifically for each participating state.
•  Focus on NAEP/NAEP Facts. These are two- or four-page mini-reports that summarize NAEP frameworks and assessment results and address topics of current and special interest.
•  Instructional Reports. These show performance data in relation to instructional background variables; they are issued 6 to 12 months after the Report Cards.
•  Focused Reports. These contain NAEP results from the special studies component of main NAEP (e.g., on the performance of English-language learners and students with disabilities or on special features of the assessments). They are also issued 6 to 12 months after the Report Cards.
•  Trends in Academic Progress. This is the primary report of the results of trend NAEP.

This differentiated product line is intended to serve a variety of audiences with differing information needs, interest in the findings, and sophistication in interpreting results.

SELECTED FINDINGS FROM PREVIOUS NAEP EVALUATIONS

Components of Current NAEP

Again, main NAEP and trend NAEP test different student populations and use distinct assessment exercises and administration procedures. The national and state components of main NAEP also use different administration procedures. There is a good deal of sympathy among policy makers, testing experts, and NAEP's evaluators for the need to streamline NAEP's designs (National Academy of Education, 1996, 1997; Forsyth et al., 1996).

NAEP's policy board, the National Assessment Governing Board (NAGB), has expressed concern over the inefficiency of maintaining both main NAEP and trend NAEP and recently announced plans to investigate more efficient design options, saying that "[it] may be impractical and unnecessary to operate two separate assessment programs." The board has called for a "carefully planned transition … to enable the main National Assessment to become the primary way to measure trends in reading, writing, mathematics, and science in the National Assessment program" (National Assessment Governing Board, 1996:10). NAGB also registered concern about the inefficiency and burden imposed on states by separate state and national NAEP data collections. To address this concern for future assessments, NAGB has said that "where possible, changes in national and state sampling procedures shall be made that will reduce burden on states, increase efficiency, and save costs" (National Assessment Governing Board, 1996:7).

Sampling Designs for Current NAEP

As we do later in this chapter, the National Academy of Education (NAE; 1992, 1993, 1996), KPMG Peat Marwick LLP and Mathtech (1996), the Design/Feasibility Team (Forsyth et al., 1996), and others have examined the sampling designs for NAEP. The NAE panel focused on the conduct and results of the state component of main NAEP in 1990, 1992, and 1994, reviewing sampling and administration practices for the state assessments. KPMG Peat Marwick and the Design/Feasibility Team examined both the national and state programs.

The NAE panel found that sampling procedures for the state assessment program were consistent with best practice for surveys of this kind and concluded that sampling and administration were done well for the state program (National Academy of Education, 1996). The panel expressed concern, however, about declining school participation rates as the program progressed and recommended that NAGB and the National Center for Education Statistics (NCES) consider design changes to decrease sample size requirements or otherwise reduce the burden on states, particularly small states. It warned that heavy program requirements might threaten school and state participation rates, particularly in years when multiple subjects and grades are tested, and cautioned that diminished participation in the state program might have deleterious effects on national NAEP. The panel and others have also reviewed school and student sampling for national NAEP and concluded that the national samples are drawn by experienced staff using well-established scientific, multistage stratified probability sampling designs (KPMG Peat Marwick LLP and Mathtech, 1996). As noted earlier, the sampling design for trend NAEP parallels that for national NAEP.

As explained above, NAEP's inclusive frameworks require that a balanced incomplete block design be used for test administration. Although reviewers applaud the ingenuity of the design, some worry about the complexity and fragility of the analytic machinery the design necessitates (National Academy of Education, 1996). The NAEP program has been urged to explore alternatives for simplifying the design. The NAE panel warned that the frameworks for main NAEP push the limits of form design and may strain current methods, particularly in light of recent pressure to hasten scaling, analysis, and reporting. Reviewers point to anomalies in NAEP findings as indicators of design stress and call for research to develop a more streamlined design for the assessment (U.S. General Accounting Office, 1993; National Academy of Education, 1993; Hedges and Venesky, 1997).

FIGURE 2-5 Mean NAEP mathematics scores (eighth grade and age 13).

FIGURE 2-6 Mean NAEP mathematics scores (twelfth grade and age 17).

Of the 12 most obvious data comparisons across grades and disciplines (the three grades and four time periods: 1990 to 1992 and 1992 to 1996 in mathematics, and 1988 to 1990 and 1992 to 1994 in reading), four show similar results on national NAEP and trend NAEP and eight show dissimilar results.[1] Replication is useful for uncovering methodological inconsistencies, but it is not obvious what conclusions can be drawn from discordant results across somewhat different designs.

Meaningfulness of Trend Frameworks

Trend NAEP is designed to keep changes in design, administration, and questions to a minimum. The anomalies in the reading results for the 1986 and 1994 NAEP (Zwick, 1991; Hedges and Venesky, 1997) demonstrated that very modest changes in data collection and assessment procedures can have unexpectedly large effects on assessment results. Analyses of the 1986 incident, including a set of randomized experiments built into the subsequent (1988) assessment, led measurement specialists to conclude that if you want to measure change, don't change the measure. Despite the obvious wisdom of this approach in the short run, it may have drawbacks over longer periods. It is not inconceivable that, held constant for long periods of time, frameworks become increasingly irrelevant by failing to reflect changes in curricula and instructional practice. An increasing gap between assessment and practice could make estimated trends from assessments built to old frameworks potentially misleading.

We examined this assertion in an attempt to push our thinking about design alternatives. We commissioned research to assess the relevance of NAEP trend items to current standards and instructional practice (Zieleskiewicz, 1999). Middle school teachers and disciplinary specialists were asked to examine a set of trend NAEP materials and main NAEP items to determine their relevance to current curriculum and instruction in mathematics and science. Respondents were asked about the extent to which students currently have opportunities to master the knowledge and skills addressed by the items. They also relayed their perceptions of the relevance of the trend NAEP and main NAEP items to national disciplinary standards. Zieleskiewicz sought the views of teachers in states on the vanguard of standards-based reform and in a randomly selected group of states. She also surveyed disciplinary specialists active in mathematics and science reform at the national level. The resulting data are described and summarized in a volume of papers commissioned to inform our evaluation (National Research Council, 1999).

[1] Two of the inconsistencies in national NAEP and trend NAEP data may be attributable to anomalies in the 1994 reading results for grades 4 and 12.

The data show that teachers and disciplinary specialists rated trend NAEP and main NAEP items similarly on students' opportunity to learn the tested knowledge and skills and on their relevance to current curricula and national standards. That is, in this trial and on these dimensions, disciplinary specialists and middle school mathematics and science faculty did not distinguish between trend items and items written to current frameworks. We do not know whether similar data would result for other grade levels in mathematics or science or for other subject areas, but for this grade and these subjects, trend and main NAEP items are similarly aligned with current practice and standards. The findings run counter to the common presumption that trend instrumentation is dated, and they are consistent with arguments for developing and maintaining a single, streamlined trend line for current NAEP.

Costliness and Burden

As we have said, the current NAEP designs involve separate samples, tests, and data collection procedures. This practice is costly, since it constitutes essentially three different data collection programs. Past evaluators have discussed direct costs and attempted to estimate indirect costs for the state and national designs (KPMG Peat Marwick LLP and Mathtech, 1996). Currently, assessment of two subjects at two grades by state NAEP is nearly as expensive as testing two subjects at three grades by national NAEP. In addition, the separate data collections place a burden on small and low-population states and large districts that may have had a deleterious effect on participation. Additional inefficiencies are associated with ongoing administration of assessments for every trend line the NAEP program supports. As currently configured, every cycle of trend NAEP administration, analysis, and reporting adds $4,000,000 to NAEP program costs.

Merging the Main NAEP and Trend NAEP Designs

Many assert that maintaining a statistical series is the most important thing NAEP does (Forsyth et al., 1996), and we agree that this should remain a major priority in the future. However, the current means for achieving this goal are inefficient and not without problems, as discussed above. It is the committee's judgment that trend and main NAEP should be reconfigured to allow accurate and efficient estimation of trends. Our conception of a combined design would accord main NAEP the more stable characteristics of trend NAEP in repeated administrations over 10- to 20-year time spans. The main objective would be to minimize the design flux that has characterized main NAEP, with the goal that it reliably assess not only current levels but also trends in core subject areas. This proposal is consistent with the ideas about NAEP's redesign offered by NAGB (National Assessment Governing Board, 1997), the NAGB Design/Feasibility Team (Forsyth et al., 1996), and the NAE panel (National Academy of Education, 1997).

In a paper published in a volume that accompanies this report, Michael Kolen (1999a) offered a number of suggestions for phasing out the current trend data collection and continuing with main NAEP while maintaining a long-term trend line. As background for his proposals, Kolen discussed differences between the assessments, including variation in content, operational procedures, examinee subgroup definitions, analysis procedures, and results.

In cataloguing differences between the two designs, Kolen explained that the content specifications for trend NAEP were developed and have been stable since 1983/1984 for reading and writing and since 1985/1986 for mathematics and science, whereas the frameworks for main NAEP have evolved. He noted that trend NAEP has a higher proportion of multiple-choice items relative to constructed-response items than does main NAEP. In main NAEP, he said, students are given test items in a single subject area, whereas in trend NAEP students are tested in more than one subject area. Kolen also explained that main NAEP oversamples minority students to permit subgroup comparisons, but trend NAEP does not. Subgroup definitions also differ for the two designs: main NAEP identifies students' race and ethnicity from multiple sources, giving priority to student-reported information, whereas trend NAEP uses administrators' observations to designate students' race. Kolen further noted the difference between grade-based sampling for main NAEP and age-based sampling and reporting for trend NAEP.

After recounting these differences, Kolen presented five designs for estimating long-term trends with NAEP and laid out the statistical assumptions, linking studies, and research required to develop and support them. In one design, Kolen proposed monitoring long-term trends with the main NAEP assessment, using overlapping NAEP assessments first to link main NAEP to trend NAEP and then to link successive assessments whenever assessment frameworks and/or designs are modified. He explained that implementing this design relies on research to estimate the effects of differences between subgroup and cohort definitions and administration conditions on main NAEP and trend NAEP. Research to examine the effects of content differences and differences in item types for trend NAEP, main NAEP, and successive assessments would also be needed. Linking and scaling research would be needed initially to place main NAEP results on the trend scale (or trend results on the main scale) and, again, to continue the trend line as NAEP evolves. Because long-term trends would be assessed with main NAEP in this design, main NAEP must be more stable than it has been in the past, Kolen explained.
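As a rough illustration of what such a linking step involves, the sketch below applies a simple linear (mean/standard deviation) linking of one scale to another, using invented summary statistics from a hypothetical overlapping administration; the actual NAEP linking studies Kolen describes would be considerably more elaborate.

    # Illustrative sketch only: linear linking of a "new" scale to an "old" scale
    # from an overlapping administration in which equivalent samples are assessed
    # on both instruments. The means and standard deviations are invented.
    def linear_link(mean_old, sd_old, mean_new, sd_new):
        """Return a function that maps scores on the new scale onto the old scale
        by matching the mean and standard deviation observed in the overlap."""
        slope = sd_old / sd_new
        intercept = mean_old - slope * mean_new
        return lambda score_new: slope * score_new + intercept

    # Hypothetical overlap results for the old (trend) and new (revised) scales.
    to_old_scale = linear_link(mean_old=257.0, sd_old=36.0,
                               mean_new=150.0, sd_new=32.0)

    # A later result reported on the new scale, expressed on the old trend scale
    # so the long-term trend line can be continued.
    print(round(to_old_scale(154.0), 1))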

In another design, Kolen suggested allowing main NAEP to change to reflect current curricula and using a separate trend assessment, with occasional updating, to maintain a trend line. Under this design, modest changes in the content of trend NAEP would be allowed to ensure its relevance, but the operational conditions of the assessment would remain constant. Some items in the trend instruments could be replaced, and alternate forms of the trend instruments would be equated. The design would continue to provide long-term trend estimates without an extensive research program, but it requires the continuation of both assessments. For Kolen's discussion of these and alternative models, see the volume of research papers that accompanies this report.

Assessing NAEP Disciplines

It is important to note that our proposals for merging trend NAEP and main NAEP are limited to the large-scale assessments in reading, writing, mathematics, and science. We discuss this construction in greater detail below but, again, note that assessment of these disciplines using large-scale assessment methods is part of the core NAEP component of our proposal for new paradigm NAEP (see Chapter 1). If history, geography, or other disciplines are assessed frequently enough in the large-scale survey program to support trend estimation, they too would constitute core NAEP, but tracking trends back to the 1970s and 1980s would not be possible in these subjects. As we stated in Chapter 1, disciplines for which testing frequency generally prohibits the establishment of trend lines should be addressed using multiple assessment methods rather than as components of the NAEP large-scale assessment program. This approach has two possible advantages: (1) by reducing scale and releasing resources, it enables more in-depth treatment of these subject areas and the teaching and learning opportunities that define them, and (2) it affords more frequent measurement and trend estimation for core disciplines and may allow more thorough reporting in these subjects. We include the assessment of noncore subjects in our proposal for multiple-methods NAEP; Chapter 4 provides further discussion of this and other components of multiple-methods NAEP.

High School Testing

A number of conditions point to insufficient clarity about the meaning of results for high school examinees under the current designs. First, test administrators observe that some high school examinees do not make a serious effort to answer NAEP questions, rendering their scores of questionable value. The administrators' observations are corroborated by the high omit and noncompletion rates of 17-year-olds on trend NAEP and of seniors on national NAEP; nonresponse rates are particularly high on the constructed-response items of national NAEP. Despite concerted effort to date, the NAEP program and its stakeholders have been unable to identify workable incentives for high school students' participation and effort. Second, high school curricula and course-taking patterns are sufficiently variable that it is difficult to render judgments about students' opportunity to learn tested content, particularly for older high school students.

Finally, differential dropout rates muddy the interpretation of high school results across locales and over time. Differing school-leaving rates over time make the meaning of score changes unclear, and the same logic applies to cross-state comparisons.

In the committee's judgment, NAGB and NCES should explore alternatives to the current assessment practices for twelfth graders. Testing high school students at an earlier grade (grade 10 or 11) or using longitudinal studies as the primary source of achievement data for high school students, with assessments still tied to NAEP frameworks, may bear consideration. In fact, NCES recently proposed a follow-up data collection on NAEP twelfth graders to study their postsecondary plans and opportunities. Following up on high school dropouts to include them in the NAEP samples should also be considered. Assessing high school students using multiple measurement methods—in smaller settings and perhaps with more engaging tasks—may moderate current motivation problems. Multiple-methods assessment may also permit collection of richer data on students' high school experiences and their plans for work, higher education, and the military. A shift to this strategy should occur in conjunction with the implementation of a new series of frameworks and assessments; otherwise, the current main NAEP short-term trend lines for high school seniors would be disrupted.

Streamlining the National and State Designs

National NAEP and state NAEP use the same instrumentation but differ in the populations for which inferences are to be made. If NAEP were being designed today, the idea of distinct samples and administration procedures for state and national estimates would no doubt be rejected in favor of a single design that addresses both populations. Declining participation rates and the arguments about burden and inefficiency mentioned earlier suggest the need to coordinate the designs for national and state NAEP. In 1996 the NAE panel recommended that the scope and function of the state assessment program be reviewed in the context of an overall reevaluation of the NAEP program; at the same time, it noted that state NAEP is an important component of the NAEP program and recommended that it move beyond developmental status. We agree with this assessment and recommend that the state component be accorded permanent status in the next congressional reauthorization of NAEP. As state NAEP moves from trial to permanent status, it makes sense to consider streamlining the national and state designs.

NAEP historically has been successful at garnering participation in the state assessment program. State commitment to the 1998 program, however, declined in relation to earlier assessments: fewer states signed up for 1998 testing than participated in 1996. The NAEP program suspects the decrease is attributable to increasingly heavy state and local testing requirements.

Without a mandate to participate in NAEP and without local feedback from NAEP testing, state and district testing directors may accord state NAEP lower priority than other assessments (Kelly Weddel, National Center for Education Statistics, personal communication, April 8, 1998).

Separate state and national testing is costly, since it makes national NAEP and state NAEP essentially two different data collection programs. Recall that the state program costs nearly as much as the national assessment while testing fewer grades. As discussed by Rust (1996), a more coordinated design for the two components was considered by the contractors for the 1996 assessment, but it was rejected because of operational concerns involving equating and the choice of subjects and grades assessed. Despite this, in our view it may be possible to combine these two programs into a single design. Several differences between the current state and national designs merit attention in any discussion of their possible combination.

First, state NAEP and national NAEP could be combined only if both assess the same grades and subjects. State NAEP has assessed only fourth and eighth graders in mathematics, science, reading, and, in 1998, writing, and there appears to be little interest among the states in a state NAEP assessment of twelfth graders (DeVito, 1996). The coordination of state and national NAEP assessment cohorts and subjects is a solvable problem, however. For example, a combined program could assess reading, writing, mathematics, science, and any other subjects designated as core in grades 4 and 8, and high school testing could continue with a national sample.

A second difference between current state and national NAEP is that the administration of national NAEP is carried out by a NAEP contractor, whereas the administration of state NAEP is carried out by school personnel, with training and monitoring (on a sampling basis) by a NAEP contractor. The use of school personnel for test administration is substantially less costly (at least in terms of direct costs to NAEP) than the use of a NAEP contractor for that purpose. However, the difference in procedures raises questions about the comparability of data derived from the two data collection procedures. Differences may be attributable to the actions of the test administrators, or they may be due to the potentially greater motivation associated with a test that yields scores for a state rather than for the nation. Spencer (1997a) recently concluded that an equating adjustment may be necessary to bring estimates from data collected under state NAEP conditions into conformity with those from data collected under national NAEP conditions. He notes that comparisons of item responses in state and national NAEP showed that scores were generally higher in state NAEP than in a subsample of national data comparable to the state data. The average differences were small enough to be attributable to sampling error (that is, reasonably consistent with the hypothesis of no true difference) in 1992, but not in 1990 or 1994 (Hartka and McLaughlin, 1994; Hartka et al., 1997a). For example, in 1994 the difference between state NAEP and a comparable subset of national NAEP in percent correct on a common set of items was 3.1 percentage points (56.0 percent versus 52.9 percent), which is substantial.
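To make the "attributable to sampling error" judgment concrete, the sketch below compares the two percent-correct figures using hypothetical standard errors (the published standard errors are not reproduced here); it illustrates the reasoning rather than reanalyzing the NAEP data.

    # Illustrative sketch only: is a state-vs-national difference in average
    # percent correct larger than sampling error would explain? The standard
    # errors below are invented for illustration.
    def difference_z(state_pct, national_pct, se_state, se_national):
        """Difference and z statistic for two independent estimates."""
        diff = state_pct - national_pct
        se_diff = (se_state ** 2 + se_national ** 2) ** 0.5
        return diff, diff / se_diff

    diff, z = difference_z(state_pct=56.0, national_pct=52.9,
                           se_state=0.6, se_national=0.9)
    print(f"difference = {diff:.1f} percentage points, z = {z:.1f}")
    # A |z| well above 2 is hard to attribute to sampling error alone, which is
    # the sense in which the 1994 difference was judged substantial.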

Furthermore, differences between the percents correct observed under state NAEP and national NAEP conditions were not uniform across states. This and other research suggests that sizable calibration samples may be needed to adjust or equate estimates derived from current state NAEP to make them comparable to those from national NAEP. It is unclear whether calibration samples would be necessary in every state in which national NAEP data would be derived from state NAEP administrations. The need for calibration samples would reduce the cost savings and sampling efficiencies gained by combining state and national NAEP. Hence, a goal of a coordinated design would be to avoid the need for calibration samples by minimizing differences in administration for the state and national NAEP samples. This option seems preferable to making analytical adjustments for the effects of differences in administration before data from different administration conditions are combined.

A third difference between state and national NAEP may be in levels of nonsampling errors. Nonsampling errors are, in general, difficult to analyze or even to detect. However, in the 1994 assessment there appeared to be some differences in the rates of school nonparticipation in state and national NAEP at the fourth grade (Hartka et al., 1997b). The implications for bias are unclear (Spencer, 1997a, 1997b); to some extent, they depend on how well the mechanisms used to adjust for the effects of nonparticipation (namely, substitution and reweighting) eliminate bias. Some research suggests that these mechanisms have worked reasonably well in NAEP (Hartka et al., 1997b).

Specific suggestions for streamlining the national and state designs rely on additional research. More needs to be known about the effects of the differences in participation rates, administration, and other potential sources of bias. In a paper in the volume that accompanies this report, Kolen (1999b) recounted design alternatives proposed by Spencer (1997a) and Rust and Shaffer (1997). The alternatives vary in sampling approaches, administration procedures, and analytic adjustments. In proposing next steps, Kolen laid out research questions that must be answered in any attempt to streamline NAEP designs:

•  To what extent are the linking constants equal across states? Differences among states in ability, participation rates, and recruitment procedures should be investigated as variables that might influence the linking constants.
•  How large is the random error component in estimating the linking constants?
•  To what extent does bias or systematic error influence the linking constants?
•  Do the differences in administration and recruitment conditions affect the constructs that are being measured by the NAEP assessments?

These questions should be thoroughly addressed before any design for combining the national and state NAEP samples is implemented under current recruitment and administration conditions.

SUMMARY OF PROPOSED DESIGN FEATURES

A number of characteristics distinguish our proposal for a new paradigm NAEP:

•  Trends in reading, writing, mathematics, science, and other subjects for which there are sufficient resources would be estimated by core NAEP using large-scale assessment methods (separate testing for trend NAEP and main NAEP would be discontinued).
•  National and state estimates would be reported by core NAEP, but efficiency in sampling and reduction in testing burden would be realized for the two designs.
•  For subjects for which administration frequency generally prohibits the establishment of trend lines, testing would occur at the national level using multiple measurement methods. (Multiple-methods NAEP is described in Chapter 4.)

Figure 2-7 shows new paradigm NAEP as we have discussed it.

MAJOR CONCLUSIONS AND RECOMMENDATIONS

Conclusions

Conclusion 2A. The existence of multiple NAEP assessments is confusing and creates problems of burden, costliness, and inconsistent findings.

Conclusion 2B. The current collection of meaningful NAEP data in the twelfth grade is problematic given the insufficient motivation of high school seniors and their highly variable curricula and dropout rates.

Conclusion 2C. Because of its complexity and the many demands of its constituents, NAEP has developed multiple, dissociated reporting metrics and types of reports.

FIGURE 2-7 Measures of student achievement, including new paradigm NAEP. NOTE: TIMSS = Third International Mathematics and Science Study; NELS = National Education Longitudinal Study; ECLS = Early Childhood Longitudinal Study.

Recommendations

Recommendation 2A. For reading, writing, mathematics, and science, combine main NAEP and trend NAEP into a single design that preserves the measurement of trends and allows periodic updating of frameworks. If resources allow, trends could be established in other subject areas.

Recommendation 2B. In those disciplines for which testing frequency generally prohibits the establishment of trend lines, assessment of student achievement should be accomplished using a variety of assessment methods and targeted student samples.

Recommendation 2C. Alternatives to current NAEP assessment practices for twelfth graders should be explored, including testing at grade 10 or 11, following up on high school dropouts to include them in NAEP's samples, and gathering data on the achievement of high school students primarily through NCES's longitudinal surveys.

Recommendation 2D. Coordinate the sampling and administrative procedures for national and state NAEP in order to reduce burden and decrease costs.

Recommendation 2E. The development of clear, comprehensible, and well-integrated reports of NAEP results should remain a high priority, and reports should be redesigned to reflect NAEP's streamlined designs.

Recommendation 2F. In order to accomplish the recommendations listed above, NAEP's research and development agenda should emphasize the following:

•  Estimation of the effects of differences in sample definition, content, task types, and administration procedures for trend NAEP and main NAEP, with subsequent derivation of links to support the use of a single trend line in each discipline;
•  Estimation of the effects of differences in participation rates, administration procedures, and bias for state and national NAEP, with subsequent development of more efficient sampling procedures;
•  Exploration of alternatives for obtaining meaningful data from high school students; and
•  Development of clear, comprehensible reports and reporting metrics that provide descriptive, evaluative, and interpretive information in a carefully articulated and described report series.