
2

Streamlining the Design of NAEP

Summary Conclusion 2. Many of NAEP's current sampling and design features provide important, innovative models for large-scale assessments. However, the proliferation of multiple independent data collections—national NAEP, state NAEP, and trend NAEP—is confusing, burdensome, and inefficient, and it sometimes produces conflicting results.

Summary Recommendation 2. NAEP should reduce the number of independent large-scale data collections while maintaining trend lines, periodically updating frameworks, and providing accurate national and state-level estimates of academic achievement.

INTRODUCTION

NAEP provides important information about the academic achievement of America's youth, and the assessment has many strong design features. For example, NAEP's sampling, scaling, and analysis procedures serve as important models for the measurement community. The frameworks and innovative assessment materials serve as guides for state and local standards and assessment programs, and state NAEP results provide a useful backdrop for state and local assessment data.

In this chapter, we describe and evaluate NAEP's current sampling, data collection, analysis, and reporting methods. As background, we review the current NAEP assessments, the sampling designs and analysis methods used, and the reports generated. We then briefly review the findings of previous evaluations and provide our own evaluation of the strengths and weaknesses of the current design. Our conclusions lead us to recommend strengthening NAEP's design and increasing its usefulness. We argue for reducing the number of independent large-scale data collections currently carried out. We discuss and provide proposals for:

  • Combining the trend NAEP and main NAEP designs in core subjects to preserve measurement of trends and allow updating of frameworks;

  • Using more efficient sampling procedures for national NAEP and state NAEP in order to reduce the burden on states and schools, decrease costs, and potentially improve participation rates;

  • Using multiple assessment methods to assess subject areas for which testing frequency generally prohibits the establishment of trend lines;

  • Exploring alternatives to the current assessment of twelfth graders by NAEP with the goal of minimizing bias associated with differential dropout rates and the differing course-taking patterns of older students, encouraging student effort, and expanding assessment domains to include problem solving and other complex skills critical to the transition to higher education, the workplace, and the military; and

  • Improving NAEP reports by providing (1) descriptive information about student achievement, (2) evaluative information to support judgments about the adequacy of student performance, and (3) contextual, interpretive information to help users understand students' strengths and weaknesses and better investigate the policy implications of NAEP results.

OVERVIEW OF NAEP'S CURRENT SAMPLING, DATA COLLECTION, ANALYSIS, AND REPORTING PROCEDURES

The National Assessment of Educational Progress is mandated by Congress to survey the academic accomplishments of U.S. students and to monitor changes in those accomplishments over time. Originally, NAEP surveyed academic achievement and progress with a single assessment; it has evolved into a collection of assessments that now includes the trend NAEP and main NAEP assessments. Main NAEP has both national and state components. National NAEP includes the large-scale survey assessments and a series of special studies that are not necessarily survey-based. Special studies generally focus on specific portions of NAEP's subject domains and on the associated teaching and learning data. Current NAEP is described in the Introduction; Figure I-1 shows the components of the current program.

Components of Current NAEP

The primary objective of trend NAEP is to provide trend lines of educational achievement for the U.S. population and major population subgroups over extended time periods. To avoid disruptions in trend lines caused by differences in NAEP administration or content, administration procedures and assessment items for trend NAEP are held as constant as possible over time.

Main NAEP is a larger assessment program than trend NAEP; it provides more precise estimates of educational achievement in population subgroups, includes more contextual variables, and is based on frameworks that are updated on a regular basis to reflect changes in curriculum and pedagogical thought. Again, main NAEP includes both national and state components. The state data collections are structured to provide estimates with adequate degrees of precision for individual states.

Tables 2-1 through 2-3 summarize the administrations of current NAEP since 1984, with assessments based on the same frameworks indicated by the same symbol and joined by lines to indicate whether trend estimation is feasible. Note that, in addition to the trend lines established using trend NAEP, short-term trend lines for main NAEP have been established in reading in national NAEP (grades 4, 8, and 12) and state NAEP (grade 4) from 1992 to 1998. Short-term trend lines from 1990 to 1996 have also been established in mathematics in national NAEP (grades 4, 8, and 12) and state NAEP (grade 8; and for 1992-1996, grade 4). However, as noted previously, the short-term trend lines of national NAEP and state NAEP reflect different assessment materials and student samples than does trend NAEP.

NAEP's multiple assessment programs evolved to preserve trend lines while allowing NAEP frameworks to be updated and to obtain state-level NAEP estimates in main NAEP. The distinct programs allow the objectives of each component to be achieved without compromising the aims of the others. However, it may be unnecessary to have separate assessment programs with such similar objectives. Later in this chapter, we consider whether there is a compelling need for distinct assessment programs or whether these activities could be merged.

Sampling Designs for Current NAEP

The NAEP program differs fundamentally from other testing programs in that its objective is to obtain accurate measures of academic achievement for populations of students rather than for individuals. This goal is achieved using innovative sampling, scaling, and analysis procedures. We discuss these procedures next. Note that their description and evaluation rely on technical terminology that is difficult to translate into nontechnical terms; technical language is used in this chapter in a way that is atypical of the remainder of this report.

TABLE 2-1 NAEP Frameworks, Designs, and Samples by Discipline and Year Tested: Reading and Writing

TABLE 2-2 NAEP Frameworks, Designs, and Samples by Discipline and Year Tested: Science and Mathematics

TABLE 2-3 NAEP Frameworks, Designs, and Samples by Discipline and Year Tested: Geography, History, and Civics

NAEP tests a relatively small proportion of the student population of interest using probability sampling methods. Constraining the number of students tested allows resources to be devoted to ensuring the quality of the test itself and its administration, resulting in considerably better estimates than would be obtained if all students were tested under less controlled conditions. The use of sampling greatly reduces the burden placed on students, states, and localities in comparison to a national testing program that tests a substantial fraction of the nation's children.

The national samples for main NAEP are selected using stratified multistage sampling designs with three stages of selection. The samples since 1986 include 96 primary sampling units, each consisting of a metropolitan statistical area (MSA), a single non-MSA county, or a group of contiguous non-MSA counties. About a third of the primary sampling units are sampled with certainty; the remainder are stratified, and one is selected from each stratum with probability proportional to size. The second stage of selection consists of public and nonpublic schools within the selected primary sampling units. For the elementary, middle, and secondary samples, independent samples of schools are selected with probability proportional to measures of size. In the final stage, 25 to 30 eligible students per school are sampled systematically, with probabilities designed to make the overall selection probabilities approximately constant, except that more students are selected from small subpopulations, such as private schools and schools with high proportions of black or Hispanic students, to allow estimates with acceptable precision for these subgroups. In 1996 nearly 150,000 students were tested from just over 2,000 participating schools (Allen et al., 1998a).
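
To make the probability-proportional-to-size (PPS) step concrete, the following is a minimal sketch, in Python, of selecting one primary sampling unit per stratum with probability proportional to a size measure. The strata, PSU names, and size figures are hypothetical illustrations, not NAEP's operational sampling frame or code.

```python
import random

def select_psu_per_stratum(strata, seed=0):
    """Select one primary sampling unit (PSU) per stratum with probability
    proportional to a size measure (e.g., enrollment). `strata` maps a
    stratum label to a list of (psu_name, size) pairs. Hypothetical
    illustration, not NAEP's operational sampling code."""
    rng = random.Random(seed)
    selected = {}
    for stratum, psus in strata.items():
        names = [name for name, _ in psus]
        sizes = [size for _, size in psus]
        # rng.choices draws with probability proportional to the weights
        selected[stratum] = rng.choices(names, weights=sizes, k=1)[0]
    return selected

# Toy frame: two strata of non-certainty PSUs (names and sizes are invented)
frame = {
    "Northeast/metro": [("MSA-A", 120_000), ("MSA-B", 80_000)],
    "South/nonmetro": [("County group C", 15_000), ("County group D", 25_000)],
}
print(select_psu_per_stratum(frame))
```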

The sampling design for state NAEP has only two stages of selection (schools and students within schools), since clustering of the schools within states is not necessary for economic efficiency (Allen et al., 1998b). In 1996, approximately 2,000 students in 100 schools were assessed in each state for each grade. Special procedures were used in states with many small schools for reasons of logistical feasibility.

The national and state designs limit students to one hour of testing time, since longer test times are thought to impose an excessive burden on students and schools. This understandable constraint limits the ability to ask enough questions in the NAEP subject areas to yield accurate assessments of ability for individual students or for subareas within a discipline. Time limits and NAEP's expansive subject-area frameworks have led to students receiving different but overlapping sets of NAEP items, using a form of matrix subsampling known as balanced incomplete block spiraling. The data matrix of students by test questions formed by this design is incomplete, which complicates the analysis. The analysis is currently accomplished by assuming an item response theory model for the items and drawing multiple plausible values of the ability parameters for sampled students from their predictive distribution given the observed data (Allen et al., 1998a).
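
As an illustration of the balanced incomplete block spiraling just described, the sketch below arranges 7 hypothetical item blocks into 7 booklets of 3 blocks each so that every pair of blocks appears together exactly once, and spirals the booklets across students. This is a textbook BIB layout offered only as an example; NAEP's operational block and booklet structure differs.

```python
from itertools import combinations, cycle
from collections import Counter

# A classic balanced incomplete block layout: 7 item blocks arranged into
# 7 booklets of 3 blocks each so that every pair of blocks appears together
# in exactly one booklet. Illustrative only, not NAEP's operational design.
BOOKLETS = [
    (1, 2, 3), (1, 4, 5), (1, 6, 7),
    (2, 4, 6), (2, 5, 7), (3, 4, 7), (3, 5, 6),
]

# Check the "balanced" property: every pair of blocks co-occurs exactly once,
# which lets relationships between blocks be estimated from the spiraled data.
pair_counts = Counter(pair for b in BOOKLETS for pair in combinations(sorted(b), 2))
assert all(count == 1 for count in pair_counts.values())
assert len(pair_counts) == len(list(combinations(range(1, 8), 2)))

def spiral_assign(student_ids):
    """Assign booklets to students in spiraled (cyclic) order so booklets are
    distributed roughly evenly within a school or session."""
    return dict(zip(student_ids, cycle(BOOKLETS)))

print(spiral_assign([f"student_{i}" for i in range(10)]))
```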

The school and student sampling plan for trend NAEP is similar to the design for national NAEP. Schools are selected on the basis of a stratified, three-stage sampling plan with counties or groups of contiguous counties defined by region and community type and selected with probabilities proportional to size. Public and nonpublic schools are then selected. In stage three, students within schools are randomly selected for participation. Within schools, students are randomly assigned to either mathematics/science or reading/writing assessment sessions, with item blocks assigned using a balanced, incomplete design. In 1996, between 3,500 and 5,500 students were tested in mathematics and science and between 4,500 and 5,500 in reading and writing (Campbell et al., 1997).

Analysis Methods for Current NAEP

Standard educational tests generally involve a large enough set of items to allow an individual student's proficiency on a tested topic to be captured with minor error from a simple summary, such as a total score or average test score. Since everyone takes the same test (or, if different versions are used, the alternatives are carefully designed to be parallel), scores from different students can be compared directly and distributions of ability estimated. These simple approaches to analysis do not work well for the NAEP assessments: the individual test forms are short and contain relatively heterogeneous items, and only in combination do the multiple forms cover the NAEP subject areas adequately. As a result, simple summary scores for NAEP have sizable measurement error, and scores from different students can vary significantly because of differences in the items appearing on individual test forms.

The analysis for main NAEP and trend NAEP needs a glue in order to patch together results from heterogeneous forms assigned to heterogeneous students into clear pictures of educational proficiency. The glue of current NAEP analysis is supplied by item response theory (IRT) modeling, which captures heterogeneity in items through item parameters and heterogeneity between students through individual student proficiency parameters. The basic forms of IRT used are the three-parameter logistic model (Mislevy et al., 1992) for multiple-choice or other right/wrong items and the generalized partial credit model of Muraki (1992) for items for which more than one score point is possible. Parameters are estimated for sets of homogeneous items by the statistical principle of maximum likelihood using the NAEP BILOG/PARSCALE program, which accommodates data in the form of the matrix samples collected (Allen et al., 1998a). A variety of diagnostic checks of these models are carried out, including checks of the homogeneity of the items (unidimensionality), goodness of fit of the models to individual items, and checks for cultural bias suggested by residual subgroup differences among students with similar estimated proficiencies.
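
For readers unfamiliar with these models, the sketch below writes out the three-parameter logistic item response function and one common parameterization of the generalized partial credit model. The item parameters shown are invented for illustration, and the code is a simplified sketch rather than the operational scaling software.

```python
import math

def p_correct_3pl(theta, a, b, c):
    """Three-parameter logistic (3PL) item response function: probability that
    a student with proficiency theta answers a dichotomous item correctly,
    given discrimination a, difficulty b, and lower asymptote ("guessing") c.
    Some parameterizations include a scaling constant D = 1.7; it is omitted
    here for simplicity."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def gpcm_probs(theta, a, b, steps):
    """One common parameterization of the generalized partial credit model for
    an item scored 0..len(steps): returns the probability of each score
    category given proficiency theta, discrimination a, location b, and step
    parameters `steps`."""
    logits = [0.0]
    for d in steps:
        logits.append(logits[-1] + a * (theta - b + d))
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical items: a multiple-choice item and a 3-category constructed response
print(p_correct_3pl(theta=0.5, a=1.2, b=0.3, c=0.2))
print(gpcm_probs(theta=0.5, a=1.0, b=0.0, steps=[0.4, -0.4]))
```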

The IRT models relate main NAEP and trend NAEP items to a set of K scales of unobserved proficiencies (Allen et al., 1998a). Each sample individual j is assumed to have a latent (K × 1) vector of unobserved proficiencies Θj, the values of which determine the deterministic component of responses to items related to each scale. Given the estimates of item parameters, the predictive distribution of each individual student's Θj can be estimated based on the observed performance on the items. This predictive distribution is multivariate and conditioned on the values of fixed background variables characterizing the student. For each student j, five sets of plausible values (Θj1,…,Θj5) are drawn from this predictive distribution. Five sets are drawn to allow the uncertainty about the latent proficiencies, given the limited set of test questions, to be reflected in the analysis. This step is an application of Rubin's (1987) multiple imputation method for handling missing data and is called the plausible values methodology in the NAEP context (Mislevy, 1985). Once plausible values are imputed for each individual student on a common scale, inferences can be drawn about the distribution of proficiencies, and proficiencies can be compared between subgroups and over time. For main NAEP, cutscores along the proficiency scales can also be determined to reflect levels of performance that are judged to represent basic, proficient, and advanced achievement.
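
The following sketch illustrates the plausible values idea for a single student on a single scale: an approximate posterior for proficiency is formed from a 3PL likelihood of the observed responses and a normal prior standing in for the conditioning model, and five values are drawn from it. The grid approximation, the item parameters, and the prior settings are all simplifications introduced for illustration.

```python
import math
import random

def draw_plausible_values(responses, item_params, prior_mean, prior_sd,
                          n_draws=5, seed=0):
    """Draw plausible values of one student's latent proficiency theta from an
    approximate posterior: a 3PL likelihood for the observed item responses
    times a normal prior standing in for the conditioning model on background
    variables. Grid approximation for illustration only; operational NAEP uses
    a multivariate conditioning model and different numerics."""
    rng = random.Random(seed)
    grid = [i / 10.0 for i in range(-40, 41)]          # theta from -4 to 4
    weights = []
    for theta in grid:
        log_post = -0.5 * ((theta - prior_mean) / prior_sd) ** 2
        for (a, b, c), x in zip(item_params, responses):
            p = c + (1 - c) / (1 + math.exp(-a * (theta - b)))
            log_post += math.log(p if x == 1 else 1 - p)
        weights.append(math.exp(log_post))
    return rng.choices(grid, weights=weights, k=n_draws)

# Hypothetical student: four dichotomous items, three answered correctly
items = [(1.0, -0.5, 0.2), (1.2, 0.0, 0.2), (0.8, 0.5, 0.25), (1.1, 1.0, 0.2)]
print(draw_plausible_values([1, 1, 1, 0], items, prior_mean=0.2, prior_sd=0.9))
```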

Statistics of interest, such as proficiency distributions for the current NAEP samples and for subgroups defined by demographic characteristics, can be regarded as functions of aggregates of predicted latent proficiencies and student characteristics g(Θj,yj) for each student j. As in the analysis of many probability surveys, sampled individuals who contribute to the aggregate statistics are weighted to allow for differential inclusion probabilities arising from sample selection, unit nonresponse adjustments, and poststratification. The sampling variance of estimates, initially ignoring uncertainty in the Θj, is computed by jackknife repeated replication, an established method for computing sampling errors from surveys that takes into account the stratification, clustering, and weighting of the complex sample design (Kish and Frankel, 1974). The uncertainty in the Θj is then incorporated by adding a component of imputation variance, based on the variability of the estimates computed from the different sets of plausible values {Θjk: k = 1, …, 5}, to the average of the jackknife sampling variances of the statistic computed for each set of plausible values. This computation is an application of Rubin's (1987) multiple imputation method.
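
The variance combination described above can be summarized in a few lines: compute the statistic and its jackknife sampling variance on each set of plausible values, then add the average sampling variance to (1 + 1/M) times the between-set variance of the point estimates. The sketch below assumes the per-set estimates and jackknife variances are already available; the numbers are hypothetical.

```python
def combine_plausible_value_results(estimates, sampling_variances):
    """Combine a statistic computed separately on each of the M plausible-value
    sets, following the multiple-imputation rules used in NAEP variance
    estimation: point estimate = mean of the per-set estimates; total variance
    = average (jackknife) sampling variance + (1 + 1/M) * between-set variance.
    Inputs are hypothetical stand-ins for the jackknife machinery's output."""
    m = len(estimates)
    point = sum(estimates) / m
    within = sum(sampling_variances) / m
    between = sum((e - point) ** 2 for e in estimates) / (m - 1)
    total_variance = within + (1 + 1 / m) * between
    return point, total_variance

# Example: a subgroup mean computed on each of five plausible-value sets,
# with its jackknife sampling variance from each set (numbers invented)
estimates = [226.1, 226.4, 225.9, 226.7, 226.2]
jackknife_variances = [1.21, 1.18, 1.25, 1.20, 1.22]
print(combine_plausible_value_results(estimates, jackknife_variances))
```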

NAEP Reporting

From the program's inception, NAEP has had the goal of reporting results in formats that are accessible to potential users, promote valid interpretations, and are useful to NAEP's varied constituencies. The NAEP program currently produces an impressive array of reports, including:

  • Report Cards. These are the primary reports of the results of main NAEP. Results are presented for the nation, for states (if applicable), for major demographic groups, and in relation to key context variables (e.g., for public and private schools).

  • State Reports. These report results from main NAEP, with a report tailored specifically for each participating state.

  • Focus on NAEP/NAEP Facts. These are two- or four-page mini-reports that summarize NAEP frameworks and assessment results and address topics of current and special interest.

  • Instructional Reports. These show performance data in relation to instructional background variables; they are issued 6 to 12 months after the Report Cards.

  • Focused Reports. These contain NAEP results from the special studies component of main NAEP (e.g., on the performance of English-language learners and students with disabilities or on special features of the assessments). These are also issued 6 to 12 months after the Report Cards.

  • Trends in Academic Progress. This is the primary report of the results of trend NAEP.

This differentiated product line is intended to serve a variety of audiences, with differing information needs, interest in findings, and sophistication in interpreting results.

SELECTED FINDINGS FROM PREVIOUS NAEP EVALUATIONS

Components of Current NAEP

Again, main NAEP and trend NAEP test different student populations and use distinct assessment exercises and administration procedures. The national and state components of main NAEP also use different administration procedures. There is a good deal of sympathy among policy makers, testing experts, and NAEP's evaluators for the need to streamline NAEP's designs (National Academy of Education, 1996, 1997; Forsyth et al., 1996).

NAEP's policy board, the National Assessment Governing Board (NAGB), has expressed concern over the inefficiency of maintaining main NAEP and trend NAEP; they recently announced plans to investigate more efficient design options. They said "[it] may be impractical and unnecessary to operate two separate assessment programs." They have called for a "carefully planned transition … to enable the main National Assessment to become the primary way to measure trends in reading, writing, mathematics, and science in the National Assessment program" (National Assessment Governing Board, 1996:10). NAGB also registered concern about the inefficiency and burden imposed on states by separate state and national NAEP data collections. To address this concern for future assessments, NAGB has said that "where possible, changes in national and state sampling procedures shall be made that will reduce burden on states, increase efficiency, and save costs" (National Assessment Governing Board, 1996:7).

Sampling Designs for Current NAEP

As we do later in this chapter, the National Academy of Education (NAE; 1992, 1993, 1996), KPMG Peat Marwick LLP and Mathtech (1996), the Design/Feasibility Team (Forsyth et al., 1996), and others have examined the sampling designs for NAEP. The NAE panel focused on the conduct and results of the state component of main NAEP in 1990, 1992, and 1994, reviewing sampling and administration practices for the state assessments. KPMG Peat Marwick and the Design/Feasibility Team examined both the national and state programs.

The National Academy of Education panel found that sampling procedures for the state assessment program were consistent with best practice for surveys of this kind and concluded that sampling and administration were done well for the state program (National Academy of Education, 1996). They expressed concern, however, about declining school participation rates as the program progressed and recommended that the National Assessment Governing Board (NAGB) and the National Center for Education Statistics (NCES) consider design changes to decrease sample size requirements or otherwise reduce the burden on states, particularly small states. They warned that heavy program requirements might threaten school and state participation rates, particularly in years when multiple subjects and grades are tested. The panel cautioned that diminished participation in the state program might have deleterious effects on national NAEP.

They and others have reviewed school and student sampling for national NAEP and concluded that the national samples are drawn by experienced staff using well-established scientific, multistage stratified probability sampling designs (KPMG Peat Marwick LLP and Mathtech, 1996). As noted earlier, the sampling design for trend NAEP parallels that for national NAEP.

As explained above, NAEP's inclusive frameworks require that a balanced incomplete block design be used for test administration. Although reviewers applaud the ingenuity of the design, some worry about the complexity and fragility of the analytic machinery the design necessitates (National Academy of Education, 1996). The NAEP program has been urged to explore alternatives for simplifying the design. The NAE panel warned that the frameworks for main NAEP push the limits of form design and may strain current methods, particularly in light of recent pressure to hasten scaling, analysis, and reporting. Reviewers point to anomalies in NAEP findings as indicators of design stress and call for research to develop a more streamlined design for the assessment (U.S. General Accounting Office, 1993; National Academy of Education, 1993; Hedges and Venesky, 1997).

Analysis Methods for Current NAEP

Continuing in this vein, reviewers observe that the complex models that allow NAEP to maximize information while minimizing testing burden for examinees are beginning to fray (National Academy of Education, 1996). They note that programmatic changes have burdened the already complex statistical design, citing the introduction of innovative assessment tasks that call for mathematical models suited to multicategory scoring and violations of local item independence; the need to repeat scoring, scaling, and analysis for each state participating in the state testing program; and increased pressure for innovation in assessment design and technology. After the 1994 administration, the NAE panel called for studies to validate the current analysis and scaling models. They asked for research to test the strength of the models used and their robustness to violations of assumptions (National Academy of Education, 1996). They also sought mechanisms for checking the integrity of NAEP data prior to their release.

NAEP Reporting

In past reviews of the NAEP program, the National Academy of Education (1996) defined four criteria for successful reporting; in laying out these criteria, they praised the program's steady progress toward these ends. The NAE panel examined (1) the accuracy of results, (2) the likelihood that results would be interpreted correctly by the intended audience(s), (3) the extent to which results are accessible and adequately disseminated, and (4) the timeliness with which results were made available.

The NAE panel made many positive statements about NAEP reports. They praised NAEP's innovative graphic formats for conveying the statistical significance of differences between states and over time; they applauded the map graphics, the more prevalent use of charts, simplified data tables, and shorter reports. They commented favorably on the introduction of Focused reports and on the summary reports for states.

However, the NAE panel and others have been critical of the length of time it takes to issue NAEP reports. The 1992 Report Card in reading followed test administration by more than 2 years; this time lag between administration and reporting was the longest ever experienced. The NAEP program has been strongly encouraged to press for more timely reporting.

Other reviewers join the NAE panel in making suggestions for the improvement of NAEP reports (Hambleton, 1997; Hambleton and Slater, 1996; Jaeger, 1992, 1996, 1997; Wainer, 1997; Silver and Kenney, 1997; Barron, 1999; Widmeyer Group, 1993). These analysts have encouraged NAEP's sponsors to:

  • continue ongoing efforts to search for data displays and report formats that are more comprehensible to the lay reader and more likely to yield correct interpretations, including enlisting media representatives to help identify the most comprehensible methods for displaying results;

  • produce more focused research reports for various audiences, including reports that draw on other research to corroborate and inform relationships observed in NAEP data;

  • provide more examples of assessment tasks and student responses; and

  • explore ways to support states in generating their own reports of NAEP findings.

THE COMMITTEE'S EVALUATION

We begin our own analysis of NAEP's design with a discussion of sampling, analysis, and reporting issues. From there, we turn to discussion of NAEP's multiple data collections. Again, this discussion relies on technical terminology to a greater extent than other chapters.

Sampling Designs for Current NAEP

The role of probability sampling is crucial for current NAEP, since it minimizes selection biases in making inferences from the sample to the population. As with any sample survey, NAEP is equipped to provide estimates at high levels of aggregation (national, state, gender), but it is not sufficiently fine-grained to provide estimates for low levels of aggregation, such as for schools or school districts. We, too, judge that the NAEP samples are selected using well-established stratified probability sampling designs by highly experienced contractors.

We share the concerns of other evaluators, NAGB, and NCES, however, about the testing burden NAEP imposes on small and low-density states and large school districts. In fact, we note that participation dropped to 40 states for the 1998 administration, compared with 44 states for mathematics and 43 for science in 1996. We discuss these concerns further in conjunction with our recommendations for streamlining NAEP's design.

Furthermore, although we question the analytical complexity that marks main NAEP's matrix sampling design, we note that the design, whereby examinees receive only a subset of items, seems an inescapable feature of NAEP. The alternative approach of limiting students to a narrow subject matter area does not permit broad assessment or the measurement of associations between achievement in different areas of knowledge within a particular subject.

Analysis Methods for Current NAEP

The analysis of NAEP is perhaps uniquely complex among national probability surveys, bringing together modern ideas in survey sampling, incomplete data analysis, and item response theory to yield inferences on a diverse set of topics. The complexity of the enterprise has led to appeals to simplify the procedure and yield results that are more stable, less time-consuming to produce, more easily understood by nontechnical audiences, and more easily used by secondary analysts. In a paper that appears in the volume that accompanies this report, Barron (1999) describes the analytic difficulties currently faced by secondary users of NAEP data.

On the one hand, both internal (Forsyth et al., 1996) and external (KPMG Peat Marwick LLP and Mathtech, 1996) reviews of the technical details of the NAEP analysis process suggest that much of the intricacy of the analysis methods appears justified. Alternative analysis approaches for the existing design may involve a sacrifice of statistical efficiency, and major modifications of the design to simplify the analysis would involve sacrifices in the depth and value of the surveys. On the other hand, the statistical machinery of current NAEP is thought to be fragile, and our evaluation raises questions about whether the analysis methods could be simplified. We discuss a number of these issues here.

Standard Error Calculation

An important feature of the jackknife method of computing standard errors is that it incorporates features of the sample design, such as clustering, stratification, and weighting; evidence suggests that standard errors computed using simple random sampling assumptions for NAEP would be seriously underestimated (Allen et al., 1996, 1998a). One approximation involved in the process of computing standard errors is worthy of mention and further study.

Item parameters are fixed at their estimated values when plausible values of the latent proficiencies are drawn. This approach does not allow uncertainty in the item parameter estimates to be reflected in the plausible values. Rubin (1987) calls this form of multiple imputation improper. Although the estimates based on improper multiple imputation are valid, standard errors tend to be underestimated, particularly when the fraction of missing information is large. One possible fix is to include the entire process of fitting the item response models and creating the plausible values in the jackknife standard error calculation. This option was considered by NAEP analysts, but it imposes an added computational burden on a process that already involves a great deal of computing. A less burdensome option is to estimate the IRT models on a different jackknifed sample prior to imputing each set of plausible values (Heitjan and Little, 1991). This approach incorporates uncertainty in the estimated item parameters in the plausible values at the expense of requiring five fits of each IRT model rather than just one. Studies to assess the impact of these refinements on standard errors appear worthwhile.
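
To make the baseline jackknife computation concrete before such refinements are considered, here is a minimal sketch of paired-jackknife (jackknife repeated replication) variance estimation for a weighted mean: within each variance stratum, one replicate drops one PSU and doubles the weight of the other, and the variance is the sum of squared deviations of the replicate estimates from the full-sample estimate. The data, strata, and weights are hypothetical, and operational NAEP replication also reflects nonresponse and poststratification adjustments.

```python
def weighted_mean(pairs):
    return sum(v * w for v, w in pairs) / sum(w for _, w in pairs)

def jrr_variance(strata):
    """Paired-jackknife (jackknife repeated replication) sketch for a weighted
    mean. `strata` maps each variance stratum to two PSUs, each PSU a list of
    (value, weight) pairs. One replicate per stratum: drop the first PSU and
    double the weights of the second; the variance estimate is the sum of
    squared deviations of the replicate estimates from the full-sample
    estimate. Hypothetical data and a simplified rule, for illustration."""
    full_pairs = [p for psu_a, psu_b in strata.values() for p in psu_a + psu_b]
    full = weighted_mean(full_pairs)
    variance = 0.0
    for dropped in strata:
        rep_pairs = []
        for name, (psu_a, psu_b) in strata.items():
            if name == dropped:
                rep_pairs += [(v, 2 * w) for v, w in psu_b]   # drop A, double B
            else:
                rep_pairs += psu_a + psu_b
        variance += (weighted_mean(rep_pairs) - full) ** 2
    return full, variance

# Toy sample: two variance strata, two PSUs each, (score, weight) per student
strata = {
    "stratum 1": ([(220, 1.0), (232, 1.0)], [(228, 1.2), (241, 1.2)]),
    "stratum 2": ([(210, 0.9), (225, 0.9)], [(236, 1.1), (219, 1.1)]),
}
print(jrr_variance(strata))
```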

Dimensionality

The dimensionality of the data is central to the scaling, analysis, and reporting of NAEP's large-scale assessment results. The current analytical approach is quite strongly tied to the assumptions of the IRT models used in the analysis. NAEP analysts spend some time assessing the fits of individual items to the models and checking by differential item functioning analysis that group differences do not remain after accounting for the estimates of proficiency that the items are intended to reflect. This model-checking activity is important and useful, but options for modifying the analysis based on its results appear limited to rejecting suspect items from the analysis. The sensitivity of answers to more wide-ranging modifications of the basic models, including models with higher dimensionality, appears worth exploring, to increase confidence that results are not unduly tied to unrealistic model assumptions. Studies that assess the dimensionality of NAEP data (Carlson, 1996; Zhang, 1997; Yu and Nandakumar, 1996; Sireci et al., 1999) do not appear to have uncovered major departures from unidimensionality, but the impact of potential violations on the final NAEP inferences is largely unknown.
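
As one concrete example of the kind of differential item functioning screen mentioned above, the sketch below computes the Mantel-Haenszel common odds ratio for an item across strata of matched proficiency; values far from 1 flag group differences that persist after conditioning on proficiency. This is a standard DIF method offered for illustration and is not necessarily the exact procedure used operationally in NAEP; the counts are invented.

```python
def mantel_haenszel_odds_ratio(tables):
    """Mantel-Haenszel common odds ratio across strata of matched proficiency.
    Each table is (ref_correct, ref_incorrect, focal_correct, focal_incorrect).
    An odds ratio far from 1 flags an item for which group differences remain
    after conditioning on proficiency. Counts are invented for illustration."""
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator

# Hypothetical item: three proficiency strata, reference vs. focal group counts
tables = [(40, 10, 35, 15), (30, 20, 28, 22), (15, 35, 12, 38)]
print(mantel_haenszel_odds_ratio(tables))
```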

Content Coverage

Reviews of the analysis process for main NAEP (KPMG Peat Marwick LLP and Mathtech, 1996) have concluded that simplifications of the analysis would limit the usefulness of results unless the underlying NAEP design is significantly modified. One possible alternative design would limit tests of individual students to relatively narrow content areas, with sufficient numbers of questions given to yield relatively precise estimates of proficiency in these areas from simple summaries such as total scores. The distributions of performance in each of these narrow areas of proficiency could then be easily computed as empirical score distributions, with standard errors computed using jackknife repeated replication. Summary measures such as means of proficiencies aggregated over the narrow areas would also be easy to derive, but distributions of the aggregate summaries would not be available, since there would be no information on how each student performs on areas other than the one tested. Thus, results from this simplified design and analysis would appear to be much more limited. Furthermore, the information obtained from each student would be circumscribed by focusing on the relatively precise measurement of a particular skill, rather than on less precise measurement of proficiencies for a wider range of skills.

NAEP Reporting

Much attention has focused on improving NAEP reports, and progress recently has been made in more quickly issuing the primary reports. However, in our view, the clarity of NAEP's main messages, the presentation of tables, graphs, and statistical data, and the general utility of the reports still can be much improved. In addition, we note that states continue to ask for a shorter timetable for releasing state results.

Specifically, our analysis of reporting for current NAEP and its timeliness, clarity, and utility suggests the following.

Timeliness

As we have noted, a major effort of the NAEP program has been to produce reports, especially the Report Cards, in a more timely manner. Earlier in the program, national results were issued 18 to 24 months after administration, and trend reports were published as long as 24 to 30 months after the data collection. The 1996 NAEP mathematics and science Report Cards (Reese et al., 1997; O'Sullivan et al., 1997) were published 11 and 13 months, respectively, after the administration. The 1996 trend report was issued in August 1997 (Campbell et al., 1997), approximately 15 months after the administration was completed. We find these time lines impressive.

Clarity

Despite recent improvements in reporting student achievement results for the nation, states, and major demographic groups, NAEP reports still are frequently viewed as being overly complex. We believe the complexity of reports is partly a function of the complexity of the program. As we have noted, the design and analysis of NAEP data are complex and hard to understand, even for relatively sophisticated users.

The development of clear and comprehensible reports for nontechnical readers is, and should continue to be, a high priority for the program. The committee believes that results can be presented in ways that are clear and comprehensible to nontechnical readers; developing such methods of presentation should remain a priority for NAEP analysts, and the limitations of the data should be deliberately and fully communicated in the reports. Specific efforts to enhance report clarity might include:

  • providing examples of assessment tasks to aid with interpretation,

  • field-testing all tables and displays prior to release,

  • developing and including a glossary of terms with reports,

  • developing summary reports that present NAEP findings in a concise format, and

  • developing and providing states with a protocol for state assessment media press packages, complete with examples of appropriate and inappropriate test interpretations.

We make other suggestions for increasing the clarity of reports below and in Chapter 4.

Reporting Metrics

An obstacle to understanding NAEP data is that results are reported on a proficiency scale that is only indirectly tied to performance on specific questions. As noted by the NAGB Design/Feasibility Team (Forsyth et al., 1996), a more promising approach may be to work harder to present results in a more intuitive and easily understood metric. The Design/Feasibility Team describes an approach that relates proficiencies to performance on a broad-ranging collection or market basket of items, rather than to the latent proficiency scales that emerge directly from the IRT models. They explained that plausible value predictions of performance on a standard market basket of items could be created and summary results presented in terms of these predictions. The underlying IRT models still would provide the glue for calibrating across items and individuals, but the analysis would be in terms of a more understandable metric. The item collection would be published so that users could review the items students attempted. We urge NAGB and NCES to conduct research on the market basket and other reporting metrics with the potential to simplify the interpretation of results by NAEP's users. Research on improved reporting metrics should receive as much attention as research on NAEP's psychometrics.
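
A minimal sketch of the market-basket idea, under the assumption that reporting uses 3PL item parameters for a fixed, published set of items: each plausible value is translated into a predicted percent-correct on the basket, and the predictions are averaged. The item parameters and plausible values below are hypothetical.

```python
import math

def market_basket_percent_correct(plausible_values, basket_items):
    """Translate plausible values on the latent proficiency scale into a
    predicted percent-correct on a fixed, published collection ("market
    basket") of items, using 3PL item parameters, and average over the
    plausible values. Item parameters and plausible values are hypothetical."""
    def p_correct(theta, a, b, c):
        return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

    per_draw = []
    for theta in plausible_values:
        expected_items = sum(p_correct(theta, a, b, c) for a, b, c in basket_items)
        per_draw.append(100.0 * expected_items / len(basket_items))
    return sum(per_draw) / len(per_draw)

basket = [(1.0, -1.0, 0.2), (1.1, -0.3, 0.2), (0.9, 0.4, 0.25), (1.2, 1.1, 0.2)]
print(market_basket_percent_correct([0.1, 0.3, -0.2, 0.0, 0.2], basket))
```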

We also encourage the NAEP program to reexamine the way that scale score and achievement-level results are reported in NAEP documents. When presenting descriptive and evaluative results in the Report Cards, the data should not be presented in disassociated chapters or separate reports written by different authors. The findings should be discussed in an integrated way and accompanied by a description of the relationship between the two portrayals. Reports should indicate how well students performed and how well that performance stacked up against expectations.

Utility

Despite the variety of reports, many users do not yet feel that NAEP reports serve them as well as they could. The concerns voiced by various current and potential audiences for NAEP reports were identified and summarized for NAGB in 1993 in a review by the Widmeyer Group. In general, policy makers, teachers, administrators, and parents said that achievement data are important but that NAEP results and reports do not point them to potential implications for policy and practice. We contend that the NAEP program should report descriptive (scale score/proficiency), evaluative (achievement levels), and contextual, interpretive information in a well integrated report series. Users would thus have on hand (1) information about levels of student performance, (2) an evaluation of how well student achievement measures up to performance standards, and (3) information that helps them better understand student strengths and weaknesses and guides them in thinking about what to do in response to the findings.

It will be challenging to present this comprehensive set of information in ways that make clear the interrelationships between the portrayals and the unique contributions of each report. NAEP Report Cards have included and should continue to include the descriptive and achievement-level results in the same reports. By necessity, reports that provide interpretive information internal to the test—that is, based on in-depth analyses of students' responses to individual items or sets of items—would follow initial reports. Generation of reports that draw on data from the coordinated system to help interpret NAEP results also would follow in a secondary reporting stage.

The release of second-stage reports should be guided by a dissemination strategy that seeks to garner as much, if not more, attention from the press, the public, and policy makers as the initial reports. The associations between reports and their unique objectives and contributions should be clearly and prominently articulated.

We extend this idea to the reporting of trend and main NAEP data. If mechanisms for integrating trend NAEP and main NAEP are implemented, the initial Report Cards should present current and trend results in tandem (not in separate reports or separate sections of the same report), since the trend information provides an important context for interpreting current results.

TOWARD A MORE UNIFIED DESIGN FOR NAEP

Like NAGB and NCES, the committee contends that NAEP's designs should be streamlined. There are a number of arguments for seeking to combine the data collection efforts. Several already have been mentioned; others are discussed next.

Rationale for Combining Designs

Inconsistent Findings

The existence of multiple assessments is potentially confusing to NAEP's constituencies; for example, it can lead, and has led, to situations in which the trend in results in two successive national NAEPs is in the opposite direction from the trend in successive trend NAEPs over the same time period. Figures 2-1 through 2-6 show NAEP results for reading and mathematics by grade for the national NAEP and the trend NAEP designs. The potential for confusion is illustrated in Figure 2-1, which plots summary results from grade 4 reading. For example, mean NAEP reading scores for grade 4 went up between 1988 and 1990 for national NAEP and down for trend NAEP over the same period.

FIGURE 2-1 Mean NAEP reading scores (fourth grade and age 9).

FIGURE 2-2 Mean NAEP reading scores (eighth grade and age 13).

FIGURE 2-3 Mean NAEP reading scores (twelfth grade and age 17).

FIGURE 2-4 Mean NAEP mathematics scores (fourth grade and age 9).

FIGURE 2-5 Mean NAEP mathematics scores (eighth grade and age 13).

FIGURE 2-6 Mean NAEP mathematics scores (twelfth grade and age 17).

Indeed, for the 12 most obvious data comparisons across grades and disciplines (the three grades and four time periods: from 1990 to 1992 and from 1992 to 1996 in mathematics, and from 1988 to 1990 and from 1992 to 1994 in reading), four show similar results on national NAEP and trend NAEP and eight show dissimilar results.[1] Replication is useful for uncovering methodological inconsistencies, but it is not obvious what conclusions can be drawn from discordant results across somewhat different designs.

[1] Two of the inconsistencies in national NAEP and trend NAEP data may be attributable to anomalies in the 1994 reading results for grades 4 and 12.

Meaningfulness of Trend Frameworks

Trend NAEP is designed to keep changes in design, administration, and questions to a minimum. The anomalies in the reading results for 1986 and 1994 NAEP (Zwick, 1991; Hedges and Venesky, 1997) demonstrated that very modest changes in data collection and assessment procedures can have unexpectedly large effects on assessment results. Analyses of the 1986 incident, including a set of randomized experiments built into the subsequent (1988) assessment, led measurement specialists to conclude that if you want to measure change, don't change the measure.

Despite the obvious wisdom of this approach in the short run, it may have some drawbacks over longer periods of time. It is not inconceivable that, held constant for long periods of time, frameworks become increasingly irrelevant by failing to reflect changes in curricula and instructional practice. An increasing gap between assessment and practice could make estimated trends from assessments built to old frameworks potentially misleading.

We examined this assertion in an attempt to push our thinking about design alternatives. We commissioned research to assess the relevance of NAEP trend items to current standards and instructional practice (Zieleskiewicz, 1999). Middle school teachers and disciplinary specialists were asked to examine a set of trend NAEP materials and main NAEP items to determine their relevance to current curriculum and instruction in mathematics and science. Respondents were asked about the extent to which students currently have opportunities to master the knowledge and skills addressed by the items. They also relayed their perceptions of the relevance of the trend NAEP and main NAEP items to national disciplinary standards. Zieleskiewicz sought the views of teachers in states on the vanguard of standards-based reform and in a randomly selected group of states. She also surveyed disciplinary specialists active in mathematics and science reform at the national level.

The resulting data are described and summarized in a volume of papers commissioned to inform our evaluation (National Research Council, 1999).

The data show that teachers and disciplinary specialists rated trend NAEP and main NAEP items similarly on students' opportunity to learn tested knowledge and skills and on their relevance to current curricula and national standards. That is, in this trial and on these dimensions, disciplinary specialists and middle school mathematics and science faculty did not distinguish between trend items and items written to current frameworks. We do not know whether similar data would result for the other grade levels in mathematics or science or for other subject areas, but for this grade and these subjects, the data showed that trend and main NAEP items are similarly aligned with current practice and standards. The findings run counter to the common presumption that trend instrumentation is dated and bolster arguments for developing and maintaining a single trend line for current NAEP. The data are consistent with arguments for streamlining trend assessment for current NAEP.

Costliness and Burden

As we have said, the current NAEP designs involve separate samples, tests, and data collection procedures. This practice is costly, since it constitutes essentially three different data collection programs. Past evaluators have discussed direct costs and attempted to estimate indirect costs for the state and national designs (KPMG Peat Marwick LLP and Mathtech, 1996). Currently, assessment of two subjects and two grades by state NAEP is nearly as expensive as testing two subjects at three grades by national NAEP. In addition, the separate data collections place a burden on small and low-population states and large districts that may have had a deleterious effect on participation. Additional inefficiencies are associated with ongoing administration of assessments for every trend line the NAEP program supports. As currently configured, every cycle of trend NAEP administration, analysis, and reporting adds $4,000,000 to NAEP program costs.

Merging the Main NAEP and Trend NAEP Designs

Many assert that maintaining a statistical series is the most important thing NAEP does (Forsyth et al., 1996), and we agree that this should remain a major priority in the future. However, the current means for achieving this goal are inefficient and not without problems, as discussed above. It is the committee's judgment that trend and main NAEP should be reconfigured to allow accurate and efficient estimation of trends. Our conception of a combined design would accord main NAEP the more stable characteristics of trend NAEP in repeated administrations over 10- to 20-year time spans. The main objective would be to minimize the design flux that has characterized main NAEP, with the goal that it reliably assess not only current levels but also trends in core subject areas. This proposal is consistent with the ideas about NAEP's redesign offered by NAGB (National Assessment Governing Board, 1997), the NAGB Design/Feasibility Team (Forsyth et al., 1996), and the NAE panel (National Academy of Education, 1997).

In a paper published in a volume that accompanies this report, Michael Kolen (1999a) offered a number of suggestions for phasing out the current trend data collection and continuing with main NAEP while maintaining a long-term trend line. As background for his proposals, Kolen discussed differences between the assessments, including variation in content, operational procedures, examinee subgroup definitions, analysis procedures, and results.

In cataloguing differences between the two designs, Kolen explained that the content specifications for trend NAEP were developed and have been stable since 1983/1984 for reading and writing and 1985/1986 for mathematics and science, whereas the frameworks for main NAEP have evolved. He noted that trend NAEP has a higher proportion of multiple-choice items, relative to constructed-response items, than main NAEP. In main NAEP, he said, students are given test items in a single subject area, whereas in trend NAEP students are tested in more than one subject area.

Kolen also explained that main NAEP oversamples minority students to permit subgroup comparisons, but trend NAEP does not. Subgroup definitions also differ for the two designs. Main NAEP identifies students' race and ethnicity from multiple sources, giving priority to student-reported information. Trend NAEP uses administrators' observations to designate students' race. Kolen noted the differences between grade-based sampling for main NAEP and age-based sampling and reporting for trend NAEP.

After recounting differences between the two assessments, Kolen presented five designs for estimating long-term trends with NAEP and laid out the statistical assumptions, linking studies, and research required to develop and support the designs. In one design, Kolen proposed monitoring long-term trends with the main NAEP assessment and using overlapping NAEP assessments to initially link main NAEP to trend NAEP and then to link sequential assessments whenever assessment frameworks and/or designs are modified. He explained that implementation of this design relies on the conduct of research to estimate the effects of differences between subgroup and cohort definitions and administration conditions on main NAEP and trend NAEP. Research to examine the effects of content differences and differences in item types for trend NAEP, main NAEP, and successive assessments would also be needed. Linking and scaling research would be needed initially to place main NAEP results on the trend scale or trend results on the main scale and, again, to continue the trend line as NAEP evolves. Because long-term trends would be assessed with main NAEP in this design, main NAEP must be more stable than it has been in the past, Kolen explained.
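
As a simplified illustration of what a linking study might produce, the sketch below applies mean/sigma linear linking: using a sample with scores on both assessments, it rescales new-form scores so that their mean and standard deviation match the old scale. This is only one of many possible linking methods and is far simpler than the designs Kolen describes; the scores are invented.

```python
import statistics

def mean_sigma_linking(new_form_scores, old_form_scores):
    """Mean/sigma linear linking: using a linking sample with scores on both
    assessments, return a function that places new-form scores on the old
    scale by matching the mean and standard deviation. One of several possible
    linking methods; the scores below are invented."""
    mu_new, sd_new = statistics.mean(new_form_scores), statistics.stdev(new_form_scores)
    mu_old, sd_old = statistics.mean(old_form_scores), statistics.stdev(old_form_scores)
    slope = sd_old / sd_new
    intercept = mu_old - slope * mu_new
    return lambda score: slope * score + intercept

to_old_scale = mean_sigma_linking([262, 275, 281, 290, 300],
                                  [255, 267, 272, 284, 295])
print(to_old_scale(280))
```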

In another design, Kolen suggested allowing main NAEP to change to reflect current curricula and using a separate trend assessment, with occasional updating, to maintain a trend line. With this design, modest changes in the content of trend NAEP would be allowed to ensure its relevance, but the operational conditions of the assessment would remain constant. This design would allow for the replacement of some items in the trend instruments, and alternate forms of the trend instruments would be equated. The design would continue to provide long-term trend estimates without an extensive research program, but it would require the continuation of both assessments. For Kolen's discussion of these and alternative models, see the volume of research papers that accompanies this report.

Assessing NAEP Disciplines

It is important to note that proposals for merging trend NAEP and main NAEP are limited to the large-scale assessments in reading, writing, mathematics, and science. We discuss this construction in greater detail below but, again, note that assessment of these disciplines using large-scale assessment methods is part of the core NAEP component of our proposal for new paradigm NAEP (see Chapter 1). If history, geography, or other disciplines are assessed frequently enough in the large-scale survey program to support trend estimation, these too would constitute core NAEP, but tracking trends back to the 1970s and 1980s would not be possible in these subjects.

As we stated in Chapter 1, NAEP should address those disciplines for which testing frequency generally prohibits the establishment of trend lines by using multiple assessment methods, rather than by treating them as components of the NAEP large-scale assessment program. This approach has two possible advantages: (1) by reducing scale and releasing resources, it enables more in-depth treatment of these subject areas and of the teaching and learning opportunities that define them, and (2) it affords more frequent measurement and trend estimation for core disciplines and may allow more thorough reporting in these subjects. We include the assessment of noncore subjects in our proposal for multiple-methods NAEP. Chapter 4 provides further discussion of this and other components of multiple-methods NAEP.

High School Testing

A number of conditions point to insufficient clarity about the meaning of results for high school examinees under the current designs. First, test administrators observe that some high school examinees do not make a serious effort to answer NAEP questions, rendering their scores of questionable value. The administrators' observations are corroborated by the high omit and noncompletion rates of 17-year-olds on trend NAEP and of seniors on national NAEP. Nonresponse rates are particularly high on the constructed-response items in national NAEP. Despite concerted efforts to date, the NAEP program and its stakeholders have been unable to identify workable incentives for high school students' participation and effort.

Second, the curricula of high school students are highly variable; course-taking patterns differ enough that it is difficult to render judgments about students' opportunity to learn tested content, particularly for older high school students. Finally, differential dropout rates muddy the interpretation of high school results across locales and over time: differing school-leaving rates make the meaning of score changes unclear, and the same logic applies to cross-state comparisons.

In the committee's judgment, NAGB and NCES should explore alternatives to the current assessment practices for twelfth graders. Testing high school students at an earlier grade (grade 10 or 11) or using longitudinal studies as the primary source of achievement data for high school students, with assessments still tied to NAEP frameworks, may bear consideration. In fact, NCES recently proposed a follow-up data collection on NAEP twelfth graders to study their postsecondary plans and opportunities. Following up on high school dropouts to include them in the NAEP samples should also be considered. Assessing high school students using multiple measurement methods (in smaller settings and perhaps with more engaging tasks) may moderate current motivation problems. Multiple-methods assessment may also permit collection of richer data on students' high school experiences and their plans for work, higher education, and the military. A shift to this strategy should occur in conjunction with the implementation of a new series of frameworks and assessments; otherwise, the current main NAEP short-term trend lines for high school seniors would be disrupted.

Streamlining the National and State Designs

National NAEP and state NAEP use the same instrumentation but differ in the populations for which inferences are to be made. If NAEP were being designed for the first time today, the idea of distinct samples and administration procedures for state and national estimates would no doubt be rejected quickly in favor of a single design that addresses both populations. Declining participation rates and the arguments about burden and inefficiency noted earlier suggest the need to coordinate the designs for national and state NAEP. In 1996 the NAE panel recommended that the scope and function of the state assessment program be reviewed in the context of an overall reevaluation of the NAEP program; at the same time, it noted that state NAEP is an important component of the NAEP program and recommended that it move beyond developmental status. We agree with this assessment and recommend that the state component be accorded permanent status in the next congressional reauthorization of NAEP. As state NAEP moves from trial to permanent status, it makes sense to consider streamlining the national and state designs.

NAEP historically has been successful at garnering participation in the state assessment program. State commitment to the 1998 program, however, declined in relation to earlier assessments. Fewer states signed up for 1998 testing than participated in 1996. The NAEP program suspects the decrease is attributable to increasingly heavy state and local testing requirements. Without a mandate to participate in NAEP and without local feedback for NAEP testing, state and district testing directors may accord state NAEP lower priority than other assessments (Kelly Weddel, National Center for Education Statistics, personal communication, April 8, 1998).

Separate state and national testing is costly, since it requires that national NAEP and state NAEP be run as essentially two different data collection programs. Recall that the state program costs as much as the national assessment while testing at fewer grades. As discussed by Rust (1996), a more coordinated design for the two components was considered by the contractors for the 1996 assessment but was rejected because of operational concerns involving equating and the choice of subjects and grades assessed. Despite this, in our view it may be possible to combine the two programs into a single design.

Several differences between the current state and national designs merit attention in any discussion of their possible combination. State NAEP and national NAEP could be combined only if both assess the same grades and subjects. State NAEP has assessed only fourth and eighth graders in mathematics, science, reading, and, in 1998, writing, and there appears to be little interest among the states in a state NAEP assessment of twelfth graders (DeVito, 1996). The coordination of state and national NAEP assessment cohorts and subjects is a solvable problem. For example, a combined program could assess reading, writing, mathematics, science, and any other subjects designated as core in grades four and eight; high school testing could continue with a national sample.

A second difference between current state and national NAEP is that national NAEP is administered by a NAEP contractor, whereas state NAEP is administered by school personnel, with training and monitoring (on a sampling basis) by a NAEP contractor. The use of school personnel for test administration is substantially less costly, at least in terms of direct costs to NAEP, than the use of a contractor for that purpose. However, the difference in procedures raises questions about the comparability of data derived from the two data collection approaches. Differences may be attributable to the actions of the test administrators, or they may be due to the potentially greater motivation associated with a test that yields scores for a state rather than for the nation.

Spencer (1997a) recently concluded that an equating adjustment may be necessary to bring estimates from data collected under state NAEP conditions into conformity with those from data collected under national NAEP conditions. He notes that comparisons of item responses in state and national NAEP showed that scores were generally higher in state NAEP than in a subsample of national data comparable to the state data. The average differences were small enough to be attributable to sampling error (that is, reasonably consistent with the hypothesis of no true difference) in 1992, but not in 1990 or 1994 (Hartka and McLaughlin, 1994; Hartka et al., 1997a). For example, in 1994 the difference between state NAEP and a comparable subset of national NAEP in percent correct on a common set of items was 3.1 percentage points (56.0 percent versus 52.9 percent), which is substantial. Furthermore, the differences in percent correct observed under state NAEP and national NAEP administration conditions were not uniform across states.
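A rough sense of when a gap of this size exceeds sampling error can be obtained from the standard error of a difference between two independent proportions, as sketched below. The sample sizes are hypothetical and the calculation assumes simple random sampling; NAEP's clustered design would require design effects and its own variance estimation, so the sketch is illustrative only.

```python
import math

def diff_se(p1, n1, p2, n2):
    """Standard error of the difference between two independent proportions,
    assuming simple random sampling (no design effect)."""
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# Percents cited in the text; sample sizes are assumed for illustration.
p_state, p_national = 0.560, 0.529
n_state, n_national = 2500, 2500

se = diff_se(p_state, n_state, p_national, n_national)
z = (p_state - p_national) / se
print(f"difference = {p_state - p_national:.3f}, SE = {se:.4f}, z = {z:.1f}")
# A z-statistic well beyond 2 would be difficult to attribute to sampling
# error; NAEP's actual comparisons rest on design-based standard errors.
```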

This and other research suggest that sizable calibration samples may be needed to adjust or equate estimates derived from the current state NAEP so that they are comparable to those from national NAEP. It is unclear whether calibration samples would be necessary in every state in which national NAEP data would be derived from state NAEP administrations. The need for calibration samples would reduce the cost savings and sampling efficiencies gained by combining state and national NAEP. Hence, a goal of a coordinated design would be to avoid the need for calibration samples by minimizing differences in administration between the state and national NAEP samples. This option seems preferable to making analytical adjustments for the effects of administration differences before data collected under different conditions are combined.

The third difference between state and national NAEP may be in levels of nonsampling errors. Nonsampling errors are, in general, difficult to analyze or even to detect. However, in the 1994 assessment, there appeared to be some differences in the rates of school nonparticipation on state and national NAEP in the fourth grade (Hartka et al., 1997b). The implications for bias, however, are unclear (Spencer, 1997a, 1997b). To some extent, they depend on how well the mechanisms used to adjust for the effects of nonparticipation (namely substitution and reweighting) function to eliminate bias. Some research suggests that these mechanisms have worked reasonably well in NAEP (Hartka et al., 1997b).
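The reweighting mechanism mentioned above can be illustrated with a weighting-class adjustment, in which participating schools in each adjustment class absorb the weight of nonparticipating schools in the same class. The column names, the single class variable, and the toy data below are assumptions for illustration and do not reflect NAEP's actual weighting specifications.

```python
import pandas as pd

def nonresponse_adjust(frame: pd.DataFrame) -> pd.DataFrame:
    """Weighting-class nonresponse adjustment: within each adjustment class
    (here, 'stratum'), inflate the base weights of participating schools so
    that they carry the weight of nonparticipating schools in the class."""
    out = frame.copy()
    class_totals = out.groupby("stratum")["base_weight"].transform("sum")
    resp_totals = (out["base_weight"] * out["participated"]).groupby(
        out["stratum"]).transform("sum")
    out["adj_weight"] = out["base_weight"] * out["participated"] * (
        class_totals / resp_totals)
    return out

# Toy example: two strata, one nonparticipating school in each.
schools = pd.DataFrame({
    "stratum": ["A", "A", "A", "B", "B"],
    "base_weight": [10.0, 10.0, 10.0, 20.0, 20.0],
    "participated": [1, 1, 0, 1, 0],
})
print(nonresponse_adjust(schools))
```

Whether such an adjustment actually removes nonparticipation bias depends on how similar nonparticipating schools are to participating schools within the same class, which is essentially the empirical question the research cited above addresses.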

Specific suggestions for streamlining the national and state designs rely on additional research. More needs to be known about the effects of the differences in participation rates, administration, and other potential sources of bias. In a paper in the volume that accompanies this report, Kolen (1999b) recounted design alternatives proposed by Spencer (1997a) and Rust and Shaffer (1997). The alternatives vary in sampling approaches, administration procedures, and analytic adjustments. In proposing next steps, Kolen laid out research questions that must be answered in attempting to streamline NAEP designs:

  • To what extent are the linking constants equal across states? Differences among states in ability, participation rates, and recruitment procedures should be investigated as variables that might influence linking constants.

  • How large is the random error component in estimating the linking constants?

  • To what extent does bias or systematic error influence the linking constants?

  • Do the differences in administration and recruitment conditions affect the constructs that are being measured by the NAEP assessments?

These questions should be thoroughly addressed before any design for combining national and state NAEP samples is implemented under current recruitment and administration conditions.
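To make the first two of these questions concrete, the sketch below estimates a simple additive linking constant and its standard error separately for several states from hypothetical score data. The states, sample sizes, score distributions, and the additive form of the link are all assumptions for illustration; NAEP would estimate such constants from matched item samples under its complex sampling design.

```python
import numpy as np

def state_linking_constants(results):
    """For each state, estimate an additive linking constant (mean score under
    state NAEP conditions minus mean score under national NAEP conditions) and
    a simple standard error for that constant."""
    summary = {}
    for state, (state_scores, natl_scores) in results.items():
        constant = np.mean(state_scores) - np.mean(natl_scores)
        se = np.sqrt(np.var(state_scores, ddof=1) / len(state_scores)
                     + np.var(natl_scores, ddof=1) / len(natl_scores))
        summary[state] = (constant, se)
    return summary

# Hypothetical proportion-correct scores for three states (not real data).
rng = np.random.default_rng(1)
data = {state: (rng.normal(0.56, 0.15, 2500), rng.normal(0.53, 0.15, 2500))
        for state in ["State A", "State B", "State C"]}

for state, (constant, se) in state_linking_constants(data).items():
    print(f"{state}: linking constant = {constant:.3f} (SE = {se:.3f})")
# Comparing the constants across states, relative to their standard errors,
# bears on the first two questions; bias and construct effects (the third and
# fourth questions) require design-based and validity studies instead.
```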

SUMMARY OF PROPOSED DESIGN FEATURES

A number of characteristics distinguish our proposal for a new paradigm NAEP:

  • Trends in reading, writing, mathematics, science, and other subjects for which there are sufficient resources would be estimated by core NAEP using large-scale assessment methods (separate testing for trend NAEP and main NAEP would be discontinued).

  • National and state estimates would be reported by core NAEP, with more efficient sampling and a reduced testing burden across the two designs.

  • For subjects for which administration frequency generally prohibits the establishment of trend lines, testing would occur at the national level using multiple measurement methods. (Multiple-methods NAEP is described in Chapter 4.)

Figure 2-7 shows new paradigm NAEP as we have discussed it.

MAJOR CONCLUSIONS AND RECOMMENDATIONS

Conclusions

Conclusion 2A. The existence of multiple NAEP assessments is confusing and creates problems of burden, costliness, and inconsistent findings.

Conclusion 2B. The current collection of meaningful NAEP data in the twelfth grade is problematic given the insufficient motivation of high school seniors and their highly variable curricula and dropout rates.

Conclusion 2C. Because of its complexity and the many demands of its constituents, NAEP has developed multiple, dissociated reporting metrics and types of reports.


FIGURE 2-7 Measures of student achievement, including new paradigm NAEP. NOTE: TIMSS = Third International Mathematics and Science Study; NELS = National Education Longitudinal Study; ECLS = Early Childhood Longitudinal Study.

Recommendations

Recommendation 2A. For reading, writing, mathematics, and science, combine main NAEP and trend NAEP into a single design that preserves the measurement of trends and allows periodic updating of frameworks. If resources allow, trends could be established in other subject areas.

Recommendation 2B. In those disciplines for which testing frequency generally prohibits the establishment of trend lines, assessment of student achievement should be accomplished using a variety of assessment methods and targeted student samples.

Recommendation 2C. Alternatives to current NAEP assessment practices for twelfth graders should be explored, including testing at grade 10 or 11, following up on high school dropouts to include them in NAEP's samples, and gathering data on the achievement of high school students primarily through NCES's longitudinal surveys.

Recommendation 2D. Coordinate the sampling and administrative procedures for national and state NAEP in order to reduce burden and decrease costs.


Recommendation 2E. The development of clear, comprehensible, and well-integrated reports of NAEP results should remain a high priority, and reports should be redesigned to reflect NAEP's streamlined designs.

Recommendation 2F. In order to accomplish the recommendations listed above, NAEP's research and development agenda should emphasize the following:

  • Estimation of the effects of differences in sample definition, content, task types, and administration procedures for trend NAEP and main NAEP with subsequent derivation of links to support the use of a single trend line in each discipline

  • Estimation of the effects of the differences in participation rates, administration procedures, and bias for state and national NAEP with subsequent development of more efficient sampling procedures

  • Exploration of alternatives for obtaining meaningful data from high school students, and

  • Development of clear, comprehensible reports and reporting metrics that provide descriptive, evaluative, and interpretive information in a carefully articulated and described report series.
