Suggested Citation:"2 Streamlining the Design of NAEP." National Research Council. 1999. Grading the Nation's Report Card: Evaluating NAEP and Transforming the Assessment of Educational Progress. Washington, DC: The National Academies Press. doi: 10.17226/6296.

2 Streamlining the Design of NAEP

Summary Conclusion 2. Many of NAEP's current sampling and design features provide important, innovative models for large-scale assessments. However, the proliferation of multiple independent data collections—national NAEP, state NAEP, and trend NAEP—is confusing, burdensome, and inefficient, and it sometimes produces conflicting results.

Summary Recommendation 2. NAEP should reduce the number of independent large-scale data collections while maintaining trend lines, periodically updating frameworks, and providing accurate national and state-level estimates of academic achievement.

INTRODUCTION

NAEP provides important information about the academic achievement of America's youth, and the assessment has many strong design features. For example, NAEP's sampling, scaling, and analysis procedures serve as important models for the measurement community. The frameworks and innovative assessment materials serve as guides for state and local standards and assessment programs, and state NAEP results provide a useful backdrop for state and local assessment data.

In this chapter, we describe and evaluate NAEP's current sampling, data collection, analysis, and reporting methods. As background, we review the current NAEP assessments, the sampling designs and analysis methods used, and the reports generated. We then briefly review the findings of previous evaluations and provide our own evaluation of the strengths and weaknesses of the current design. Our conclusions lead us to recommend strengthening NAEP's design and increasing its usefulness. We argue for reducing the number of independent large-scale data collections currently carried out. We discuss and provide proposals for:

Combining the trend NAEP and main NAEP designs in core subjects to preserve measurement of trends and allow updating of frameworks;
Using more efficient sampling procedures for national NAEP and state NAEP in order to reduce the burden on states and schools, decrease costs, and potentially improve participation rates;
Using multiple assessment methods to assess subject areas for which testing frequency generally prohibits the establishment of trend lines;
Exploring alternatives to the current assessment of twelfth graders by NAEP with the goal of minimizing bias associated with differential dropout rates and the differing course-taking patterns of older students, encouraging student effort, and expanding assessment domains to include problem solving and other complex skills critical to the transition to higher education, the workplace, and the military; and
Improving NAEP reports by providing (1) descriptive information about student achievement, (2) evaluative information to support judgments about the adequacy of student performance, and (3) contextual, interpretive information to help users understand students' strengths and weaknesses and better investigate the policy implications of NAEP results.

OVERVIEW OF NAEP'S CURRENT SAMPLING, DATA COLLECTION, ANALYSIS, AND REPORTING PROCEDURES

The National Assessment of Educational Progress is mandated by Congress to survey the academic accomplishments of U.S. students and to monitor changes in those accomplishments over time. Originally, NAEP surveyed academic achievement and progress with a single assessment; it has evolved into a collection of assessments that now includes the trend NAEP and main NAEP assessments. Main NAEP has both the national and state components. National NAEP includes the large-scale survey assessments and a series of special studies that are not necessarily survey-based. Special studies generally focus on specific portions of NAEP's subject domains and on the associated teaching and learning data. Current NAEP is described in the Introduction; Figure I-1 shows the components of the current program.

Components of Current NAEP

The primary objective of trend NAEP is to provide trend lines of educational achievement for the U.S. population and major population subgroups over extended time periods. To avoid disruptions in trend lines caused by differences in NAEP administration or content, administration procedures and assessment items for trend NAEP are held as constant as possible over time.

Main NAEP is a larger assessment program than trend NAEP; it provides more precise estimates of educational achievement in population subgroups, includes more contextual variables, and is based on frameworks that are updated on a regular basis to reflect changes in curriculum and pedagogical thought. Again, main NAEP includes both national and state components. The state data collections are structured to provide estimates with adequate degrees of precision for individual states.

Tables 2-1 through 2-3 summarize the administrations of current NAEP since 1984, with assessments based on the same frameworks indicated by the same symbol and joined by lines to indicate whether trend estimation is feasible. Note that, in addition to the trend lines established using trend NAEP, short-term trend lines for main NAEP have been established in reading in national NAEP (grades 4, 8, and 12) and state NAEP (grade 4) from 1992 to 1998. Short-term trend lines from 1990 to 1996 have also been established in mathematics in national NAEP (grades 4, 8, and 12) and state NAEP (grade 8; and for 1992-1996, grade 4). However, as noted previously, the short-term trend lines of national NAEP and state NAEP reflect different assessment materials and student samples than does trend NAEP.

NAEP's multiple assessment programs evolved to preserve trend lines, at the same time allowing for updating of NAEP frameworks, and to obtain state-level NAEP estimates in main NAEP. The distinct programs allow the objectives of each component to be achieved without compromising the aims of the others. However, it may be unnecessary to have separate assessment programs with such similar objectives. Later in this chapter, we consider whether there is a compelling need for distinct assessment programs or whether these activities could be merged.

Sampling Designs for Current NAEP

The NAEP program differs fundamentally from other testing programs in that its objective is to obtain accurate measures of academic achievement for populations of students rather than for individuals. This goal is achieved using innovative sampling, scaling, and analysis procedures. We discuss these procedures next. Note that their description and evaluation rely on technical terminology that is difficult to translate into nontechnical terms. Technical language is used in this chapter in a way that is atypical of the remainder of this report.

TABLE 2-1 NAEP Frameworks, Designs, and Samples by Discipline and Year Tested: Reading and Writing

TABLE 2-2 NAEP Frameworks, Designs, and Samples by Discipline and Year Tested: Science and Mathematics

TABLE 2-3 NAEP Frameworks, Designs, and Samples by Discipline and Year Tested: Geography, History, and Civics

NAEP tests a relatively small proportion of the student population of interest using probability sampling methods. Constraining the number of students tested allows resources to be devoted to ensuring the quality of the test itself and its administration, resulting in considerably better estimates than would be obtained if all students were tested under less controlled conditions. The use of sampling greatly reduces the burden placed on students, states, and localities in comparison to a national testing program that tests a substantial fraction of the nation's children.

The national samples for main NAEP are selected using stratified multistage sampling designs with three stages of selection. The samples since 1986 include 96 primary sampling units, each consisting of a metropolitan statistical area (MSA), a single non-MSA county, or a group of contiguous non-MSA counties. About a third of the primary sampling units are sampled with certainty; the remainder are stratified, and one unit is selected from each stratum with probability proportional to size. The second stage of selection consists of public and nonpublic schools within the selected primary sampling units. For the elementary, middle, and secondary samples, independent samples of schools are selected with probability proportional to measures of size. In the final stage, 25 to 30 eligible students are sampled systematically with probabilities designed to make the overall selection probabilities approximately constant, except that more students are selected from small subpopulations, such as private schools and schools with high proportions of black or Hispanic students, to allow estimates with acceptable precision for these subgroups. In 1996 nearly 150,000 students were tested from just over 2,000 participating schools (Allen et al., 1998a).
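The first stage of selection described above can be illustrated with a minimal sketch. It assumes hypothetical primary sampling unit names, sizes, and strata, and it covers only the first stage: certainty units enter with probability 1, and one unit per remaining stratum is drawn with probability proportional to size. It is not the operational NAEP procedure.

```python
import random

def select_psus(certainty_psus, strata, rng):
    """Return {psu_name: selection probability} for one first-stage sample.

    certainty_psus: names of units taken with probability 1.
    strata: {stratum_name: [(psu_name, size), ...]} for the remaining units.
    """
    selected = {psu: 1.0 for psu in certainty_psus}
    for psus in strata.values():
        names = [name for name, _ in psus]
        sizes = [size for _, size in psus]
        # Draw exactly one unit per stratum, with probability proportional to size.
        chosen = rng.choices(names, weights=sizes, k=1)[0]
        selected[chosen] = sizes[names.index(chosen)] / sum(sizes)
    return selected

# Hypothetical frame: two certainty units and two strata of smaller units.
rng = random.Random(0)
certainty = ["metro_A", "metro_B"]
strata = {
    "stratum_1": [("psu_1", 120_000), ("psu_2", 80_000), ("psu_3", 50_000)],
    "stratum_2": [("psu_4", 30_000), ("psu_5", 90_000)],
}
sample = select_psus(certainty, strata, rng)

# First-stage base weight: the inverse of the selection probability.
base_weights = {psu: 1.0 / p for psu, p in sample.items()}
```

In a full design, the base weight of a student would be the product of the inverse selection probabilities across all three stages, which is why later-stage probabilities can be chosen to make overall probabilities approximately constant.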

The sampling design for state NAEP has only two stages of selection (schools and students within schools), since clustering of the schools within states is not necessary for economic efficiency (Allen et al., 1998b). In 1996 for each state, approximately 2,000 students in 100 schools were assessed for each grade. Special procedures were used in states with many small schools for reasons of logistical feasibility.

The national and state designs limit students to one hour of testing time, since longer test times are thought to impose an excessive burden on students and schools. This understandable constraint limits the ability to ask sufficient questions in the NAEP subject areas to yield accurate assessments of ability for individual students or subareas in a discipline. Time limits and NAEP's expansive subject-area frameworks have led to students receiving different but overlapping sets of NAEP items, using a form of matrix subsampling known as balanced incomplete block spiraling. The data matrix of students by test questions formed by this design is incomplete, yielding complications for the analysis. The analysis is currently accomplished by assuming an item response theory model for the items and drawing multiple plausible values of the ability parameters for sampled students from their predictive distribution given the observed data (Allen et al., 1998a).
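To make the idea of a balanced incomplete block design concrete, the following sketch builds a small textbook design: 7 item blocks arranged into 7 booklets of 3 blocks each, so that every pair of blocks appears together in exactly one booklet. The block count, booklet length, and the `spiral` helper are illustrative assumptions; actual NAEP booklet maps are larger and differ by subject.

```python
from collections import Counter
from itertools import combinations

N_BLOCKS = 7
BASE = (0, 1, 3)  # a perfect difference set modulo 7

# Cyclic construction: shift the base booklet through all 7 residues.
booklets = [tuple((b + shift) % N_BLOCKS for b in BASE) for shift in range(N_BLOCKS)]

# Balance checks: how often each pair of blocks shares a booklet,
# and how often each block appears overall.
pair_counts = Counter(
    frozenset(pair) for booklet in booklets for pair in combinations(booklet, 2)
)
block_counts = Counter(block for booklet in booklets for block in booklet)

def spiral(n_students):
    """Hand out booklets in rotation so that neighboring students get different booklets."""
    return [booklets[i % len(booklets)] for i in range(n_students)]

assignments = spiral(10)
```

Because each student sees only 3 of the 7 blocks, the students-by-items data matrix is incomplete by design, but every pair of blocks is observed together for some students, which is what allows relationships among all items to be estimated.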

The school and student sampling plan for trend NAEP is similar to the design for national NAEP. Schools are selected on the basis of a stratified, three-stage sampling plan with counties or groups of contiguous counties defined by region and community type and selected with probabilities proportional to size. Public and nonpublic schools are then selected. In stage three, students within schools are randomly selected for participation. Within schools, students are randomly assigned to either mathematics/science or reading/writing assessment sessions, with item blocks assigned using a balanced, incomplete design. In 1996, between 3,500 and 5,500 students were tested in mathematics and science and between 4,500 and 5,500 in reading and writing (Campbell et al., 1997).

Analysis Methods for Current NAEP

Standard educational tests generally involve a large enough set of items to allow an individual student's proficiency on a tested topic to be captured with minor error from a simple summary, such as a total score or average test score. Since everyone takes the same test (or if different versions are used, the alternatives are carefully designed to be parallel), scores from different students can be compared directly and distributions of ability estimated. It was found that these simple approaches to analysis did not work well for the NAEP assessments since the tests are short, and they contain relatively heterogeneous items so that, in combination, multiple test forms capture NAEP subject areas adequately. As a result, simple summary scores for NAEP have sizable measurement error, and scores from different students can vary significantly because of differences in the items appearing on individual test forms.

The analysis for main NAEP and trend NAEP needs a glue in order to patch together results from heterogeneous forms assigned to heterogeneous students into clear pictures of educational proficiency. The glue of current NAEP analysis is supplied by item response theory (IRT) modeling, which captures heterogeneity in items through item parameters and heterogeneity between students through individual student proficiency parameters. The basic forms of IRT used are the three-parameter logistic model (Mislevy et al., 1992) for multiple-choice or other right/wrong items and the generalized partial credit model of Muraki (1992) for items for which more than one score point is possible. Parameters are estimated for sets of homogeneous items by the statistical principle of maximum likelihood using the NAEP BILOG/PARSCALE program, which accommodates data in the form of the matrix samples collected (Allen et al., 1998a). A variety of diagnostic checks of these models are carried out, including checks of the homogeneity of the items (unidimensionality), goodness of fit of the models to individual items, and checks of cultural bias suggested by residual subgroup differences for students with similar estimated proficiencies.
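The two response models named above can be written down compactly. The sketch below uses the standard textbook forms of the three-parameter logistic model and the generalized partial credit model; the parameter names (a for discrimination, b for location, c for the lower asymptote, d for category parameters) and the scaling constant 1.7 are conventional choices assumed here, not details taken from this chapter.

```python
import math

D = 1.7  # conventional scaling constant (assumed here)

def p_3pl(theta, a, b, c):
    """Probability of a correct response under the three-parameter logistic model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def gpcm_probs(theta, a, b, d):
    """Category probabilities under the generalized partial credit model.

    d holds one category parameter per score point above zero, so an item
    scored 0..m has len(d) == m. Returns probabilities for scores 0..m.
    """
    logits = [0.0]  # score 0 is the reference category
    for d_v in d:
        logits.append(logits[-1] + D * a * (theta - b + d_v))
    top = max(logits)  # subtract the maximum for numerical stability
    expd = [math.exp(z - top) for z in logits]
    total = sum(expd)
    return [e / total for e in expd]

# A student of average proficiency on a hypothetical multiple-choice item
# and a hypothetical three-category constructed-response item.
p_correct = p_3pl(theta=0.0, a=1.0, b=0.0, c=0.2)
p_scores = gpcm_probs(theta=0.0, a=0.9, b=0.1, d=[0.5, -0.5])
```

With a single category parameter equal to zero, the generalized partial credit model reduces to a two-parameter logistic model, which is one way to see that the two forms belong to the same family.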

The IRT models relate main NAEP and trend NAEP items to a set of K scales of unobserved proficiencies (Allen et al., 1998a). Each sampled individual j is assumed to have a latent (K × 1) vector of unobserved proficiencies Θ_j, the values of which determine the deterministic component of responses to items related to each scale. Given the estimates of item parameters, the predictive distribution of each individual student's Θ_j can be estimated based on the observed performance on the items. This predictive distribution is multivariate and conditioned on the values of fixed background variables characterizing the student. For each student j, five sets of plausible values (Θ_j1,…,Θ_j5) are drawn from this predictive distribution. Five sets are drawn to allow the uncertainty about the latent proficiencies, given the limited set of test questions, to be reflected in the analysis. This step is an application of Rubin's (1987) multiple imputation method for handling missing data and is called the plausible values methodology in the NAEP context (Mislevy, 1985). Once plausible values are imputed for each individual student on a common scale, inferences can be drawn about the distribution of proficiencies, and proficiencies can be compared between subgroups and over time. For main NAEP, cutscores along the proficiency scales can also be determined to reflect levels of performance that are judged to represent basic, proficient, and advanced achievement.
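A heavily simplified, single-scale sketch may help fix the idea of plausible values. It assumes a normal prior whose mean stands in for the prediction from background variables, three hypothetical right/wrong items with known three-parameter logistic parameters, and a discrete grid approximation to the predictive distribution. The operational procedure is multivariate and considerably more elaborate, so the function and parameter names here are illustrative only.

```python
import math
import random

GRID = [-4.0 + 0.1 * i for i in range(81)]  # proficiency grid from -4 to 4

def posterior_on_grid(responses, items, prior_mean, prior_sd):
    """Normalized posterior over GRID given 0/1 responses and (a, b, c) item parameters."""
    log_post = []
    for theta in GRID:
        lp = -0.5 * ((theta - prior_mean) / prior_sd) ** 2  # normal prior, up to a constant
        for x, (a, b, c) in zip(responses, items):
            p = c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))
            lp += math.log(p if x == 1 else 1.0 - p)
        log_post.append(lp)
    top = max(log_post)
    weights = [math.exp(lp - top) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]

def draw_plausible_values(responses, items, prior_mean, prior_sd, rng, n_draws=5):
    """Draw n_draws values of proficiency from the grid posterior."""
    post = posterior_on_grid(responses, items, prior_mean, prior_sd)
    return rng.choices(GRID, weights=post, k=n_draws)

# Hypothetical item parameters (a, b, c) and one student's responses.
items = [(1.0, -1.0, 0.2), (1.2, 0.0, 0.2), (0.8, 1.0, 0.2)]
rng = random.Random(1)
pvs = draw_plausible_values([1, 1, 0], items, prior_mean=0.0, prior_sd=1.0, rng=rng)
```

With only three items the posterior is wide, so the five draws for one student typically differ noticeably; that spread is the uncertainty the multiple draws are meant to carry into later analyses.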

Statistics of interest, such as proficiency distributions for the current NAEP samples and for subgroups defined by demographic characteristics, can be regarded as functions of aggregates of predicted latent proficiencies and student characteristics g(Θ_j,y_j) for each student j. As in the analysis of many probability surveys, sampled individuals who contribute to the aggregate statistics are weighted to allow for differential inclusion probabilities arising from sample selection, unit nonresponse adjustments, and poststratification. The sampling variance of estimates, initially ignoring uncertainty in the Θ_j, is computed by jackknife repeated replication, an established method for computing survey sampling errors that takes into account the stratification, clustering, and weighting of the complex sample design (Kish and Frankel, 1974). The uncertainty in the Θ_j is then incorporated by adding to the average of the jackknife sampling variances of the statistic computed for each set of plausible values {Θ_jk : k = 1,…,5} a component of imputation variance based on the variability of the estimates computed from each set of plausible values. This computation is an application of Rubin's (1987) multiple imputation method.
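The variance computation described above can be sketched for the simplest statistic, a weighted mean. The sketch assumes that replicate weights have already been constructed, that the jackknife variance is the sum of squared deviations of the replicate estimates from the full-sample estimate (as in a paired jackknife; other variants apply a multiplying factor), and that the imputation component is combined using Rubin's (1987) rule, total = average sampling variance + (1 + 1/M) × between-set variance. The function names are illustrative.

```python
def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def jackknife_variance(values, full_weights, replicate_weights):
    """Sum of squared deviations of replicate estimates from the full-sample estimate."""
    full = weighted_mean(values, full_weights)
    return sum((weighted_mean(values, rw) - full) ** 2 for rw in replicate_weights)

def combine(plausible_value_sets, full_weights, replicate_weights):
    """Return (estimate, total variance) across M sets of plausible values.

    plausible_value_sets: M lists, each holding one plausible value per student.
    """
    m = len(plausible_value_sets)
    estimates = [weighted_mean(pv, full_weights) for pv in plausible_value_sets]
    sampling = [
        jackknife_variance(pv, full_weights, replicate_weights)
        for pv in plausible_value_sets
    ]
    estimate = sum(estimates) / m
    avg_sampling_var = sum(sampling) / m
    # Between-set variance reflects uncertainty in the latent proficiencies.
    imputation_var = sum((e - estimate) ** 2 for e in estimates) / (m - 1)
    return estimate, avg_sampling_var + (1.0 + 1.0 / m) * imputation_var

# Tiny hypothetical example: 4 students, 2 sets of plausible values,
# and a single replicate identical to the full sample (no sampling variance).
weights = [1.0, 1.0, 1.0, 1.0]
estimate, total_var = combine([[1, 2, 3, 4], [2, 3, 4, 5]], weights, [weights])
```

In this toy case the two set-level means are 2.5 and 3.5, so all of the reported variance comes from the imputation component; with real replicate weights the sampling component would usually dominate.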

NAEP Reporting

From the program's inception, NAEP has had the goal of reporting results in formats that are accessible to potential users, promote valid interpretations, and are useful to NAEP's varied constituencies. The NAEP program currently produces an impressive array of reports, including:

Report Cards. These are the primary reports of the results of main NAEP. Results are presented for the nation, for states (if applicable), for major demographic groups, and in relation to key context variables (e.g., for public and private schools).
State Reports. These report results from main NAEP, with a report tailored specifically for each participating state.
Focus on NAEP/NAEP Facts. These are two- or four-page mini-reports that summarize NAEP frameworks and assessment results and address topics of current and special interest.
Instructional Reports. These show performance data in relation to instructional background variables; they are issued 6 to 12 months after the Report Cards.
Focused Reports. These contain NAEP results from the special studies component of main NAEP (e.g., on the performance of English-language learners and students with disabilities or on special features of the assessments). These are also issued 6 to 12 months after the Report Cards.
Trends in Academic Progress. This is the primary report of the results of trend NAEP.

This differentiated product line is intended to serve a variety of audiences, with differing information needs, interest in findings, and sophistication in interpreting results.

SELECTED FINDINGS FROM PREVIOUS NAEP EVALUATIONS

Components of Current NAEP

Again, main NAEP and trend NAEP test different student populations and use distinct assessment exercises and administration procedures. The national and state components of main NAEP also use different administration procedures. There is a good deal of sympathy among policy makers, testing experts, and NAEP's evaluators for the need to streamline NAEP's designs (National Academy of Education, 1996, 1997; Forsyth et al., 1996).

NAEP's policy board, the National Assessment Governing Board (NAGB), has expressed concern over the inefficiency of maintaining main NAEP and trend NAEP; they recently announced plans to investigate more efficient design options. They said "[it] may be impractical and unnecessary to operate two separate assessment programs." They have called for a "carefully planned transition … to enable the main National Assessment to become the primary way to measure trends in reading, writing, mathematics, and science in the National Assessment program" (National Assessment Governing Board, 1996:10). NAGB also registered concern about the inefficiency and burden imposed on states by separate state and national NAEP data collections. To address this concern for future assessments, NAGB has said that "where possible, changes in national and state sampling procedures shall be made that will reduce burden on states, increase efficiency, and save costs" (National Assessment Governing Board, 1996:7).