Grading the Nation's Report Card: Research from the Evaluation of NAEP (2000)

Chapter: 9 Difficulties Associated with Secondary Analysis of NAEP Data

Suggested Citation:"9 Difficulties Associated with Secondary Analysis of NAEP Data." National Research Council. 2000. Grading the Nation's Report Card: Research from the Evaluation of NAEP. Washington, DC: The National Academies Press. doi: 10.17226/9751.

9 Difficulties Associated with Secondary Analysis of NAEP Data

Sheila Barron

The National Assessment of Educational Progress (NAEP) has tracked academic achievement for over a quarter of a century, providing some of the best data available on the academic performance of students in America's schools. NAEP began as a relatively simple assessment of student achievement that reported the percentage of students who could correctly answer individual questions. It has evolved into a set of complex assessment systems designed to serve a variety of purposes.

One purpose of NAEP is to provide a rich database that can be used by secondary analysts to address important educational issues. There are a number of challenges associated with this function of NAEP. First, the research questions of interest are not set out in advance of the development of the assessments and accompanying questionnaires. Thus, the developers of NAEP must try to anticipate what data will be most useful to secondary analysts and the level of precision needed. Second, providing data for secondary analysis is only one function of NAEP. Thus, the developers of NAEP must try to balance the anticipated needs of secondary analysts with other, sometimes competing, NAEP functions. Third, the data must be provided to researchers in a usable form along with adequate documentation and support.

NAEP data are used by researchers with varied backgrounds and interests. Content-area specialists, sociologists, economists, and psychometricians have all wanted to use NAEP to answer important questions in their respective fields. The data each of these groups would like NAEP to provide differ, sometimes dramatically. Their knowledge of measurement issues important in understanding the NAEP data also varies considerably, as do their backgrounds in statistical analysis and their ability to use large and complex databases. Thus, it is clear that the challenges to developers of NAEP are considerable.

Those responsible for the NAEP assessments have tried to meet these challenges by listening to the concerns of secondary analysts and, when possible, making changes to the questionnaires and assessments, the means by which the data are provided, and the NAEP documentation. In addition, they have developed training and special materials for helping researchers use NAEP data. Despite these considerable efforts, researchers continue to have significant problems conducting secondary analyses of NAEP data.

This paper addresses the difficulties that researchers encounter when they attempt to use NAEP data and the means by which the National Center for Education Statistics (NCES) and the Educational Testing Service (ETS) have tried, and are trying, to improve the usability of the data. The problems of secondary analysts who use secure NAEP data as well as researchers who use other NAEP data (e.g., published statistics, public-release data) were of interest. The information presented in this paper was collected through informal interviews with a number of NAEP secondary analysts as well as a number of staff members at NCES and ETS. This paper provides an overview of the potential difficulties one may encounter when using NAEP data and makes recommendations for improving the usability of NAEP data for secondary analysis in the future.
LITERATURE REVIEW

Although this is a topic researchers involved with NAEP talk and commiserate about often, very little has been written about the difficulties secondary analysts confront. Kenney and Silver (1996) discuss lessons they learned as content experts working with NAEP data. They concluded that the way in which NAEP findings are organized and reported may discourage researchers from using the data, specifically researchers who know little about the complex structure of the assessment but who are experts in curriculum and pedagogy. In addition, they found that the actual student responses to the extended constructed-response questions, a potentially rich source of information, were usable only by investing amounts of time and money that would be prohibitive to most researchers.

Although few articles have been written that specifically address the difficulties of using NAEP data, information about these difficulties can occasionally be found in papers in which analyses of NAEP data are reported. Lee et al. (1997) used NAEP data to look at the effect of high school course offerings on equity and student achievement. They included a candid summary of the difficulties they encountered using NAEP data. Specifically, they did not find many of the relationships between background variables and student achievement that are routinely observed in data on student achievement. They concluded that the outcome variable of interest in their study, students' mathematics achievement, was flawed in that particular iteration of the NAEP survey because of the conditioning model used to scale the data. However, this was not initially apparent, and Lee et al. reached this conclusion only after poring over detailed technical documentation of the NAEP scaling procedures.

OVERVIEW OF NAEP

NAEP is not a single assessment but a system of assessments. The main NAEP assesses students in a small number of subjects approximately every two years. The contents of the various assessments reflect current thinking about what students should know and be able to do. Samples of students in grades 4, 8, and 12 from across the country are tested using both multiple-choice and constructed-response questions. In a given subject area, not all students respond to the same set of questions. A large number of test booklets are constructed in which test questions are grouped into blocks and blocks are assembled into booklets. The main NAEP assessment is divided into a national administration and a state-level administration. In the state-level administration, students are sampled from participating states at rates that allow for accurate estimates of the distribution of proficiency at the state level.

The long-term trend NAEP assesses students in reading, mathematics, science, and writing approximately every two years. The content of the assessment has been the same since the mid-1980s. Samples of students ages 9, 13, and 17 from across the country are tested using primarily multiple-choice questions in all subjects except writing. As in the main NAEP, in a given subject area, not all students respond to the same set of questions. However, the structure of the trend NAEP is less purposeful than that of the main NAEP.
The trend NAEP came into being after problems resulted from trying to link the 1986 assessment in reading to the 1984 assessment. The anomalous results were concluded to be due to changes in the measurement conditions (i.e., timing and item order) across the two assessments (Beaton and Zwick, 1990). The decision was made to create a trend assessment in which consistency over time was rigidly maintained. A small number of booklets from the 1984 or 1986 administration (depending on the subject area) were chosen for use in the trend assessment, and these booklets have been used in all subsequent administrations of the trend assessment.

Scaling and reporting of the data are similar for both sets of assessments. Item response theory (IRT) procedures are used to estimate item characteristics (e.g., difficulty, discrimination). The resulting item parameter estimates are used along with the item responses and background information collected on the examinees to estimate the distribution of student proficiency. Because it is the distribution of student proficiency that is of interest rather than estimates of proficiency for individuals, individual scaled scores are not generated. Rather, five plausible values are generated that are based on the distribution of possible scaled scores for the individual.

NAEP periodically reports the results of its main, state, and trend assessments. The primary results that are presented are averages for the population as a whole and for important subgroups (e.g., Hispanics) as well as the percentage of students reaching various performance standards. In addition to responding to test questions, students are asked a number of background questions. Data are also collected on schools and teachers.

METHODS

This paper has three objectives: (1) to outline the difficulties secondary analysts have in using NAEP data; (2) to discuss the means by which NCES and ETS have attempted to address these problems; and (3) to develop recommendations for improving the usability of NAEP data.

To outline the difficulties that secondary analysts have using the data, interviews were conducted (either by phone or e-mail) with researchers who have conducted secondary analyses of NAEP data or who have received training on secondary analysis of NAEP data but have not used the data outside the training. The pool of potential interviewees came from a number of sources: a list of members of the American Educational Research Association's (AERA) Special Interest Group on Research Using NAEP Data; researchers who have received secondary analysis grants from NCES; a list of attendees at a 1996 NAEP training session sponsored by NCES; first authors of papers involving secondary analysis of NAEP data presented at the 1997 AERA national conference; and referrals from other researchers.

It is difficult to identify the total number of researchers in the pool of potential interviewees because the names provided by these various sources overlapped to a degree.
In addition, a small number of the people on these lists were employees or former employees of NAEP contractors, who likely would have greater familiarity with the data than a typical secondary analyst and thus may not have encountered the same problems using the data. A rough estimate of the number of potential interviewees in the resulting pool is 80 to 90.

Only a subset of the researchers in the pool of potential interviewees was contacted. Time did not permit a full-scale mail survey. In addition, phone numbers or e-mail addresses were not provided for most of the people. A total of 43 researchers were contacted by phone or e-mail and asked to respond to a series of questions about their experiences with NAEP data. Fourteen researchers provided answers to the questions. Some went into great detail about their experiences and the strengths and weaknesses of the NAEP data, whereas others provided more cursory responses. The researchers varied widely in the depth of their experience with NAEP data. Several researchers could be called repeat users, but only two were involved in studies that delved into a number of different issues.

The researchers came from a variety of backgrounds and differed in their research objectives. The largest proportion was interested in using NAEP data to model the effects of various student and school characteristics on student proficiency. Typical of this type of research was a study that sought to explain group differences in performance using information provided in the NAEP background variables. The types of analyses conducted by these researchers were typically regression based. Hierarchical linear modeling (HLM) was mentioned by most researchers. For the most part, these analysts used data from the background variables and the plausible values.

A number of the researchers interviewed for this paper were involved in research that looked at measurement issues. Two researchers were interested in the validity of background variables; two were interested in linking state NAEP data to data from a state testing program; two were interested in the dimensionality of the data; another conducted research looking at the impact of motivation on performance. The data used and the types of analyses conducted by these researchers varied widely.

Only a couple of researchers were interested in the content of NAEP in a particular subject area. One was a content expert who was interested in extracting information about what students can do in different areas of mathematics; the other was interested in the content validity of the science assessment. The data of interest to these researchers differed substantially from those of other researchers. It was not necessary to run any analyses on the data files to obtain the information needed for this research.
What was needed was access to the NAEP items and student responses as well as information about item-level performance that could be obtained from the published statistics.

In addition to secondary analysts, five researchers who underwent training to use NAEP data but who had not yet done any NAEP analyses were interviewed. These researchers were asked about the training they attended and about why they had not used NAEP data following the training.

In addition to collecting information from secondary analysts, three people who work on NAEP (two from ETS and one from NCES) and are knowledgeable about issues concerning secondary analysis of NAEP data were interviewed. They were asked about training opportunities and other assistance available to secondary analysts as well as other efforts NCES and ETS have made to facilitate NAEP research. Because a redesign of NAEP is currently being considered, special attention was given to the implications that possible changes in NAEP could have for secondary analysts.

Recommendations were developed based on the comments of secondary analysts and NAEP staff. Some of the recommendations came directly from secondary analysts; others were developed by looking for practical ways to ameliorate the problems that secondary analysts reported.

RESULTS

Overall, NAEP secondary analysts were very positive about the training they attended, special computer programs written to facilitate use of the data, and the helpfulness of ETS and NCES staff. Secondary analysts had positive and negative things to say about NAEP documentation. Comments about getting access to the data and the complexity of the data were largely negative.

All but one of the researchers interviewed described at least one problem that he or she encountered when trying to analyze the NAEP data. This researcher said that the NAEP system is very complex and takes a lot of effort to understand. He went on to say that he encountered many challenges with the data but no problems. The median number of problems reported was three, and the maximum was seven.

Discussions with NAEP secondary analysts identified six areas of concern: (1) obtaining the data, (2) timing of the availability of data, (3) complexity of the methodology used in NAEP, (4) form and organization of the data, (5) documentation, and (6) getting help. These six areas are not completely independent; that is, difficulties in one area often affect other areas.

Obtaining NAEP Data from NCES

Researchers who wanted data that are available only on secure data files reported difficulties obtaining the data in the first place. NAEP data come in several forms: published reports and data compendiums, public-release data files (for assessments before 1990), and secure data files. To obtain secure NAEP data, a site license is needed. The procedure for obtaining a license can be arduous, especially in large organizations such as universities and state government agencies. For example, one researcher needed the signature of the state attorney general in order to obtain the data at the university where she was a graduate student. This proved especially difficult as it was an election year and the attorney general was busy campaigning.
Several researchers who attended training seminars on NAEP but who had not done any research using the data said that, although the NAEP data are very relevant to their research interests, they have not used the data because of how difficult it is to obtain a site license. One researcher said that the reason his university did not have the data was that the procedure is so complicated and requires so much paperwork that nobody was willing to get involved in it.

The difficulties associated with obtaining a site license stem from the government's concern that the information provided to researchers about schools and students be used properly and not released to the public. The potential exists when using secure data for a researcher to use the information provided about schools to identify individual schools and to use these data to the detriment of the schools. Researchers commented that they felt these security precautions are "far too extreme" and that the risks have been "overdramatized." In addition, one researcher indicated that the data he required did not need to be secure, but because the data were available only on the secure files he had to go through the whole process of obtaining secure data.

There do not appear to have been any efforts by NCES to make the process of getting a site license easier. However, because of the way the law is written, there may be little that NCES can do to change the process. According to NAEP staff, efforts are currently under way to explore, once again, providing public-use files that would not require a site license. These files would have the information that makes districts and schools individually identifiable removed. However, to make these files available to secondary analysts, the law authorizing NAEP would need to be changed.

Although no problems concerning access to the published data were reported, NCES is making efforts to make it even easier to obtain those types of data. The NCES Web site (www.nces.ed.gov/NAEP) currently has many reports and data compendiums available for downloading. In addition, there is an extensive catalog of NAEP publications and data products with information for placing orders.

Timing of the Availability of NAEP Data

Several researchers were unhappy about the long lag between the administration of NAEP and the availability of the data for secondary analysis. It takes about a year from the administration of NAEP for the results to be released to the public. It takes much longer for the data and the accompanying technical documentation to be available for secondary analysts. According to the NCES Web site, the technical report for the 1996 science assessment was released in January 1998, almost two years after the assessment was administered. When researchers were interviewed for this paper in January 1998, several complained that they were still waiting for data from the 1996 NAEP assessment.
One researcher commented that he did not understand why it took so long, given that commercial test publishers, albeit with simpler systems, get results out in six weeks. According to NAEP staff, their current priority is to make assessment results available to the public in as timely a fashion as possible. Other data (e.g., questionnaire data, special studies data) and technical documentation are not an initial priority.

NCES and ETS have been making efforts to decrease the time between administration of NAEP and the release of results. These efforts, if successful, would conceivably have a positive impact on the timing of data availability for secondary analyses. However, these efforts have not been very successful. Because of the complexity of NAEP and the need for hand scoring of constructed-response questions, a great deal of work must be done before results can be reported. In addition, and largely because of the complexity of NAEP, problems have occurred in scaling the data. These problems cause additional delays or, in the worst cases, reanalysis of the data and a modification of results that have already been released.

Complexity of the NAEP Data

Issues stemming from the complexity of NAEP permeate many people's statements about using the data. A number of researchers made remarks that speak to the complexity of NAEP as a whole. For example, one researcher commented on how difficult it is to make sure one is doing the analyses correctly. He pointed out that with NAEP data there are many opportunities to make mistakes. Another researcher reported obtaining anomalous results and not being able to discover why. He commented that NAEP is so complicated that, even with top-notch psychometricians working on the project, they could not figure out whether the anomalous results were real or were an artifact of the NAEP data. Researchers commented on a number of specific aspects of the NAEP design and methodology that they thought contributed to its complexity: clustered sampling, BIB (balanced incomplete block) spiraling, conditioning student achievement on background information, and plausible values.

Clustered Sampling

Clustered sampling is often an issue in research involving education. Because students are grouped into classrooms and schools, the students in a given group usually look more alike than a random sample of students. This issue affects NAEP analyses because examinees are chosen for NAEP by first drawing a sample of schools and then a sample of students from within the chosen schools. Thus, the assumption of most standard statistical tests that instances of measurement be independent is violated. Because this assumption is violated, special methodology is required to compute estimates of sampling error in NAEP. There are two methods recommended in the NAEP documentation: design effects and jackknifing.
Using design effects is relatively simple but gives only crude estimates of the standard errors. In this method, standard errors computed using formulas for independent observations are inflated by a design effect, which is an estimate of how large the impact of clustering (and other sources of dependency among the observations) is on the sampling error variance. Estimates of the design effects are provided to the secondary analyst in the NAEP documentation.

Jackknife estimates of standard errors are much more precise but historically have been difficult for many secondary analysts to compute. Using NAEP data, computing jackknife standard errors is accomplished by repeating the analysis of interest once for each set of jackknife weights; typically there are 62 sets. The variance of the jackknife estimates is the estimate of the sampling error variance.
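The two approaches described above can be sketched in a few lines. This is an illustration, not NAEP's operational code: the design effect and the replicate estimates are assumed to come from the NAEP documentation and from repeated runs of the analysis with each replicate weight set, and the function names are mine.

```python
import math

def design_effect_se(naive_se, deff):
    # Inflate a standard error computed under the independence assumption
    # by the design effect -- a crude correction for clustering.
    return naive_se * math.sqrt(deff)

def weighted_mean(values, weights):
    # A typical "analysis of interest": a weighted statistic that is
    # recomputed once per set of jackknife replicate weights.
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def jackknife_se(full_sample_estimate, replicate_estimates):
    # NAEP-style jackknife: the squared deviations of the replicate
    # estimates (typically 62 of them) from the full-sample estimate
    # sum to the estimated sampling error variance.
    variance = sum((r - full_sample_estimate) ** 2
                   for r in replicate_estimates)
    return math.sqrt(variance)
```

Under this scheme an analysis is run 63 times in total, matching the burden researchers describe below: once with the full-sample weights for the statistic of interest and once per replicate weight set for the standard error.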

Several researchers thought that the need to use a special procedure to compute standard errors was a hindrance to their research. One reported not undertaking analyses because of the lack of an easy way to compute standard errors. Design effects were thought to be too imprecise and jackknifing too labor intensive, requiring that each analysis be repeated 63 times (once to get the statistic of interest and 62 times to get the statistics that go into computing the jackknife standard error).

There is little chance that sampling for NAEP will change to eliminate the need for special computational procedures for computing standard errors. Clustering effects are inherent in educational settings, and designing NAEP so that it does not take advantage of the grouping of students into schools and larger aggregations would be prohibitively expensive. In addition, there are many types of analyses for which the clustered nature of the sample is desirable (e.g., examining school effects). However, there are other ways to facilitate analysis of NAEP data, and this is one example of how NCES and ETS have listened to the problems of secondary analysts and worked to find solutions. SPSS and SAS code that can be used to compute jackknife standard error estimates is now included with the data.

BIB Design

Most traditional assessments involve either all students taking a single form of the test or students taking one of a small number of forms designed to be as parallel as possible. These designs, either a single form or multiple parallel forms, have important advantages when the purpose of measurement is to provide precise estimates of achievement for individuals. The purpose of NAEP is quite different: to provide precise estimates of the distribution of achievement in important populations rather than estimates for individuals.
Thus, there are other designs that could optimize measurement efficiency. Assessment designs that increase measurement efficiency when individual scores are not the goal of the assessment generally involve having different students take different samples of items without trying to make the sets parallel. NAEP takes this approach. Items are bundled into blocks, and blocks are then assigned to booklets, creating a large number of "forms" of the assessment. Each block is bundled with every other block in at least one booklet, allowing the entire item covariance matrix to be calculated. (Some NAEP assessments use a variation on this design.) In this design, called a balanced incomplete block (BIB) design, no effort is made to make booklets parallel in the traditional measurement sense. What is important is that, given the same testing time and number of students, greater coverage of the content domain can be achieved using a BIB design than with traditional designs.
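The defining property of a BIB design, that every pair of blocks shares at least one booklet, can be illustrated with a small textbook example. The booklet map below is a classic (7, 3, 1) design, not an actual NAEP booklet assignment: 7 blocks assembled into 7 booklets of 3 blocks each, with every pair of blocks appearing together exactly once.

```python
from itertools import combinations

# Illustrative (7, 3, 1) balanced incomplete block design: each tuple is
# one booklet, listing the three item blocks it contains.
BOOKLETS = [
    (0, 1, 2), (0, 3, 4), (0, 5, 6),
    (1, 3, 5), (1, 4, 6), (2, 3, 6), (2, 4, 5),
]

def pair_counts(booklets):
    # Count how often each pair of blocks appears together in a booklet.
    # Co-occurrence of every pair is what allows the full item covariance
    # matrix to be estimated from the booklet-level data.
    counts = {}
    for booklet in booklets:
        for pair in combinations(sorted(booklet), 2):
            counts[pair] = counts.get(pair, 0) + 1
    return counts
```

Running `pair_counts(BOOKLETS)` confirms the balance: all 21 possible pairs of the 7 blocks occur, each in exactly one booklet, even though each student sees only 3 of the 7 blocks.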

SHEILA BARRON 181

The advantages of a BIB design for measurement efficiency are thought by the designers of NAEP to outweigh the disadvantages. The main disadvantage of a BIB design is that scores for individuals generated by using standard methods (either IRT or classical test theory methods) are likely to have more error in them than is tolerable.1 For this reason, NAEP does not report traditional estimates of student proficiency. The implications of this decision for secondary analysis will be considered in the next section, which covers the statistical technique of conditioning.

A second disadvantage of the BIB design is that analyses based on the item data are made more complicated and more error prone than would be the case with a more traditional design. Several researchers reported that it can take a great deal of effort to understand the structure of the item-level data and to reorganize such data to fit statistical programs that use item-level data. In addition, because only a fraction of the examinees are administered each item, the item-level statistics are not estimated precisely.2 This is especially true if the item statistics of interest are based on a subgroup of the population.

NAEP staff reported trying to assist secondary analysts by providing information in the NAEP documentation about the booklet and block codes needed to understand which items a specific examinee was administered. However, this appears to be the extent of the efforts made to address secondary analysts' concerns about the BIB design. Fundamental changes to NAEP have not been made because the advantages of the BIB design for measurement efficiency are widely thought to outweigh the disadvantages. The NAEP redesign will examine this issue and may come up with new alternatives.

Conditioning

The aspect of the methodology that appears to cause secondary analysts the most concern is conditioning student achievement on background information.
The process of conditioning on student background information in NAEP is also called "multiple imputations" or "plausible values methodology." For the purposes of this paper the scaling methodology and the resulting plausible values will be discussed separately. This distinction reflects how many secondary analysts look at NAEP: secondary analysts commented on the process by which the

1For the most part, increased error is not technically caused by the BIB design. A BIB design allows adequate content coverage while using less testing time, and it is the decrease in testing time (and thus a decrease in the information collected from individual students) that causes an increase in error variance.

2This depends, of course, on the number of examinees and the number of blocks of items. In some years and on some assessments, the total number of examinees has been high enough and the number of blocks low enough that the sample size for individual items was quite large.

data are scaled, the conditioning, and the "scores" that result from that process, the plausible values.

Conditioning is a Bayesian approach to scaling. It uses the information available from the assessment along with other information known about examinees to create estimates of proficiency. For example, if a student comes from an advantaged suburban school and reports other things that are associated with high performance but performs poorly on the assessment, his or her estimated proficiency (i.e., plausible values) will be higher than an unconditioned estimate of proficiency would be. The assumption is that this student was a victim of measurement error and his or her "true proficiency" is more like that of similar students with the same background characteristics. Likewise, a student who did well on the assessment but comes from a disadvantaged school and has other characteristics correlated with low performance would have an estimated proficiency lower than an unconditioned estimate would be.

According to The NAEP Technical Report (Allen et al., 1996), conditioning on background information results in better estimates of the distribution of proficiency for important groups.3 However, it also results in biased estimates of achievement for individual students. This requires researchers to take special precautions to ensure that their analyses and conclusions are not affected by this bias.

Conditioning can cause problems for secondary analysts who are interested in modeling the effects of student characteristics on achievement. The problem most widely discussed is a downward bias in the estimates of effects for variables that were not used in conditioning when they are included in an analysis with variables that were used in conditioning. Two researchers reported getting anomalous results when modeling the effects of student and school characteristics on student achievement.
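The shrinkage behavior described above can be illustrated with a deliberately stylized calculation. This is a simplification for intuition only, not ETS's actual conditioning model, and every number in it is invented:

```python
# Stylized illustration of conditioning: an examinee's estimate is pulled
# from the observed performance toward the mean predicted from background
# variables. Not the actual NAEP conditioning model; all values invented.

def conditioned_estimate(observed, background_prediction, reliability):
    """Precision-weighted blend of observed score and background prediction.

    `reliability` in (0, 1) is the weight given to observed performance;
    low reliability (short tests) pulls the estimate more strongly
    toward the background prediction.
    """
    return reliability * observed + (1 - reliability) * background_prediction

# A student from an advantaged school (high predicted score) who performs
# poorly: the conditioned estimate lands above the observed score.
low_performer = conditioned_estimate(observed=210, background_prediction=280,
                                     reliability=0.6)
# A high performer from a low-prediction background: estimate pulled down.
high_performer = conditioned_estimate(observed=290, background_prediction=220,
                                      reliability=0.6)
```

Averaged over many students with the same background, this pull is what sharpens group-level distribution estimates while biasing each individual's estimate.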
In addition, other researchers reported being concerned that their results were impacted by this bias.

ETS has made efforts to minimize these problems. The 1994 technical report (Allen et al., 1996) states that "the set of variables used [in conditioning was] defined with the aim of holding to low levels secondary biases in analyses involving a broad range of variables not included in the conditioning model." Thus, the problems that researchers reported pertaining to bias in estimates of the effects of variables not included in the conditioning model should be less of an issue in the more recent NAEP assessments.

Another problem that conditioning causes for secondary analysts is more fundamental. Researchers with years of experience with NAEP and strong backgrounds in statistics said that they still do not understand the methodology used to scale NAEP in anything more than general terms

3Another advantage of conditioning is that it allows estimates of proficiency to be obtained for individuals who answered all of the items either incorrectly or correctly, something that is problematic with traditional scaling methods.

and are unsure of the impact the scaling procedures have on analyses they have conducted or wish to conduct. They widely reported being uncomfortable using data in their research when they do not understand the scaling methodology used to generate the data.

The NAEP Technical Report (Allen et al., 1996) states that "when the underlying model is correctly specified, plausible values will provide consistent estimates of population characteristics." The impact of the model not being correctly specified has not been well researched and needs to be addressed. The technical report also states that conditioning allows key population features to be estimated consistently even when item booklet composition, format, and content balance change over time (Allen et al., 1996). However, it is not known to what degree changes to the item booklet composition, format, and content balance also change the degree to which the model has been correctly specified. Such changes may impact the results in unknown ways. There have been enough anomalies in the results to make this a serious concern.

ETS has tried to assuage people's concerns about conditioning but has failed, in the opinion of a number of researchers, to provide an adequate explanation of how conditioning impacts the data. ETS has not based its statements about conditioning on research using real data or data simulated to have characteristics similar to NAEP data.4 One researcher went so far as to say that she thought that ETS staff were patronizing and that they used overly abstruse statistical arguments. Another analyst characterized ETS's standard response to people's concerns about conditioning as "Trust me. It works."

Plausible Values

Instead of each student being given a score that is the best estimate of his or her "true score" given the information available from the assessment, students taking NAEP are given five plausible values.
The plausible values are random draws from each individual's posterior distribution obtained using the information available from the assessment as well as background information.

The advantage of using plausible values is that error due to giving students only a sample of items is incorporated into the standard error estimates reported for NAEP. A second advantage is that measurement error is apparent to researchers using the data: if you do the analyses once for each plausible value, the results will be slightly different each time, and that difference is due to measurement error. With the type of scores given on most assessments, it is possible for researchers to forget that this type of error is present to some degree in all assessments.

4The theoretical underpinnings of conditioning are spelled out in papers by Rubin (1987) and Mislevy (1991).
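The run-once-per-plausible-value behavior described above can be sketched as follows. The data here are simulated, and for simplicity the sketch reports only the measurement-error component; a real NAEP analysis would combine it with the jackknife sampling variance:

```python
import numpy as np

# Sketch of an analysis repeated once per plausible value, on simulated
# data. Real NAEP analyses would also combine the between-value variance
# with the jackknife sampling variance.
rng = np.random.default_rng(1)
true_scores = rng.normal(250, 35, size=1000)
# Five plausible values per student: draws around each student's
# proficiency, perturbed by measurement error.
plausible_values = true_scores[:, None] + rng.normal(0, 12, size=(1000, 5))

# Step 1: run the analysis (here, a simple mean) once per plausible value.
estimates = plausible_values.mean(axis=0)
# Step 2: average the five results to get a single reported estimate.
point_estimate = estimates.mean()
# Step 3: the spread across the five results reflects measurement error.
measurement_var = estimates.var(ddof=1)
```

Running the same code five times with five different "score" columns is exactly the tedium secondary analysts described, which is why software that automates the loop matters.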

Plausible values are handled in secondary analyses in the following way: (1) the analysis of interest is performed five times, once for each of the five plausible values; (2) the standard deviation of the estimates resulting from the five repetitions of the analysis is computed; and (3) this standard deviation provides an estimate of the error in the statistic that is due to sampling of items. For most analyses the use of plausible values means more work for the analyst, but what needs to be done is straightforward and, though tedious, relatively easy to accomplish. One researcher commented on this aspect of NAEP by saying simply that it would be nice not to have the plausible values.

There are some types of analyses, however, where a single best estimate of each student's achievement is really needed. In these cases the use of plausible values by NAEP is a very real barrier to analysis of the data. One example is when the purpose of the analysis is to evaluate the scaling methodology used in NAEP. Two researchers reported wanting to do this type of research and being stymied by the lack of a single best estimate of student proficiency.

NCES and ETS have made efforts to make computation using plausible values easier for secondary analysts. The data extraction program that accompanies the data will automatically handle the plausible values. In addition, software for handling plausible values in HLM analyses was developed using an NCES grant. However, NAEP staff have been less responsive to researchers who need a single optimal proficiency estimate for each examinee, and ETS has not made its software available so that secondary analysts could produce their own estimates of proficiency.5

Form and Organization of NAEP Data

Several problems concerning the form and organization of NAEP data on the data files were reported by researchers interested in item-level data.
One researcher was frustrated because scored item data are not available on the data files. He reported that the file contained the item responses and a scoring key but, because of the use of blocks of items contained in different booklets, applying the scoring key to the data was a time-consuming process. Since scored data are needed for many types of analyses, making secondary analysts repeat the scoring appears to be an unnecessary and error-prone burden.

Another aspect of the item-level data that was troublesome to some secondary analysts was the order in which the items are listed on the file. Because of the BIB design, the order of presentation of items differs across examinees. A particular block may be administered first in one booklet and last in another.

5This author was told by NAEP staff that the first plausible value could be used for analyses requiring a single optimal estimate of proficiency. However, this clearly was not the understanding of the NAEP researchers who encountered this problem.

However, items are presented in only one order on the data files. Thus, the data files present the item responses in an order unrelated to the order in which the items were administered to examinees. If a researcher needs a dataset that has items listed in the order they were administered, it is necessary for him or her to construct it using information about the order in which blocks were presented in different booklets.

At this time, ETS does not provide scored item-level data. However, ETS does provide SAS and SPSS code for scoring the data and has reported being interested in adding the ability to score the data to the data extraction program. This is not a case where researchers are asking ETS and NCES to eliminate one way of doing things and go with a different way. Rather, researchers would like multiple presentations of the data to be available so that, depending on the purpose of their research, they could use the presentation most suited to their analysis. Providing both scored and unscored data to researchers would cost money by increasing the size of the data files. It would save money for secondary analysts (some of which comes through NCES in the form of research grants). Thus, this is an area where the tradeoffs need to be explored further.

Documentation

Issues related to NAEP documentation were mentioned frequently. Researchers rely heavily on the NAEP data reports (e.g., Mullis et al., 1993) and technical manuals (e.g., Johnson and Allen, 1992) to understand the data. Both positive and negative comments about documentation were common. On the positive side, a number of researchers said that NAEP is well documented. One researcher said the technical reports were her "bible" and that she relied on them heavily in doing her analyses. Another researcher said that, compared to the documentation he had seen for other large-scale assessments, NAEP documentation was very good.
On the negative side, researchers reported spending a great deal of time "slogging" through the technical manuals trying to understand the data and make sure they were doing the analyses correctly. Two researchers with a great deal of experience with large databases indicated that NAEP technical documentation was much more difficult to use than documentation for other NCES databases. One researcher reported that it was the descriptions of the actual data, not the technical information, that were not as clear as in other NCES programs. Several researchers indicated that they would like the documentation to provide more examples of how data analysis should be conducted. As one researcher put it, "I think [researchers] can generalize from examples better than they can from abstract recommendations."

One researcher was concerned about the usability of the documentation by content experts who may not be using the data files but who need to get item information directly from the NAEP documentation. She thought that information about items in the NAEP documentation is not well organized and that the documentation is difficult for content experts to use because information is scattered in many different places.

NCES and ETS are continually trying to improve their documentation, and it seems clear that over time there have been improvements. One researcher pointed out that because of the complexity of NAEP it is more difficult to explain the methodology and data than is the case for other assessments.

Getting Help

The researchers I spoke with had very positive things to say about the staff of NCES and ETS. One researcher called the people from NCES, ETS, and other organizations who conducted the NAEP training seminars "wonderful, helpful, and knowledgeable." However, several researchers reported that it is often difficult to get answers to questions about NAEP, especially answers to questions that are technical in nature or that relate to assessments other than the main NAEP or the state assessment.

Two difficulties mentioned were related to getting help with NAEP analysis problems. First, it is often not clear whom one should contact in order to get a question answered. One researcher recommended there be a staff person whose job it is to understand the data and answer secondary analysts' questions, or at least be able to route a caller to someone who would be able to answer the question. Second, the system is so complex that it is difficult to fully understand it or keep track of what has been done, even for people who work on NAEP full time. Thus, even though NCES and ETS staff try to be helpful, there are questions that arise that are not familiar to most NAEP staff.

Because of the complexity of NAEP data and the special procedures needed to properly analyze the data, NCES and ETS offer training seminars to people interested in conducting research using NAEP data.
Overall, researchers thought the training they received was very helpful. They liked the opportunity to meet and interact with NAEP staff. They liked getting hands-on experience with the data and being able to try out special software written to help secondary analysts with the special features of NAEP. One researcher reported that the training was most helpful in describing the data and identifying the problems one would likely encounter.

Most but not all comments about training were positive. On the negative side, one researcher reported that the training was least helpful in providing a model for how to correctly analyze the data. The same researcher thought that the training he received was insufficient to allow researchers to conduct methodologically sound research. Another researcher wanted training to be more geared to specific audiences; because of the varied backgrounds of the researchers in his training seminar, the leaders did not go into the depth about technical issues that he would have liked.

Overall, NAEP staff should be applauded for their efforts to help researchers deal with the complexity of the data. Many of the researchers I spoke with had benefited from training programs offered by NCES and ETS. Also, including SPSS and SAS code along with the data has improved the usability of the data. In addition, two researchers I spoke with received funding from NCES to develop special software for secondary analysts to use: software that would automatically handle things like plausible values and jackknife weights.

Summary of Results

The comments of secondary analysts ran the gamut from high praise to severe criticism. Only one researcher reported having no problems conducting secondary analyses of NAEP data. At the other extreme, there was one researcher whose experience with the data was so negative that she swore she would never use NAEP data again. In addition, there were researchers who had not used NAEP data because of the complexity of the data and/or difficulties associated with getting a site license. The comments differed among types of researchers: the comments of content experts differed somewhat from those of sociologists, and sociologists' concerns differed to some extent from those of psychometricians. In addition, comments tended to reflect the time when the research was conducted. Over the years a number of changes have been made to the NAEP assessments, and software has been developed to help secondary analysts use the data. Thus, some concerns raised by secondary analysts would be less problematic if the research were conducted today. Several researchers commented that some of the difficulties they encountered may be unavoidable in an assessment like NAEP, whereas others were adamant that there is no reason for the system to present such difficulties.

RECOMMENDATIONS

Before making recommendations it is important to highlight that there are tradeoffs involved in almost any decision that is made regarding NAEP.
Creating data that secondary analysts can use to address important issues in education is only one of the purposes of NAEP. Thus, where these recommendations conflict with what is best from the perspective of another purpose of NAEP, only the policy makers charged with balancing these purposes can judge what is best for the overall program.

If early reports about the NAEP redesign are true, use of NAEP data by secondary analysts is at risk of becoming a low priority in the new NAEP assessment system. Obviously, secondary analysts are opposed to this. NAEP data are currently a unique resource for addressing important questions in education. The primary analyses of NAEP data are generally restricted to descriptive statistics (e.g., means and percentages), and thus it is generally up to secondary analysts to

more deeply mine the data. NAEP has the potential, with changes designed to facilitate secondary analysis of the data, to be an even more important resource for addressing issues in education.

There are a number of recommendations to facilitate secondary analysis of NAEP data that resulted from interviews with NAEP secondary analysts and NAEP staff at NCES and ETS. Several of them would be relatively easy to implement. Others involve simplifying the NAEP design and scaling procedures.

Relatively Easy-to-Implement Changes

Improve Communication Between NAEP Staff and Secondary Analysts

First, better communication should be established between NAEP staff and secondary analysts. A number of researchers commented that, although NAEP staff are very helpful, there are no automatic lines of communication to help researchers and it is often difficult to get in touch with NCES staff. For example, one researcher commented that NAEP staff should be more proactive about providing researchers with information. She said that when there was an error in the data from one year, data that NCES had records of her obtaining, she was not notified of the error. It was up to her to find out about the error and request that NAEP send revised data files. A couple of researchers wanted more advertising of NAEP training opportunities.

Some of the comments made by secondary analysts illustrate the poor communication lines. One researcher requested there be an 800 number that secondary analysts can call to have questions answered. There is, in fact, an 800 number for reaching the NAEP staff at ETS (800-223-0267). However, that number is not included in any of the NAEP documentation I have seen, so it is not surprising that many researchers are not aware of it. Another researcher recommended that there be a staff member who understands the NAEP data designated to help secondary analysts.
ETS and NCES do have staff members for whom part of the job is to help secondary analysts. However, who these people are is not widely publicized. I was directed to Al Rogers at ETS (800-223-0267) for questions about the data files, accompanying program modules, and special software, and to Alex Sedlacek (202-219-1734) at NCES for questions about NAEP grants.

The NAEP Web site could help implement this recommendation. The Web site's address could be included with all NAEP reports, including pamphlets like the NAEP facts series. At the Web site, contact information for NAEP staff could be included, such as phone numbers and e-mail addresses. The Web site could also provide information about available training. A second way in which open communication could be facilitated is with a newsletter geared toward secondary analysts that informs researchers about data releases, known problems with the data, available training, NAEP contact staff, the NAEP 800 number, the address

of the Web site, and any changes to NAEP that are being made that might affect secondary analyses.

Create Differentiated Data Files

The second recommendation is that ETS create data files geared to different types of researchers. For example, for analysts who want to analyze item data, scored data should be available as well as unscored data. There are undoubtedly other special files that would facilitate secondary analysis, and better communication between NAEP staff and researchers would be helpful in identifying them.

Provide Guidelines and Examples for Specific Types of Analyses

The third recommendation involves changes to the documentation. Researchers asked for more examples of analyses using NAEP data. They also asked for more guidelines on how to conduct specific types of analyses. This may be an area that NAEP staff are wary of entering. For many types of analyses (e.g., how best to compare performance on NAEP at the state level), experts do not agree. Thus, for ETS or NCES to advocate a specific method could be controversial. However, it may be possible for NCES either to fund secondary analysts who are interested in documenting the pros and cons of different methods of conducting specific types of analyses or to bring together panels of experts to write a report on the issue.

Implementing this recommendation would also require that better communication be established. NAEP staff, specifically the ETS staff who write the NAEP technical documentation, reported knowing little about the specific interests of secondary analysts. They reported trying to make documentation as general as possible to allow for a wide audience of potential users. This is certainly a worthy goal and should be continued. However, researchers would like to see the general guidelines supplemented with examples of how the guidelines can be put into practice in specific situations.
More Ambitious Changes

Change the Requirements for Obtaining NAEP Data

The fourth recommendation is to make the data easier to obtain. Although this may not be easy to implement because it requires changes in the legislation authorizing NAEP, it seems apparent to many researchers that the security precautions related to NAEP are overly strict and as such are an impediment to broad use of the data by secondary analysts.

Simplify NAEP

Simplifying NAEP has the potential to ameliorate a number of the concerns voiced by secondary analysts: (1) simplifying NAEP could result in making NAEP results and data available in a much more timely fashion; (2) it could lessen confusion about the data; (3) it could reduce the amount of documentation needed to explain the data and result in documentation that is easier for secondary analysts to use; and (4) it could reduce the amount of help that secondary analysts need as well as make it easier to obtain help, because more people would understand the data and be able to help others. There are many ways in which NAEP could be simplified. Several are considered below: changes in the information collected, changes in sampling, and changes in scaling.

The Information Collected. One way to simplify NAEP would be to reduce the amount of information that is collected. For example, much, if not all, of the background and school-level information that is collected now could be eliminated, or the constructed-response questions could be eliminated so that hand scoring and polytomous scoring are not necessary. However, researchers who use the NAEP data clearly did not want changes that would reduce the amount of information available on students and schools. In fact, a number of researchers were concerned that a redesign of NAEP would reduce the background and school information collected, and they vocally opposed such a change.

Sampling. There are other changes that would be of interest to many NAEP researchers but that would add prohibitively to the cost of the program. For example, obtaining samples of students who are not affected by clustering would facilitate some types of analyses but would dramatically increase the cost of administering the assessment.

Scaling.
Changes in assessment design that would eliminate the need for conditioning (also called "multiple imputations" or "plausible values methodology") have the potential to simplify many aspects of NAEP. Conditioning is the aspect of NAEP about which the most secondary analysts voiced concern. Although conditioning is based on sound theoretical work (Rubin, 1987), it is far from clear that its advantages (i.e., that using conditioning means the item booklet composition, format, and content balance can be changed over time without adversely impacting the comparability of the results from different years) hold up using real data.

Is it feasible to redesign NAEP in such a way that conditioning is not required? Conditioning serves two purposes. The first is the one that is most commonly discussed: conditioning allows fewer items to be administered to each examinee. When conditioning was first used in NAEP, the combination of limited testing time (50 minutes) and testing in multiple subjects (as well as asking a

number of background questions) meant that students frequently spent as little as 15 minutes on a particular subject. Clearly, an individual's proficiency in a broad content area cannot be precisely estimated in that short a period of time. However, later administrations of NAEP changed from assessing individuals in multiple subjects to testing individuals in a single subject. Thus, a student now generally spends about 45 minutes being assessed in a single subject area.6 This amount of time is similar to the amount of time students are given on many tests used to generate individual scores. Thus, it is reasonable at least to explore whether conditioning is still needed because of the limited numbers of items administered to individuals.

One argument for the continued need for conditioning is that NAEP has changed from reporting a proficiency for a single broad content area, such as mathematics, to reporting proficiencies for more narrowly defined subareas, such as geometry. Thus, even though enough items might be administered to an individual to report unconditioned scaled scores in mathematics, not enough items are administered in each content area to accurately estimate unconditioned scaled scores in the subareas. Other ways to handle the additional error in the estimates of the distribution of performance on the subareas need to be explored.

The second, and possibly more important, reason for conditioning is its potential to allow changes in the assessment over time that would not be advisable otherwise.
The NAEP technical report states that conditioning allows key population features to be estimated consistently even when item booklet composition, format, and content balance change over time (Allen et al., 1996).7 However, even though, theoretically, conditioning allows such changes over time, not even the designers of NAEP believe it to the degree that they would rely on it with the long-term NAEP trend assessment.8 The mantra of the long-term trend assessment is "when measuring change, don't change the measure." And even though the content balance of the long-term trend assessment is widely thought to be outdated and uneven, that is exactly what has happened for 15 years: the exact same set of items has been administered in every assessment cycle in order to accurately maintain long-term trend lines. Although Zieleskiewicz (Chapter 6, this volume) found that the items on long-term trend NAEP are relevant and important in today's classrooms, there are important differences in the content balance of the long-term trend and main NAEP frameworks, differences that reflect shifts in what experts believe is important for students to know and be able to do (e.g., more emphasis on algebra and functions in mathematics in main NAEP). There has also been a shift toward greater use of alternative item formats on the main assessment. Conditioning is used because the long-term trend booklets are from the early administrations when individuals were assessed in multiple subjects.

6It is important to note that more of this time is used to administer performance items (large items) than was the case in earlier NAEP assessments.

7Questions about the advisability of making such changes to an assessment that is used to measure change are important but beyond the scope of this paper. What does the trend mean if what is assessed over time has changed? Such changes occur because, over time, content experts have changed what they believe students should know and be able to do. Pressure for the NAEP frameworks to reflect what is currently considered important for students to know and be able to do means that the frameworks change as current thinking changes. Thus, NAEP is subject to all of the shifts that occur in current thinking, such as the relative importance of teaching, and assessing, basic skills versus higher-order thinking skills.

8As noted in the overview of NAEP, the long-term trend assessment is a separate assessment from main NAEP and state NAEP. When results are presented that track achievement in reading, mathematics, science, and writing since the late 1960s, those results come from long-term trend NAEP.

Does conditioning allow accurate measurement of trends when the item booklet composition, format, and content balance change over time? In other words, are the assumptions of the conditioning model adequately met with real data? Is the conditioning model robust when the assumptions are not met? Under what conditions is the model robust? Under what conditions is it not? These questions have not been addressed in the measurement literature. Conditioning is not widely used outside NAEP, and the programs that carry out NAEP conditioning are not available outside ETS. Thus, it is very difficult for anyone outside ETS to carry out research on conditioning as it is used in NAEP. Without a better understanding of the degree to which the purported benefits of conditioning hold up using real data and simulations of these types of changes, it is impossible to judge the necessity of using conditioning in NAEP. Also important would be a comparison of the results with and without conditioning to examine its practical impact.

Reported "Scores." Eliminating the plausible values would also simplify secondary analysis of NAEP data.
Plausible values are used to make apparent to researchers that the estimates of student proficiency are just that: estimates. They contain measurement error as a result of asking students only a sample of questions. This is true of all assessments, from the SAT to licensure examinations for health care professionals. It is perhaps more important in NAEP because conditioning results in biased estimates of individuals' proficiency, and thus a side benefit of plausible values is that they help analysts remember that individuals' scores are not calculated to be "best" in the same way they are in other assessments. Plausible values also allow this measurement error to be estimated and used in establishing the significance of comparisons. However, measurement error is small in comparison to sampling error, and its omission from estimates of standard error is likely to have little practical impact on the reported results.

Consequences of a simpler NAEP design, allowing for the elimination of conditioning and plausible values, potentially include (1) quicker turnaround for NAEP results and NAEP data for secondary analysis, (2) less confusion about the data, (3) less documentation needed to explain the data and documentation that is easier for secondary analysts to use, (4) a reduced need for secondary analysts to receive extensive help in order to analyze the data, and (5) greater confidence in the validity of the results of secondary analysis of the data.

Conditioning and plausible values are fundamental aspects of the NAEP scaling procedures and may be considered by some not to be open to discussion, especially discussion focused on secondary analysis of NAEP data. The importance of producing data that are usable by secondary analysts with a wide variety of backgrounds is an issue for policy makers. The purpose of this study was only to propose ways in which secondary analysis of NAEP data could be facilitated, given that providing data for secondary analysis is currently part of the program's mission.

Although eliminating conditioning would increase the amount of upfront work that goes into test development, would require more discipline on the part of the people who decide the content of NAEP, and would mean other changes to the assessment, these changes may be feasible and not prohibitively expensive. In addition, added expense in terms of upfront work may well be made up by savings in the scaling and reporting phases. Clearly, a study of the feasibility of these changes would need to be made by independent researchers in order to empirically address the issue.

Summary of Recommendations

NAEP is a unique and rich source of information about the student population in the United States. Currently, much of NAEP's potential is not realized, however, because of the complexity of the data.
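The error accounting that plausible values support, discussed earlier in this paper, follows Rubin's multiple-imputation combining rules (Rubin, 1987; Mislevy, 1991; both cited in the references). A minimal sketch of those rules, using fabricated plausible values rather than real NAEP data (the array sizes, seed, and sampling variances are invented for illustration; operational NAEP derives sampling variances from jackknife replicate weights):

```python
import numpy as np

# Sketch of Rubin's (1987) combining rules as applied to plausible values
# (Mislevy, 1991). NAEP draws five plausible values (PVs) per student from
# the conditional proficiency distribution; here they are fabricated.
rng = np.random.default_rng(0)
n_students, n_pv = 1000, 5
pv = rng.normal(250.0, 35.0, size=(n_students, n_pv))

# Per-PV estimates of the statistic of interest (here, the population mean).
per_pv_mean = pv.mean(axis=0)

# Hypothetical sampling variances of each per-PV estimate (in operational
# NAEP these come from the jackknife replicate weights, not this formula).
sampling_var = pv.var(axis=0, ddof=1) / n_students

# 1. Point estimate: average of the per-PV estimates.
estimate = per_pv_mean.mean()

# 2. Within-imputation (sampling) variance: average of the sampling variances.
U = sampling_var.mean()

# 3. Between-imputation (measurement) variance: variance across PV estimates.
B = per_pv_mean.var(ddof=1)

# 4. Total variance inflates the sampling variance by the measurement component.
total_var = U + (1.0 + 1.0 / n_pv) * B
se = total_var ** 0.5
```

The between-imputation component B is the measurement-error piece that would be omitted if plausible values were dropped; the total standard error reflects both sampling and measurement uncertainty.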
The changes outlined above (clearer communication between NAEP staff and secondary analysts, documentation and data files geared toward different types of secondary analysts, and a simpler NAEP design) have the potential to dramatically increase the amount of research that is conducted using NAEP data, research that could be used to improve education and help students achieve to their potential.

ACKNOWLEDGMENTS

The author thanks all of the NAEP secondary analysts who shared their NAEP experiences; Larry Ogle from NCES and John Mazzeo and Al Rogers from ETS for providing information about NAEP; and Dean Cotton, Dan Koretz, and Kris Waltman for reviewing drafts of the paper.

REFERENCES

Allen, N.L., D.L. Kline, and C.A. Zelenak
1996 The NAEP Technical Report: 1994. Washington, D.C.: U.S. Department of Education, National Center for Education Statistics.

Beaton, A.E., and R. Zwick
1990 The Effect of Changes in the National Assessment: Disentangling the NAEP 1985-86 Reading Anomaly. Princeton, N.J.: Educational Testing Service.

Johnson, E.G., and N.L. Allen
1992 The NAEP Technical Report: 1990. Washington, D.C.: U.S. Department of Education, National Center for Education Statistics.

Kenney, P.A., and E.A. Silver
1996 Interpretive Reports for the Fifth, Sixth, and Seventh NAEP Mathematics Assessments: "Lessons Learned" from Year One of the Project. Paper presented at the annual conference of the American Educational Research Association, New York, April.

Lee, V.E., R.G. Croninger, and J.B. Smith
1997 Course-taking, equity, and mathematics learning: Testing the constrained curriculum hypothesis in U.S. secondary schools. Educational Evaluation and Policy Analysis 19(2):99-121.

Mislevy, R.J.
1991 Randomization-based inference about latent variables from complex samples. Psychometrika 56:177-196.

Mullis, I.V.S., J.A. Dossey, E.H. Owen, and G.W. Phillips
1993 NAEP 1992 Mathematics Report Card for the Nation and the States: Data from the National and Trial State Assessments. Washington, D.C.: National Center for Education Statistics.

Rubin, D.B.
1987 Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons.
