Suggested Citation:"Technical Issues in Test Development." National Research Council. 1999. Evaluation of the Voluntary National Tests, Year 2: Final Report. Washington, DC: The National Academies Press. doi: 10.17226/9684.

4
Technical Issues in Test Development

The year 1 evaluation encompassed an expansive review of pilot test and field test design features, including plans for numbers of items developed, numbers of examinees responding to each item, student sampling procedures, equating the various forms, and analyzing the resulting data. This year 2 evaluation focuses on the following:

the extent to which the design for pilot testing will result in items that represent the content and achievement-level specifications, are free of bias, and support test form assembly;
plans for the implementation of VNT pilot testing;
plans for assembling field test forms likely to yield valid achievement-level results; and
technical adequacy of revised designs for field testing, equating, and linking.

The committee's interim report (National Research Council, 1999c) focused on the first three topics. Since that time, a report was issued by NAGB's Linkage Feasibility Team (LFT) on issues associated with linking VNT scores to the NAEP scale and the NAEP achievement-level cutpoints on that scale. The committee reviewed this report and discussed it with NAGB staff and one of the LFT report authors at its July 1999 meeting. Committee members also observed the discussions of AIR's VNT Technical Advisory Committee at its meeting in June.

This final report includes an expanded discussion of the first three topics as well as the committee's findings and recommendations on issues associated with linking VNT scores to the NAEP scale. The committee reviewed the following documents:

Linking the Voluntary National Tests with NAEP and TIMSS: Design and Analysis Plans (American Institutes for Research, 1998g)
Designs and Item Calibration Plan for the 1999 Pilot Test (American Institutes for Research, 1998f)

Designs and Item Calibration Plans for Including NAEP Item Blocks in the 1999 Pilot Test of the VNT (American Institutes for Research, 1998e)
Proposed Plan for Calculator Use (American Institutes for Research, 1998j)
Field Test Plans for the VNT: Design and Equating Issues (American Institutes for Research, 1999c)
Score Reporting, Draft (American Institutes for Research, 1998l)
Test Utilization, Draft (American Institutes for Research, 1998n)
Score Reporting, Scoring Examinees, and Technical Specifications: How Should These Be Influenced by the Purposes and Intended Uses of the VNT? (American Institutes for Research, 1999g)
Selected Item Response Theory Scoring Options for Estimating Trait Values (American Institutes for Research, 1999h)
Final Report of the Study Group Investigating the Feasibility of Linking Scores on the Proposed Voluntary National Tests and the National Assessment of Educational Progress (Cizek et al., 1999)
Revised Plans for Linking the Voluntary National Test with NAEP (Johnson, 1999a)
Using Social Moderation to Link the VNT to NAEP (Paulsen, 1999)
VNT: Forms Assembly Procedures and Technical Specifications for the VNTs (AIR, for contract #RJ97153001, 1999l)
An Evaluation of the VNT Pilot Test Design (Reckase, 1999)
VNT Pilot Design Features (Johnson, 1999b)
Evaluation of VNT Pilot Test Design (Hanson, 1999)
Synthesis Paper on VNT Pilot Test Design Features (Ercikan, 1999)
Technical Specifications, Revisions as of June 18, 1999 (American Institutes for Research, 1999i)

Many of these documents describe design options on which the Governing Board has not yet taken a definitive position. This report provides comments and recommendations for NAGB's consideration.

PILOT TEST PLANS

Forms Design

Key features of the pilot test forms design are the use of school clusters, the use of hybrid forms, NAEP anchor forms, and item calibration procedures. Each of the first three features affects the item calibration plan.

Use of School Clusters

Current plans for pilot testing call for each participating school to be assigned to one of four school clusters. School clusters are used in the data collection design to minimize item exposure and to improve item security. Schools are assigned to clusters using a combination of random and systematic stratified sampling to maximize the equivalence of examinees across clusters. Forms are assigned to schools within each school cluster so that all schools in a given cluster are administered the same set of pilot forms. The forms are then distributed within each classroom in a given school in a systematic manner so that the examinees completing the different forms within each school cluster will be randomly equivalent.
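The assignment logic described above can be pictured with a minimal sketch. It is not the contractor's procedure: the school records, the stratum labels, and the function name `assign_clusters` are hypothetical, and the sketch assumes a random ordering and random start within each stratum followed by systematic rotation through the four clusters.

```python
import random
from collections import defaultdict


def assign_clusters(schools, n_clusters=4, seed=0):
    """Assign schools to clusters, balancing each stratum across clusters.

    `schools` is a list of (school_id, stratum) pairs; both fields are
    hypothetical stand-ins for whatever the sampling frame provides.
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for school_id, stratum in schools:
        by_stratum[stratum].append(school_id)

    assignment = {}
    for stratum in sorted(by_stratum):
        ids = sorted(by_stratum[stratum])
        rng.shuffle(ids)                   # random order within the stratum
        start = rng.randrange(n_clusters)  # random starting cluster
        for i, school_id in enumerate(ids):
            # systematic rotation: consecutive schools go to consecutive clusters
            assignment[school_id] = (start + i) % n_clusters + 1
    return assignment
```

Rotating within strata keeps the number of schools from each stratum nearly equal across clusters, which is one simple way to pursue the equivalence of examinees that the design calls for.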

The introduction of school clusters increases the complexity of the data collection design, which changes from a random-groups design to a common-item nonequivalent-groups design. The larger the number of school clusters used, the fewer the items that will be threatened by a security breach. However, as the number of school clusters increases, creating equivalent school clusters becomes more difficult, and the analyses become more complex.

We conclude that the choice of four school clusters is a good compromise between the need to minimize item exposure and the need to produce accurate item parameters. It is probably the smallest number needed to minimize loss in the event of compromise at a particular school while also minimizing the complexity for administration and analysis. This conclusion is consistent with recommendations in the papers on pilot test design commissioned by AIR (Johnson, 1999b; Hanson, 1999; Reckase, 1999).

Use of Hybrid Forms

The pilot test design calls for the creation of a number of "hybrid forms" that consist of the first half (45-minute session) of one form (e.g., 1a) paired with the second half of another form (e.g., 2b). Each pilot test form will resemble an operational "final" form insofar as possible with respect to length and administration time, distribution of items by content and format, and distribution of items with respect to other factors (such as calculator use). The use of hybrid or overlapping forms in the data collection design has merit because it permits accurate estimation of item parameters even if the groups within school clusters turn out not to be equivalent. Another advantage of the hybrid design is that it will allow intact NAEP blocks to be combined with VNT half-test blocks, which will provide a basis for comparing VNT and NAEP item difficulties and putting the VNT item parameters on the NAEP scale. To the extent that the NAEP blocks cover the content domain, it also will allow an assessment of the extent to which the VNT and NAEP measure the same areas of knowledge. Thus, we agree with the plan that a hybrid or other overlapping forms design be used in the pilot test.

Generally, data for item calibration must be collected either by administering different collections of items to equivalent samples of students or by administering overlapping collections of items to different samples of students. In the proposed pilot test design, parameters for items appearing in different forms must be placed on the same scale. Without the hybrid forms, the only way to do this is to assume that the random assignment of forms to students within school clusters has worked and has created equivalent groups of students taking each form. This assumption is somewhat risky because logistical problems commonly arise during test administration, leading to unintended deviations from the forms distribution plan. Such deviations affect the equivalence of the groups receiving each form. Thus, a more conservative procedure is to use an overlapping forms design, such as the one proposed by AIR, that provides for different groups of individuals within each school cluster to take overlapping forms.

The rationale for the use of the overlapping forms design is not clearly described in the materials we reviewed. The contractor needs to provide a better explanation for incorporating the hybrid forms design into the pilot study data collection design. The contractor suggests that two advantages of the hybrid forms design are that it provides some protection against violations of the assumption of local independence and that it permits investigation of item context effects. However, violations of local independence are most likely to occur within item sets, and item context effects are most likely to occur because of changes in item order within a test section. Thus, both of these effects are more likely to occur among items within a session than across sessions. It is therefore unclear to us how the proposed design will provide protection against these particular effects, although we endorse the use of hybrid forms, as noted above, for other reasons.

NAEP Anchor Forms

A major feature of the proposed VNT is that student performance will be reported in terms of NAEP achievement levels. To facilitate linking the two assessments, the most recent version of the pilot test design calls for the inclusion of NAEP item blocks in two of the four school clusters. The proposed item calibration plan calls for the estimation of NAEP item parameters along with VNT item parameters and thus implicitly assumes that NAEP and VNT measure the same content constructs. This assumption can be questioned since the distribution of item formats differs for the two assessments (e.g., differing numbers of constructed-response and multiple-choice items). Data for the groups of examinees in the proposed design who take VNT items in one session and NAEP items in the other session (e.g., 1a,Nb) can be used to assess the extent to which VNT and NAEP measure the same skills. For example, correlations between scores for two VNT sessions, between two NAEP sessions, and between a VNT session and a NAEP session can be computed and compared. We strongly support the inclusion of NAEP blocks in the pilot test design to provide data on the feasibility of a common calibration of VNT and NAEP items as a means of linking the two scales (see discussion of linkage issues below).
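The session-score comparison suggested above could be organized along the following lines. This is only an illustrative sketch: the form labels, the layout of the `groups` dictionary, and the function names are assumptions, and it uses plain Pearson correlations between the two session scores of each examinee group.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def session_correlations(groups):
    """Correlate session-1 and session-2 scores within each form group.

    `groups` maps a hypothetical form label (e.g. "1a,1b" for VNT-VNT,
    "1a,Nb" for VNT-NAEP, "Na,Nb" for NAEP-NAEP) to a list of
    (session1_score, session2_score) pairs, one pair per examinee.
    """
    return {
        label: pearson([p[0] for p in pairs], [p[1] for p in pairs])
        for label, pairs in groups.items()
    }
```

If the VNT-NAEP correlation were markedly lower than both the VNT-VNT and NAEP-NAEP correlations, that would be one sign that the two assessments are not measuring quite the same skills.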

Item Calibration

Item calibration refers to the procedures used for estimating item parameters or characteristics of items, such as difficulty level. For NAEP (and proposed for VNT), item calibration is accomplished by using procedures based on item response theory (IRT), a statistical model that expresses the probability of getting an item correct as a function of the underlying ability being measured. Item characteristics are group dependent, that is, an item may appear easy or hard depending on the ability level of the group taking the item. Thus, to compare the difficulty parameter estimates of items that were administered to different sets of examinees, it is necessary to place (or link) the different sets of item parameter estimates on a common scale. For VNT items, the desire is to link the item parameters to the NAEP scale. The item calibration and linking process is technically complex, and the committee's findings and suggestions are described below in a technical manner in order to be of the most use to the test developers.
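To make the idea of a common scale concrete, the sketch below uses a two-parameter logistic (2PL) item response function. The choice of the 2PL form is an assumption made for brevity and is not taken from the VNT or NAEP specifications, and the function names are hypothetical. The sketch shows that a linear change of the ability scale, with item parameters adjusted to match, leaves every response probability unchanged; this indeterminacy is why separately calibrated parameter sets must be linked before they can be compared.

```python
from math import exp


def p_correct(theta, a, b):
    """2PL item response function: probability of a correct response.

    theta is examinee ability, a is item discrimination, b is item difficulty.
    """
    return 1.0 / (1.0 + exp(-a * (theta - b)))


def rescale_item(a, b, A, B):
    """Re-express (a, b) after the linear change of scale theta* = A*theta + B."""
    return a / A, A * b + B


# The same examinee and item, expressed on two different scales,
# give the same probability of a correct response.
a_new, b_new = rescale_item(1.3, -0.4, A=1.2, B=-0.3)
same = abs(p_correct(1.2 * 0.7 - 0.3, a_new, b_new) - p_correct(0.7, 1.3, -0.4)) < 1e-12
```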

A variety of procedures can be used for obtaining item parameter estimates that are on the same scale using the common-item nonequivalent-groups design. The contractor presents three options: (1) simultaneous (one-stage) calibration, (2) two-stage calibration and linking, and (3) the Stocking-Lord test characteristic curve (TCC) transformations (American Institutes for Research, 1998e). The contractor states that a disadvantage of the Stocking-Lord procedure is that "the procedure may result in the accumulation of relatively large amounts of equating error, given the large number of 'links' in the chain of equatings required to adjust the test characteristic curves of some of the non-anchor forms. Also, it may be prohibitively time-consuming given the large number of computations required" (American Institutes for Research, 1998e:11). This rationale is also presented as the primary reason for making the Stocking-Lord procedure the least preferred method for putting item parameters on a common scale.

There are several alternative ways in which the Stocking-Lord TCC procedure can be implemented for the VNT pilot test design. Two options that deserve more consideration are presented below. Both use a computer program developed by the Educational Testing Service (ETS) called PARSCALE, the program that is used to produce item parameter estimates for NAEP.

In option 1, for each school cluster, perform one PARSCALE run that simultaneously calibrates all items within that cluster. Select a school cluster to use as the base scale (say, cluster 1). Use the item parameters for the common items (i.e., 1a, 1b, 2a, 2b) to compute a scale transformation from cluster 2 to cluster 1 and apply that transformation to the item parameter estimates for all items within cluster 2. Repeat this process for clusters 3 and 4. This option produces one set of item parameters for all items in the non-anchor forms, but it results in four sets of item parameters for the items in the anchor forms.

In option 2, perform one PARSCALE run for the anchor forms, combining data across school clusters, and then perform the four within-cluster PARSCALE runs described in option 1. The base scale is defined using the item parameter estimates for the anchor items from the across-cluster run. In each cluster, a scale transformation is computed using the item parameter estimates from the within-cluster run and the item parameter estimates from the across-cluster run.

Option 1 for implementing the Stocking-Lord TCC procedure requires three scale transformations, and option 2 requires four scale transformations. Neither option requires a "large" number of transformations, and both are as easy to implement as the two-stage calibration and linking procedure.
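A single scale transformation of the kind counted above can be sketched as follows. This is a simplified illustration rather than the contractor's or PARSCALE's implementation: it assumes 2PL items, evaluates the test characteristic curves on a fixed grid of ability values, and replaces numerical optimization with a coarse grid search. The function names and grid ranges are hypothetical.

```python
from math import exp


def p_correct(theta, a, b):
    """2PL probability of a correct response (assumed model for this sketch)."""
    return 1.0 / (1.0 + exp(-a * (theta - b)))


def tcc(theta, items):
    """Test characteristic curve: expected number-correct score at theta."""
    return sum(p_correct(theta, a, b) for a, b in items)


def stocking_lord_loss(A, B, base_items, new_items, thetas):
    """Sum of squared TCC differences after moving new_items to the base scale."""
    moved = [(a / A, A * b + B) for a, b in new_items]
    return sum((tcc(t, base_items) - tcc(t, moved)) ** 2 for t in thetas)


def fit_stocking_lord(base_items, new_items, step=0.05):
    """Grid-search the slope A and intercept B that best align the two TCCs.

    base_items and new_items hold (a, b) estimates for the same common items
    from two separate calibrations (e.g. cluster 1 and cluster 2).
    """
    thetas = [-4.0 + 0.2 * i for i in range(41)]    # ability grid, -4 to 4
    grid_a = [0.5 + step * i for i in range(31)]    # candidate slopes, 0.5 to 2.0
    grid_b = [-1.0 + step * i for i in range(41)]   # candidate intercepts, -1 to 1
    return min(
        ((A, B) for A in grid_a for B in grid_b),
        key=lambda ab: stocking_lord_loss(ab[0], ab[1], base_items, new_items, thetas),
    )
```

Under option 1, a fit of this kind would be run three times (clusters 2, 3, and 4 onto cluster 1); under option 2, four times (each cluster onto the across-cluster anchor calibration). The resulting A and B would then be applied to all item parameter estimates in the cluster being transformed.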

The across-cluster PARSCALE run in option 2 is the same as the first stage of the two-stage calibration and linking procedure proposed by the contractor. The four within-cluster PARSCALE runs in options 1 and 2 are similar to stage two of the two-stage calibration and linking procedure, with the exception that the item parameters for the anchor items are estimated rather than fixed. An advantage of the Stocking-Lord TCC procedure over the two-stage calibration and linking procedure is that the multiple sets of parameter estimates for the anchor forms can be used to provide a check on model fit. Consequently, we suggest that the contractor select the calibration procedure that is best suited to the final data collection design, is compatible with software limitations, and permits item-fit analyses. To further assess the degree to which VNT and NAEP measure similar areas of knowledge, calibrations for VNT items could be performed and compared both with and without the NAEP items.

RECOMMENDATION 4.1 Pilot test plans should include school clusters, overlapping (hybrid) forms design, and NAEP anchor forms, as currently planned. In addition, the contractor should select the calibration procedure that is best suited to the final data collection design and in accord with software limitations and should plan to conduct item-fit analyses.

Forms Assembly and Item Survival Rates

A key question addressed in the Phase I report was whether the number of items to be pilot tested is large enough to enable the assembly of six high-quality forms. The primary purpose of the pilot test is to use statistical information to identify "flawed" items. Flawed items are set aside for further revision or dropped altogether from further consideration. Items that are not flagged as flawed are said to "survive" the pilot test phase.

In mathematics, 18 forms of VNT items will be pilot tested in order to support assembly of 6 operational forms. The minimum survival rate for items in each mathematics content and format category is one-third. The average survival rate, however, must be quite a bit higher. The developers need a pool that is larger than six forms of items to draw from in order to select items for each form that meet difficulty distribution and information targets, as well as content and format requirements. On the basis of experience with other programs and because of the extensive editorial review of the VNT items prior to pilot testing, we believe it is reasonable to expect at least a two-thirds survival rate for the mathematics items, yielding two items for every one needed in the final forms.

For reading, the situation is a bit more complex. A total of 24 forms of reading items will be included in the pilot test, and each reading passage will be included twice with distinct sets of items. Overall, there will be 72 passages in the pilot test and 36 passages in the six operational forms constructed for the field test. Thus, the minimum survival rate for each purpose and length category of reading passage is 50 percent. For each type of passage, there must be a specified number of questions: five multiple-choice questions for short passages (literary or informational), and seven multiple-choice, two short constructed-response, and one extended constructed-response question for each long literary passage. Medium information passages have been developed in pairs: for each pair, there must be a set of three multiple-choice intertextual items asking questions that draw on material from both passages in the pair. Each passage or passage pair is being tried out with two independent sets of items. For the passage to be included in an operational form, at least 50 percent of the items, for each format type, must survive pilot test screening.
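The survival-rate arithmetic for mathematics and reading can be checked with a short sketch. The function names are hypothetical; the counts are the ones given in the text (18 mathematics forms piloted for 6 operational forms, and 72 reading passages piloted for 36 operational passages).

```python
from fractions import Fraction


def minimum_survival_rate(needed, piloted):
    """Smallest share of piloted units that must survive to fill the forms."""
    return Fraction(needed, piloted)


def survivors_per_slot(piloted, needed, survival_rate):
    """Surviving units available for each unit needed in the final forms."""
    return Fraction(piloted, needed) * survival_rate


math_minimum = minimum_survival_rate(6, 18)                   # one-third
math_selection = survivors_per_slot(18, 6, Fraction(2, 3))    # two per slot
reading_minimum = minimum_survival_rate(36, 72)               # one-half
```

The gap between the one-third minimum and the two survivors per slot expected at a two-thirds survival rate is the selection margin that lets developers meet difficulty, information, content, and format targets at the same time.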

The contractor has indicated that pilot test forms will be assembled to match the target form specifications to the extent possible. Although the idea of creating pilot test forms that resemble operational forms is reasonable, it implies an equal survival rate for various types of items. This may not be the case, since constructed-response items have had lower survival rates than multiple-choice items in other programs and in NAEP. The materials we reviewed did not specify the expected survival rates for the various item types, nor did they discuss the rationale for determining item production or item survival rates. These issues were discussed at our July meeting with AIR staff, who responded that because the constructed-response items had gone through cognitive laboratory analysis, the survival rate for these items is expected to be as high as or higher than that of the multiple-choice items. Because VNT item development is necessarily unique in many ways, neither the developers nor the committee has an empirical basis for estimating survival rates. We concur with the developer's judgment that the overall number of items to be pilot tested appears reasonable and repeat our hope that further edits of distractor quality will reduce the number of items dropped after pilot testing.

RECOMMENDATION 4.2 Information regarding expected item survival rates from pilot to field test should be stated explicitly, and NAGB should consider pilot testing additional constructed-response items, given the likelihood of greater rates of problems with these types of items than with multiple-choice items.

Pilot Test Analyses

The materials we reviewed at our July meeting included specifications for screening items according to item difficulty levels and item-total correlations based on the pilot test data (American Institutes for Research, 1999i). More recent documents have added plans for screening multiple-choice items for appropriate statistics on each incorrect (distractor) option, as well as for the correct option (American Institutes for Research, 1999l). Plans should be extended to include screening items on the basis of model fit for item response theory.

Differential Item Functioning

Additional specifications are needed for the ways in which items will be screened for differential item functioning (DIF) and the ways DIF results will be used in making decisions about the items.

Differential item functioning refers to the situation that occurs when examinees from different groups have differing probabilities of getting an item correct, after being matched on ability level. For DIF analyses, examinees may be placed into groups according to a variety of characteristics, but often gender and ethnicity are the groupings of interest. Groupings are created so that one is the focal group (e.g., Hispanics) and the other is the referent group (e.g., whites). The total test score is generally used as the ability measure on which to match individuals across groups.

DIF analyses compare the probabilities of getting an item correct for individuals in the focal group with individuals in the referent group who have the same test score. Items for which the probabilities differ significantly (or for which the probabilities differ by a specified amount) are generally "flagged" and examined by judges to evaluate the nature of the differences. Good test development practice calls for symmetric treatment of DIF-flagged items. That is, items are flagged whenever the probability of correct response is significantly lower for one group than for the other after controlling for total test score, whether the lower probability is for the focal group or for the referent group.

A number of statistical procedures are available for use in screening items for DIF. The contractor has proposed to use the Mantel-Haenszel method for the pilot test data and methods based on item response theory for the field test data. The sampling plan will allow for comparisons based on race/ethnicity (African Americans and whites, Hispanics and whites) and gender (girls and boys). The sampling plan calls for oversampling during the pilot test in order to collect data on each item for 200 African Americans, 168 Hispanics, and 400 girls.
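For readers unfamiliar with the Mantel-Haenszel approach, the sketch below computes the common odds ratio across matched score levels and converts it to the delta metric, MH D-DIF = -2.35 ln(alpha). The delta conversion is a widely used reporting convention and is included here as an assumption; the materials reviewed do not state which index or flagging thresholds the contractor will use. The function name and the input layout are hypothetical.

```python
from math import log


def mantel_haenszel_dif(strata):
    """Mantel-Haenszel common odds ratio and its delta-scale transform.

    `strata` holds one tuple per matched score level:
    (ref_correct, ref_incorrect, focal_correct, focal_incorrect).
    Returns (alpha, delta). alpha > 1 (delta < 0) means the item is
    relatively harder for the focal group; alpha < 1 (delta > 0) means
    it is relatively harder for the referent group.
    """
    num = 0.0
    den = 0.0
    for ref_right, ref_wrong, foc_right, foc_wrong in strata:
        n = ref_right + ref_wrong + foc_right + foc_wrong
        if n == 0:
            continue  # an empty score level contributes nothing
        num += ref_right * foc_wrong / n
        den += ref_wrong * foc_right / n
    alpha = num / den  # raises ZeroDivisionError if no stratum is informative
    return alpha, -2.35 * log(alpha)
```

Symmetric treatment of flagged items, as described above, corresponds to flagging on the absolute value of delta rather than on its sign.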

The proposed sample sizes and methods for DIF analyses are acceptable. However, the contractor needs to provide more information about the ways in which DIF data will be analyzed and used. It is important to know what role DIF statistics will play in making judgments about the acceptability of items. Will strict cutoffs based on statistical indices be used to eliminate items? How will human judgment be incorporated into the decision-making process? It is also important to know whether expert reviews for sensitivity are being conducted prior to the pilot testing and whether the experts for such reviews will be consulted in evaluating items flagged for DIF. In addition, what will happen when DIF favors one focal group but disadvantages others (e.g., favors Hispanics but disadvantages African Americans)?

Computation of DIF statistics requires the formulation of a criterion score on which to match individuals. The pilot test form administration plan (which includes the hybrid forms) creates a series of half tests. Half tests are paired to create a full test form, as described above, and items appear in only a single half test. Will the criterion score be formed using the half test in which the item appeared? In this case, the criterion score will be based on smaller numbers of items, which will affect the reliability of the ability estimate. Or will the criterion score be formed using the total score for the two paired half tests so that multiple estimates will exist for each item (e.g., 1a combined with 1b compared with 1a combined with 2b)? Will a two-step process be used that includes refinement of the matching criterion through the elimination of extreme DIF items from the criterion score? What are the plans for dealing with these issues?

RECOMMENDATION 4.3 NAGB and its contractor should continue to detail plans for analyzing the pilot test data. Additional specifications should be provided for assessing the extent to which each item fits the model being used for calibration and the ways in which the results of differential item functioning analyses will be used in making decisions about the items.

ASSEMBLING FIELD TEST FORMS

The primary purpose of the field test is to try out six operational forms for each of the two subject areas. Field test data will yield information on the psychometric properties of each form, provide the basis for equating the different forms to a common scale, provide normative information on student performance on each test, and also provide data for linking VNT scores to the NAEP achievement levels. The committee considered a number of topics related to plans for the field test, including issues in constructing the operational forms, sampling issues, and plans for analyses.

Targets for item difficulty or, more importantly, test information at different score levels need to be set before operational forms can be assembled. Item difficulty targets speak to the expected difficulty of the test forms. The test information function provides estimates of the expected accuracy of a test form at different score levels. Test difficulty and information functions are developed to be in accordance with the intended purposes and uses of a test. Preliminary information, which lays out the main issues, is available from the contractor (American Institutes for Research, 1999g). Figure 4-1, taken from the document, shows four potential test information functions for the VNT.

As stated above, test information functions indicate the level of precision with which each ability level (or score) is estimated. Tests can be designed to maximize the amount of information provided at specific test scores, usually the test scores of most interest (e.g., where a cutpoint occurs or where most students perform). One of the test information functions (line 3) maximizes test information between the proficient and the advanced score levels. This function, in particular, will not be useful, since the majority of the students are likely to be at the basic score level. Another function (line 4) maximizes information across all score levels. The contractor (American Institutes for Research, 1999l) recently recommended use of a function similar to line 4. However, if the VNT is being constructed from NAEP-like items, there may not be sufficient numbers of easy and hard items in the item pools to provide this level of measurement at the extremes. Use of target test information functions in test assembly will facilitate score comparability across forms and ensure that the measurement properties of the test support the intended score uses.

RECOMMENDATION 4.4 A target test information function should be decided on and set. Although accuracy at all levels is important, accuracy at the lower boundaries of the basic and proficient levels appears most critical. Equivalent accuracy at the lower boundary of the advanced level may not be feasible with the current mix of items, and it may not be desirable because the resulting test would be too difficult for most students.

SPECIAL FORMS FOR BELOW-BASIC AND ADVANCED STUDENTS

The committee is concerned with the difficulty of creating a single test form to provide accurate information about students at both the below-basic and advanced levels. Forms that contain a sufficient number of items to provide accurate information at the advanced level create special problems for students struggling to reach the basic level. The difficult items will be frustrating to these students and may have negative effects on their self-confidence. Furthermore, the difficult items will provide little information on the skill levels for students who score below the basic level. The reverse problem is likely to be true for students at the advanced levels, for whom the easier questions will provide little information.

FIGURE 4-1

Hypothetical test information functions. SOURCE: American Institutes for Research (1999:Fig. 1). NOTES: Line 1 represents a NAEP-like TIF, which has the maximum value around cut two (between basic and proficient). Line 2 represents a TIF with maximum value at score level lower than NAEP TIF (around cut one, between below basic and basic). Line 3 represents a TIF with maximum value at score level higher than NAEP TIF (around cut three, between proficient and advanced). Line 4 represents a TIF with approximately equal precision at all score levels.

With NAEP, the focus is on estimating score distributions. Students are not given any feedback on their performance. Questions about which children should be included in the NAEP assessment relate to the effects of inclusion on the validity of the score distribution estimates and not to the effects on individual students. The VNT will be quite different in that individual student scores are the primary focus. For this reason, the committee believes it is important to consider the effects of the test scores on the students and parents who receive them. For students classified at the advanced level, there seems little reason for concern. Their interaction with the items should be itself educational, and the feedback they receive will reinforce the value of the skills they have acquired. As indicated above, however, a large proportion of students will not even achieve the basic level. For them, the experience of confronting a large number of questions that they cannot begin to answer will be frustrating at best.

In reviewing reading passages selected for the VNT, the committee noted that all of the passages were written at the 4th-grade level or higher. These items will provide little information on the reading skills of students below the basic achievement level, students who are most in need of feedback about their reading skills. Similarly, students struggling to learn English as a second language may not be able to demonstrate the reading skills they do have unless some easier texts are included. Neither the committee nor NAGB has sufficient information at this time to evaluate how accurately students at different levels will be classified on the basis of their responses to different samples of VNT items. However, it is important that NAGB now consider issues associated with special forms for students judged likely to be below basic, and possibly also for advanced students, so that an informed decision can be made when pilot test data become available. For these reasons, the committee believes it would be desirable to explore the creation of special forms of the VNT for use with students who are likely to find the regular VNT form too difficult. An easier form might also serve as an important accommodation for students with some types of learning disabilities and, especially in the case of the 4th-grade reading test, for students with limited English proficiency. It might also be desirable to consider a special "advanced" form to provide more accurate distinctions at the advanced level.

The need for better information on students below the basic level is a very significant issue for the VNT. As measured by NAEP, in 1998, 38 percent of children in public schools were below the basic level in reading at grade 4; similarly, in 1996, 39 percent were below the basic level in mathematics at grade 8. Among specific populations within the United States, the numbers are much larger: 64 percent of African American students, 60 percent of Hispanic students, and 53 percent of Native American students were below the basic achievement level at grade 4 in reading; also, 45 percent of students in central-city schools and 58 percent of students eligible for free or reduced-price lunches were below the basic level in reading at grade 4 (Donahue et al., 1999). In mathematics, the problem for specific groups was even more dramatic, with 73 percent of African American students, 63 percent of Hispanic students, and 50 percent of Native American students scoring below the basic level in the 1996 grade 8 mathematics assessment; in addition, 53 percent of the students in central-city schools and 72 percent of the Title I participants scored below the basic level (Shaughnessy et al., 1997:93).

In order to promote better performance among the nation's less skilled readers, students, parents, and teachers need high-quality information about their performance. At the moment, many children at grade 4 may not be able to productively participate in the VNT because they will not be able to read the relatively challenging passages on the test. This is especially pertinent for limited-English-proficient students or others with language-based difficulties and can lead to the erroneous conclusion that these children are unable to think about texts or are unable to read anything. In short, there could be no useful information about what they know and can do, a terrible disservice to students who need special attention. Similarly, teachers will not receive information about what these students know and can do, limiting their ability to use information from the VNT to help students and their families.

Unless students and their parents can expect high-quality information from the test, they may question why instructional time should be used for it.

The committee recognizes that there would be added cost and complexity to develop and scale forms of alternate difficulty and to administer such forms appropriately. In addition, explaining the differences between the test forms to test users will be challenging. It may be difficult for users to understand that students taking an easier form are not being held to different standards: neither a lower standard, because they are allowed to take an easier form, nor a higher standard, because they would need a higher percentage of items correct to be classified basic or proficient. Interpretive material would be needed to help students, teachers, and parents understand the information reported and released from the tests (test questions, answers to test questions, percent correct scores, etc.), given that the information would be based on tests of different difficulty levels for different students. These issues commonly arise with all forms of adaptive testing, however, and while complex, have been explained to the public in connection with other testing programs. We recommend, therefore, that exploration of this concept focus initially on the feasibility of creating an easier form to provide more information on students who score below the basic category. Further consideration of the potential benefits and problems associated with multiple-level forms could also be added to ongoing efforts to evaluate alternative reporting options.
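The point about percent correct can be illustrated with a small numerical sketch. It assumes a Rasch (one-parameter) item response model, which the report does not specify, and every item difficulty and the cut score below are hypothetical values chosen only for illustration. The same ability cut score is applied to a regular form and to an easier form, and the expected percent correct at that cut is computed for each.

```python
import math


def rasch_p(theta, b):
    """Probability of a correct response under the Rasch model (an assumption here)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def expected_percent_correct(theta, difficulties):
    """Expected percent correct at ability theta on a form with these item difficulties."""
    total = sum(rasch_p(theta, b) for b in difficulties)
    return 100.0 * total / len(difficulties)


# Hypothetical values: one "basic" cut on a common ability scale, two forms.
basic_cut = -0.5
regular_form = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
easy_form = [b - 1.0 for b in regular_form]  # every item one unit easier

regular_pct = expected_percent_correct(basic_cut, regular_form)
easy_pct = expected_percent_correct(basic_cut, easy_form)

print(f"Percent correct at the basic cut, regular form: {regular_pct:.1f}")
print(f"Percent correct at the basic cut, easy form:    {easy_pct:.1f}")
```

Under these assumptions the standard itself (the ability cut) is the same on both forms; only the percent correct needed to reach it differs, and it is higher on the easier form.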

Creation of an easy form (in place of one of the current six forms) would require review of item development plans to ensure an adequate supply of easier items in each content and format area. For reading, the issue is not just creating easier items, but also including passages with significantly lower reading levels (i.e., well below the average 4th-grade level). Given current time constraints, we realize that the creation of special easy (or difficult) forms would most likely have to be planned as an addition to the VNT in subsequent years. Yet the potential value of such forms warrants an early consideration of feasibility and potential advantages and disadvantages.

RECOMMENDATION 4.5 NAGB should consider plans for development of an alternate form of the VNT targeted to students at the low end of the achievement scale.

LINKING VNT SCORES TO NAEP ACHIEVEMENT LEVELS

The primary purpose of a link between VNT and NAEP is to enable students to compare their performance with national standards. The content specifications for the VNT are based on the NAEP frameworks, leading to a not unreasonable expectation that scores on the VNT should be able to be compared with the NAEP standards for student achievement on the content covered by these frameworks.

Linking scores from various tests was the subject of an NRC study conducted in 1998. The Committee on Equivalency and Linkage considered the feasibility of developing a scale to compare, or link, scores from existing commercial and state tests to each other and to NAEP. The committee's conclusions were generally negative (National Research Council, 1999e:4-5):

Comparing the full array of currently administered commercial and state achievement tests to one another, through the development of a single equivalency or linking scale, is not feasible.
Reporting individual student scores from the full array of state and commercial achievement tests on the NAEP scale and transforming individual scores on these various tests and assessments into the NAEP achievement levels are not feasible.
Under limited conditions it may be possible to calculate a linkage between two tests, but multiple factors affect the validity of inferences drawn from the linked scores. These factors include the content, format, and margins of error of the tests; the intended and actual uses of the tests; and the consequences attached to the results of the tests. When tests differ on any of these factors, some limited interpretations of the linked results may be defensible while others would not.

Links between most existing tests and NAEP, for the purposes of reporting individual students' scores on the NAEP scale and in terms of the NAEP achievement levels, will be problematic. Unless the test to be linked to NAEP is very similar to NAEP in content, format, and uses, the resulting linkage is likely to be unstable and potentially misleading.

The VNT is designed to be as similar to NAEP as possible, and so some of the problems described in the NRC linkage report may be ameliorated. A subsequent study was done by the Linking Feasibility Team (LFT), commissioned by NAGB to investigate the feasibility of linking scores on the proposed VNT and NAEP. The LFT recommended linking through the method of calibration, if the following assumptions are met (Cizek et al., 1999:iii): "1) the VNT and NAEP measure the same constructs as established by content review; 2) the VNT and NAEP measure the NAEP constructs as established empirically; and 3) the content of the VNT can support NAEP achievement levels descriptions." The report states that if these requirements are met, VNT scores could be interpreted directly as estimates of NAEP scores, and NAEP achievement-level descriptions could be used to help interpret VNT achievement-level estimates. The authors conclude that calibration is the only methodology that would lead to such direct interpretation. If these requirements are not met, they recommend the use of social moderation, a judgmental procedure.¹ New labels for achievement levels would be needed and new achievement-level descriptions would be created.

As detailed in the reports from the preceding efforts (National Research Council, 1999e; Cizek et al., 1999), the relationship between VNT and NAEP will be less than perfect for a number of diverse reasons:

Even though the VNT is based on the NAEP frameworks, the VNT is not being built to exactly mirror the content and statistical specifications of NAEP. For example, the VNT contains a smaller proportion of constructed-response questions in order to make it more efficient to administer and score.
The VNT is being designed to provide more information than NAEP at the basic and "below basic" achievement levels because VNT scores will be reported for individuals, and a large number of students perform below the basic level.
VNT scores will be observed scores for individual test takers and, as such, will contain measurement error.
Different students take different subsets of items during a NAEP assessment, so NAEP scores do not exist, and are not reported, for individual test takers. Instead, student data are used to estimate true score distributions for various population subgroups.
The administration conditions for the two tests will differ. For example, VNT administration may not be centrally monitored in the same way as NAEP. Also, students may be allowed to use their own calculators on the mathematics test instead of using standard calculators, as is done with NAEP.
Because the VNT will report scores to students, parents, and teachers, there is the potential for much higher motivation for students taking the VNT.

¹ Linkages between the two measures can be obtained by matching distributions of scores and deriving a score-to-score correspondence, using procedures like those used for equating except that no presumption is made that the two tests measure the same variable. Such procedures are called statistical moderation. Linkages can also be established when score distribution matching is judgmentally derived; such procedures are referred to as social moderation (Cizek et al., 1999).

The committee agrees with the findings of the LFT (Cizek et al., 1999) and others that, because of the differences in test specifications, scoring, and administration conditions, it will not be possible to create a strong link between NAEP and VNT scores, regardless of the linking procedure used. As a result, it will be inappropriate for educators and policy makers to make inferences about a group's performance on NAEP based on data from the VNT.

The NAEP-VNT linkage is different from many of the examples of failed linkages (National Research Council, 1999e) in the high degree of content similarity between the two exams. There is no precise measure of content similarity or its effect on linkages, and the LFT identified a number of ways in which the format and content of the VNT will differ from NAEP. So far, we have only expert judgment on the extent to which prior research will generalize to VNT-NAEP linkage efforts and on the amount of error, initially and over time, that will be introduced by the differences in administration and use between the tests.

Although it may not be possible to create a linkage between NAEP and VNT that permits direct inferences from one to the other, it may be possible to establish a link that supports other types of inferences. For example, suppose a linkage could be created that permits inferences from student performance on the VNT to student performance on NAEP when NAEP is given under VNT conditions.

To create this type of linkage, a short form of NAEP (representative of the content and statistical specifications for NAEP) would be constructed and spiraled along with the VNT under VNT conditions (e.g., two 45-minute sections, students provide their own calculators). The linkage between the VNT and the NAEP-like VNT (or short form of NAEP) could then be accomplished through a simultaneous IRT scaling of the VNT and short-form NAEP items, through separate IRT scalings followed by a Stocking-Lord test characteristic function transformation, or through an equipercentile matching of the raw score distributions.
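Of the three approaches named above, equipercentile matching is the simplest to sketch. The example below is a minimal, unsmoothed illustration and not the procedure any contractor would actually use: operational equating typically smooths or continuizes the score distributions first. The function names and both score samples are hypothetical.

```python
from bisect import bisect_right


def percentile_rank(scores, x):
    """Percent of scores at or below x (a simple, unsmoothed definition)."""
    ordered = sorted(scores)
    return 100.0 * bisect_right(ordered, x) / len(ordered)


def equipercentile_link(x, from_scores, to_scores):
    """Map score x on the 'from' test to the lowest 'to' score whose
    percentile rank is at least the percentile rank of x."""
    target_rank = percentile_rank(from_scores, x)
    ordered_to = sorted(to_scores)
    for y in ordered_to:
        if percentile_rank(ordered_to, y) >= target_rank:
            return y
    return ordered_to[-1]


# Hypothetical raw scores from two randomly equivalent (spiraled) samples.
vnt_raw = [10, 12, 15, 18, 20, 22, 25, 28, 30, 35]
short_naep_raw = [20, 24, 30, 36, 40, 44, 50, 56, 60, 70]

linked = equipercentile_link(18, vnt_raw, short_naep_raw)
print(f"VNT raw score 18 corresponds to short-form NAEP raw score {linked}")
```

Because the forms are spiraled, the two samples can be treated as randomly equivalent, which is what justifies matching their score distributions directly.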

Achievement-level cutpoints on the short-form NAEP scale could be obtained by using the judgmental proportions correct from the achievement-level setting used for the main NAEP. These Angoff proportions would be projected onto the short-form NAEP scale using the same procedure as used by ACT to project the Angoff proportions on the main NAEP scale. These cutpoints would then be projected to the VNT scale by using the linkage established between VNT and NAEP given under VNT administration conditions.
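The report does not describe the ACT projection procedure, so the following is only a generic sketch of one common way to place judgmental proportions on an IRT scale: sum the Angoff proportions to get an expected raw score at the cut, then find the ability value at which the test characteristic curve equals that sum. The two-parameter logistic model, the item parameters, and the Angoff ratings are all hypothetical assumptions.

```python
import math


def p_2pl(theta, a, b):
    """Two-parameter logistic item response function (assumed model)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def tcc(theta, items):
    """Test characteristic curve: expected raw score at ability theta."""
    return sum(p_2pl(theta, a, b) for a, b in items)


def theta_for_expected_score(target, items, lo=-6.0, hi=6.0, tol=1e-6):
    """Solve tcc(theta) = target by bisection.

    The curve is increasing in theta as long as every discrimination a > 0.
    """
    if not tcc(lo, items) <= target <= tcc(hi, items):
        raise ValueError("target expected score is outside the searchable range")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if tcc(mid, items) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical short-form items as (discrimination a, difficulty b) pairs.
items = [(1.0, -1.0), (1.2, -0.3), (0.8, 0.4), (1.1, 1.0)]
# Hypothetical Angoff proportions for a borderline "basic" student, one per item.
angoff_basic = [0.60, 0.50, 0.40, 0.35]

cut_theta = theta_for_expected_score(sum(angoff_basic), items)
print(f"Basic cutpoint on the ability scale: {cut_theta:.3f}")
```

A cutpoint found this way on the short-form NAEP scale could then be carried to the VNT scale through whatever linkage function has been established between the two.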

Item parameters computed for the short-form NAEP items administered under main NAEP conditions could be compared with those computed from data collected under VNT conditions in order to check the comparability of the achievement-level cutpoints between the main NAEP and the short-form NAEP. Other checks on the quality of the linkage between VNT and short-form NAEP would include the stability of the linkage function for a variety of subgroups (e.g., males and females) and similarity in the shapes of the proficiency distributions for the two types of tests.
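The subgroup stability check can be sketched as follows. For simplicity the example uses a linear (mean and standard deviation matching) link rather than any of the methods named earlier; that substitution, the function names, and all score values are assumptions made for illustration. A link is computed separately within each subgroup, and the largest gap between the two linking functions over a grid of scores is reported.

```python
from statistics import mean, pstdev


def linear_link(from_scores, to_scores):
    """Return a function mapping 'from' scores to the 'to' scale by
    matching the mean and standard deviation of the two distributions."""
    m_from, s_from = mean(from_scores), pstdev(from_scores)
    m_to, s_to = mean(to_scores), pstdev(to_scores)
    return lambda x: m_to + (s_to / s_from) * (x - m_from)


def max_link_difference(group_a, group_b, grid):
    """Largest absolute gap between two subgroup linking functions on a score grid.

    Each group is a (from_scores, to_scores) pair.
    """
    link_a = linear_link(*group_a)
    link_b = linear_link(*group_b)
    return max(abs(link_a(x) - link_b(x)) for x in grid)


# Hypothetical subgroup data: (VNT scores, short-form NAEP scores).
group_a = ([10, 20, 30], [25, 45, 65])
group_b = ([12, 18, 36], [29, 41, 77])

gap = max_link_difference(group_a, group_b, grid=range(0, 41))
print(f"Largest subgroup difference in linked scores: {gap:.6f}")
```

A gap near zero, as in this constructed example, would suggest the linkage is stable across the two subgroups; a large gap would be a warning that a single linking function does not serve both groups equally well.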

The above plan is very similar to one proposed by Johnson (1999a). It differs only in that the test to which the VNT is being linked is built from NAEP items to be as representative of NAEP as possible, rather than to be as representative of the VNT as possible. This approach may make the observed linkage slightly less stable because of potentially greater content differences between the tests being linked. But it does strengthen the extent to which the resulting linkage can be interpreted in terms of student performance on NAEP when NAEP is given under VNT conditions. Furthermore, it increases the perceived face validity of the linkage.

We believe that the above linkage plan should be tried out during the pilot test. At a minimum, it would require the development of a two-section short-form NAEP. The short-form NAEP would be spiraled with the VNT forms in the pilot test in at least one school cluster. In addition, hybrid forms composed of one VNT section and one NAEP section could be included to reduce the need to assume equivalence of samples within a school cluster. Ideally, more than one short-form NAEP would be developed and included in the pilot test, which would enable the resulting linkage to take account of form-to-form differences.

Because the main NAEP, the short-form NAEP, and the VNT would each be based on the same content frameworks, use similar item types, and be administered to the same population, we recommend that the linkage be based on an empirical procedure rather than on social moderation. We would only recommend the use of social moderation if sample sizes were too small to support an empirical linkage.

Investigation of the linkage during the pilot test would provide valuable information that could be used to refine the actual linkage plan to be carried out during the field test. It would also provide useful information about the anticipated quality of the linkage. Such information would be helpful in preparing guidelines for score use, score reports, and score interpretative materials, including those related to score aggregation. It would also expedite the actual score reporting process following the field test.

RECOMMENDATION 4.6 Plans for the VNT pilot test should include efforts to gather empirical data on the effects of content, administration, and use differences between the VNT and NAEP on the feasibility of linking VNT scores to the NAEP score scale. Specifically, a NAEP-like form (e.g., two non-overlapping booklets from recent 4th-grade reading and 8th-grade mathematics assessments) should be included to allow for an assessment of the effect of content differences and administration differences on the linkage of VNT scores to the NAEP scale.