VNT Pilot and Field Test Plans

After item development is completed, items will be assembled into tryout forms for pilot testing. The pilot test will provide empirical evidence on item quality that will be used in screening the Voluntary National Tests (VNT) items. Items that survive this screening will be assembled into six operational forms for the field test. Data on the statistical characteristics of these forms will be collected in the field test, providing the basis for equating the forms to each other (placing their scores on a common scale) and linking the scores from the forms to the National Assessment of Educational Progress (NAEP) scale and NAEP achievement levels. In addition, the field test will provide an important test of operational test administration procedures and provide a basis for linking VNT scores to the scale used to report 8th-grade mathematics results from the Third International Mathematics and Science Study (TIMSS).

Pilot Test Plans.

Our second workshop, in April 1998, reviewed plans for conducting a pilot test of VNT items and plans for subsequent field test of VNT test forms (see Appendix B for the list of participants). Four documents from AIR were among the materials reviewed:

  (1) Linking the Voluntary National Tests with NAEP and TIMSS: Design and Analysis Plans (February 20, 1998)

  (2) Designs and Equating Plan for the 2000 Field Test (April 9, 1998)

  (3) Designs and Item Calibration Plan for the 1999 Pilot Test (April 24, 1998)

  (4) Sample Design Plan for the 1999 Pilot Test (April 28, 1998)

The quality of the VNT items selected for inclusion in operational forms and the accuracy with which those forms are pre-equated depend very heavily on the effectiveness of the pilot test plans. This chapter presents our findings and conclusions, based on the workshop review and our own review of subsequent documents.

Number of Items

Plans call for trying out 2,160 items in the pilot test, split evenly between 4th-grade reading and 8th-grade mathematics. For mathematics, a total of 360 items will be included in the six operational forms built for the field test. The number of mathematics items to be piloted, 1,080, is three times this number. Experience with similar programs (e.g., the Armed Services Vocational Aptitude Battery) suggests that, even with careful editing prior to the pilot test, as many as one-third of the items piloted may be flagged for revision or dropping based on item statistics from the pilot test (low item-total correlations, differential item functioning, out-of-range difficulties, positive item-total correlations for an incorrect option, high omit rates, etc.). The screening criteria for these decisions have not yet been specified. Many flagged items can be revised, but a subsequent pilot would be required to calibrate the revised items prior to selection for an operational form. However, even if the screening rate is twice as high (two-thirds) for some categories of items, the surviving one-third would be sufficient to construct six forms meeting the test specifications.

While the number of items piloted appears to be comfortably large, it does not appear to be inappropriately large. If the item survival rate is, indeed, two-thirds, there will be twice as many acceptable items as needed for the first set of operational forms. This number will allow items to be selected from each content and format category so as to create forms of similar difficulty, both overall and within each content and format category. Remaining items can be held for use in future forms, so an overage would not waste effort.
(Note that item development for subsequent forms should be targeted to fill in specific holes in the content, format, and difficulty distributions of the acceptable items not used in the first set of forms.)

For reading, the situation is a bit more complex because items are grouped in sets associated with particular passages. The survival rate of whole passages must be considered, along with the survival rate of individual items. Overall, 72 passages will be piloted, and one-half of this number will be needed to assemble six operational forms. Each passage will be piloted twice with separate item sets: for a passage to have enough items to be used operationally, an appropriately distributed 50 percent item survival rate will be required.

The survival rate of multiple-choice items should not present a problem. Suppose a passage requires about 6 acceptable multiple-choice items of the 12 being tried out. If the survival rate is two-thirds (and failure probabilities are independent across items within a set), the probability of at least six items surviving is above 93 percent. For constructed-response items, the survival demands are a little higher. Many passages will require a single constructed-response item from two being developed for the pilot test. If the survival rate is two-thirds for each item, the chances of both items failing would be one-ninth, which is a passage survival rate of 89 percent. Under these assumptions, the chance of survival of both a sufficient number of multiple-choice and constructed-response items is 0.93 × 0.89, or about 83 percent, well above the 50 percent survival rate required for a sufficient number of passages for the operational forms.

We note, however, that there is no very good basis for estimating the survival rate for constructed-response items. Furthermore, the passage survival rate must be at least 50 percent for each of the five passage types (short, medium, and long literary passages and short and medium information passages).
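
The passage survival arithmetic above can be reproduced with a short calculation. The following is a minimal sketch, assuming (as the text does) a two-thirds survival rate for every item and independence across items; the function and variable names are illustrative and do not come from the pilot test plan.

```python
from math import comb


def prob_at_least(k_min, n, p):
    """Return P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


ITEM_SURVIVAL = 2 / 3  # survival rate assumed in the text, not an estimate

# At least 6 of the 12 multiple-choice items tried out must survive.
p_mc = prob_at_least(6, 12, ITEM_SURVIVAL)

# At least 1 of the 2 constructed-response items must survive.
p_cr = prob_at_least(1, 2, ITEM_SURVIVAL)

# A passage is usable only if both requirements are met (independence assumed).
p_passage = p_mc * p_cr

print(f"multiple-choice requirement met:      {p_mc:.3f}")
print(f"constructed-response requirement met: {p_cr:.3f}")
print(f"passage usable:                       {p_passage:.3f}")
```

Under these assumptions the three probabilities come out at roughly 93, 89, and 83 percent, matching the figures in the text. Changing `ITEM_SURVIVAL` shows how quickly the passage survival rate falls if constructed-response items fare worse than assumed.
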
Even with the planned overage of passages, there is a significant possibility that an insufficient number of passages will survive screening rules in one or more of the passage type categories. Thus, it would be prudent to consider fallback options if this should occur. Such options might include relaxing the screening criteria (which have not yet been specifically stated) or allowing some modification of items or scoring rubrics between pilot and field tests. The latter option would reduce the precision with which forms are pre-equated with respect to difficulty and test information. This option could still be chosen if it is judged that the differences that do occur could be offset through equating adjustments.

Sample Size and Sampling Plan

The proposed plan calls for samples of 24,000 4th-grade students in 558 schools to participate in the pilot test of VNT reading items and 19,200 8th-grade students in 344 schools to participate in the pilot test of VNT mathematics items. The sample sizes were set to provide for response data from at least 800 students for each pilot test form. Two reading and two mathematics forms will each be administered to three such samples to permit linking across clusters of students, as described below. There will be a modest oversampling of schools with higher proportions of minority students so that approximately 150 Hispanic and 200 African American students will complete each test booklet.

While the current pilot test plan does not explicitly list the item statistics to be estimated for each item, common practice is to focus on the following: proportion passing (or proportion at each score level for items scored on a three- or five-point scale); item-total correlations; the frequency with which each distractor option is selected (for multiple-choice items); the proportion of examinees who do not respond to or do not reach an item; and differences in passing rates or mean scores for students from different demographic groups who are at the same level of overall ability (as estimated by the other items in the test). The demographic groups that are usually specified include females, African Americans, and Hispanics.
In addition to these “conventional” statistics, item response theory (IRT) parameters will be estimated for each item for use in pre-equating alternative forms and for estimating test information functions that give the expected accuracy of a form at different score levels.¹

A simple random sample of 800 students would lead to 95 percent confidence bounds for proportions of less than .035. Even allowing for a modest design effect due to the use of two-stage (geographic areas and then schools within sampled areas) rather than simple random sampling, the confidence bounds will still be less than .05. At this level of accuracy, there should be no problem in distinguishing relatively difficult items (passing proportions in the .3 to .4 range) from relatively easy items (passing proportions above .8). Similarly, with a sample size of 800, the standard error of a correlation would be about .035. This should be perfectly adequate for distinguishing items with acceptable item-total correlations (generally above .2 or .3) from items that do not correlate with what the rest of the test is measuring (generally, correlations that are zero or negative).

There are, of course, a very large number of items to be screened, requiring a large number of different statistical tests. Some items, near the cutoff for a screening decision, may be misclassified even with very large samples. Once a plan for specific screening decisions has been enunciated (e.g., eliminating items with unacceptable validity or differential item functioning [DIF] values), a more complete power analysis should be performed. The key point is that relatively small classification errors (e.g., accepting an item that is actually slightly below a cutoff) are not a major problem: normal item selection procedures avoid items that are near the boundary of a screening decision.
Furthermore, since overall test difficulty and accuracy are related to averages of item statistics across all items in a test, small errors in statistics for individual items will tend to average out.

¹ Item response theory is a statistical model that calculates the probability each student will get a particular item correct as a function of underlying ability; for further discussion of IRT modeling, see Lord and Novick (1968).
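
The precision figures quoted in this section for samples of 800 follow from standard large-sample formulas. The sketch below is illustrative only: the design effect of 2.0 is a hypothetical value chosen to show that the bound stays under .05, since the plan does not state one, and the correlation formula is the usual approximation for a true correlation near zero.

```python
from math import sqrt

N = 800      # students per pilot test form, from the plan
Z_95 = 1.96  # normal quantile for a two-sided 95 percent interval


def half_width(p, n, design_effect=1.0):
    """Half-width of a 95 percent confidence interval for a proportion."""
    return Z_95 * sqrt(design_effect * p * (1 - p) / n)


def se_correlation(r, n):
    """Approximate standard error of a sample correlation."""
    return (1 - r**2) / sqrt(n - 1)


# The interval is widest at p = 0.5, so this is the worst case.
srs_bound = half_width(0.5, N)

# Hypothetical design effect for two-stage sampling (an assumption).
clustered_bound = half_width(0.5, N, design_effect=2.0)

# Standard error of an item-total correlation when the true value is near zero.
se_r = se_correlation(0.0, N)

print(f"simple random sample bound: {srs_bound:.4f}")
print(f"bound with design effect 2: {clustered_bound:.4f}")
print(f"SE of a correlation:        {se_r:.4f}")
```

With these inputs the simple random sample bound is just under .035, the bound with the assumed design effect is about .049, and the standard error of a correlation is about .035.
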

The adequacy of the sample size for proposed analyses of DIF by demographic group is supported by prior research. For future revisions of this plan, however, we would welcome a more specific description of the size of differences that should be detected and the power of the proposed samples to detect those differences. Nonetheless, the proposed sample sizes for the targeted demographic groups are quite consistent with common practice, and we do not question them.

The chief focus of the analyses as described in the pilot test plan is “calibration” rather than screening. At the item level, calibration can mean simply estimating an item's difficulty. The more common meaning used here involves estimating parameters of item characteristic curves that predict the percent passing as a function of underlying ability level. The plan proposes using a computer program developed by the Educational Testing Service (ETS) called PARSCALE. This is the program that is used to produce item parameter estimates (and also student ability estimates) for NAEP. The plan does not go into detail on the estimation option(s) to be used with this program.

The uses of the item parameter estimates do not have major consequences. Estimates from the pilot test will not provide the basis for normative information to be reported to examinees, nor for the final equating of scores from alternative forms, each of which would be a significant use. Rather, item parameter estimates from the pilot test will be used to support construction of forms that are roughly equal in difficulty and accuracy so that form calibrations based on subsequent field test results will be feasible.

In the workshop review of the sampling plan, some concern was expressed about the possible underrepresentation of students from small rural schools and possibly also from private schools, where the number of students in the target grades was below the target for the number of students tested per school.
(An average of about 42 4th-grade students would be tested from each school selected for the reading pilot and an average of about 56 8th-grade students would be tested from each school selected in the mathematics pilot.) We understand that this concern will be resolved or clarified in a subsequent revision of the sampling plan. In any event, this is a relatively minor concern for the pilot test, in which no attempt is being made to develop test norms or to equate alternative forms.

Plans for Pilot Test Form Design and Assignment

Pilot test plans call for the assembly of 18 distinct forms of mathematics items and 24 distinct forms of reading items. (We presume that each reading passage will be used in two different forms, with different sets of questions for each use.) The plan calls for the creation of a number of “hybrid” forms (22 in reading and 28 in mathematics) that consist of the first half (45-minute session) of one form paired with the second half of another form. Each form will resemble an operational form insofar as possible with respect to length and administration time, distribution of items by content and format, and distribution of items with respect to other factors (such as calculator use).

To reduce risks associated with compromise of test security, the number of different items administered in any one school will be limited to one-third of the total set of pilot test items. Schools will be divided into four clusters. Two forms will be administered in all four clusters to provide a basis for linking the performance scales developed for each cluster of schools and forms. The remaining forms will be assigned to only one of the four school clusters. Hybrid forms will be similarly assigned to specific clusters of schools. Within each school in a specific school cluster, a total of six intact and six hybrid mathematics or eight intact and eight hybrid reading forms will be distributed to students in a spiraled fashion.
Spiraling is an approach to form distribution in which one copy of each different form, from first to last, is handed out before “spiraling down” to a second copy of each form and then a third and so forth. The goals of this approach are to achieve essentially random assignment of students to forms while ensuring that an essentially equal number of students complete each form.

The current design involves a wide range of assumptions about time requirements, student endurance, and other aspects of test administration. The pilot test affords an opportunity to test some of these assumptions at the same time that data for item screening and calibration are collected. In addition, trying out items in forms that differ from their operational use (in length or content context) may introduce an additional source of error in parameter estimates.

An earlier version of the pilot test plan included an option for schools to participate on a more limited basis, with each student taking only half of a complete form (e.g., only one 45-minute testing session). We hope that this option is no longer being considered, as it would create several problems. For example, the proposed DIF analyses require sorting students into groups by overall level of performance and comparing the passing rates within groups. We question the accuracy with which students who take only half a form would be assigned to these groups. But, if students who take only half a test are excluded from the DIF analyses, the sample sizes might not be adequate to support such analyses. These issues would have to be addressed if the “half-test” option is reconsidered.

The plan for limiting item exposure to specific clusters of schools introduces significant complexity into the pilot test design. Such complexity appears warranted because, even if the likelihood of test item compromise is not high, the consequences would be very large. It would be useful to know what other measures are planned to ensure test security. Will each booklet have a unique litho code so that missing booklets can be identified and tracked down? How will each test session be monitored? How will materials be shipped and stored?
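
The spiraled distribution of forms described earlier in this section can be illustrated with a short sketch. It is not the contractor's procedure: the form labels, student identifiers, and school size are hypothetical, and the code shows only the basic rotation within a single school.

```python
from collections import Counter
from itertools import cycle


def spiral_assign(student_ids, form_ids):
    """Hand out forms in a fixed rotation: one copy of each form, then repeat."""
    return dict(zip(student_ids, cycle(form_ids)))


# Hypothetical school: 50 students receiving 12 mathematics forms
# (six intact and six hybrid, as in the plan for one school cluster).
forms = [f"M{i:02d}" for i in range(1, 13)]
students = [f"S{i:03d}" for i in range(1, 51)]

assignment = spiral_assign(students, forms)
counts = Counter(assignment.values())

print(assignment["S001"], assignment["S013"])  # both receive the first form
print(max(counts.values()) - min(counts.values()))  # spread in students per form
```

The rotation itself is deterministic. It approximates random assignment only to the extent that the order in which students receive booklets is unrelated to their performance, and it keeps the number of students per form within one of each other inside a school.
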
The stated reason for having hybrid forms is to explore “susceptibility to context effects” and reduce “the impact of locally dependent items in the pilot test calibration” (American Institutes for Research, Designs and Item Calibration Plan for the 1999 Pilot Test, July 24, 1998). Since each half-form (45-minute testing session) remains intact, the potential for introducing context effects or local item dependence appears minimal. In reading, the plan is for a particular passage type to always appear within the same session and in the same position. In this case, item position effects are not an issue. For mathematics, item positions are not fixed, so items used in one position in the pilot test could be used in a quite different position in an operational form. A more effective design for addressing this issue would be to create alternate versions of each mathematics form with the same items presented in reverse order.

We believe that the reason for using hybrid forms, while not explicitly stated, is to improve the degree to which parameter estimates for items appearing in different forms can be put on the same scale. If hybrid forms are not used, there is no overlap across forms: the only way to link the parameter estimates for different forms within a given school cluster is through the assumption that random assignment of forms to students eliminates the need for further adjustment. Item calibrations would be performed separately for each form, setting the underlying performance scale to have a mean of 0 and a standard deviation of 1 (or any other desired constants). For random samples of 800, the sampling error for mean performance would be about .035 standard deviations. With spiraling, however, a (nearly) equal number of students take each form within each school, eliminating between-school differences in sampling error.
Consequently, the standard error of differences between the performance means of the samples of students taking different forms would be much less than .035. This level of error seems modest and perfectly adequate, given the intended uses of the pilot test item parameter estimates, particularly in light of other sources of variation (e.g., context effects, differences in student motivation). Under the hybrid form design, only 400 students would complete each distinct form. A key question is whether the error in adjustments to put the 400-student samples on a common scale would be greater than the sampling error that this approach seeks to reduce.

Differences between school clusters might be more significant. The use of an anchor form to identify differences in the performance distribution of students from different school clusters appears prudent. The use of two anchor forms, as suggested by the contractor's Technical Advisory Committee and reflected in the revised plan, appears even more prudent. It would still be reasonable, however, to expect a more explicit discussion of the level of error that could be encountered without such an anchoring plan and the degree to which the use of anchors will reduce this error. Each cluster includes from 86 to 149 different schools; with the careful assignment of schools to clusters, the potential for significant differences in student performance across clusters does not appear to be great.

The use of hybrid forms will introduce a number of problems that are not specifically addressed in the current plans. In conducting DIF analyses, for example, it might be problematic to use an observed total score (with or without the item in question removed) as the basis for conditioning on overall performance. Because students taking hybrid forms take the first half of one test form and the second half of another, it is not possible to calculate total scores for use in analyzing differential item functioning. This form assignment model effectively splits student sample sizes in half and reduces the power to detect significant differences in item functioning. Presumably, students would be sorted on the basis of item-response-theory ability estimates (theta) rather than observed scores. The weighting of item types could vary from one form to another, and the estimation of performance may be poor for some students with discrepant response patterns.
A demonstration of the approach to be used would be appropriate.

The use of hybrid forms also significantly increases the complexity of form distribution within a school. If only intact forms are used, either six or eight different forms would have to be distributed to the (up to) 50 students participating from the school. With hybrid forms, the number of different forms to be distributed would increase to 12 to 16. Random assignment of students to forms works because differences in student performance average out over a large number of students. When there are more forms, there are fewer students per form and thus a somewhat greater degree of sampling error.

Summary and Conclusions

Given the goal of assembling six operational forms from the pilot test item pool and current plans for review and revision of items prior to the pilot test, we find the number of items to be piloted to be appropriate, reflecting relatively conservative assumptions about item survival rates.

The proposed sample size and sampling plan are fully acceptable for meeting the objectives of the pilot test. We believe the sample size to be adequate, even in the absence of some detail on the approach to estimating item parameters.

We strongly endorse the plan to create pilot forms that resemble operational forms to the maximum extent possible. We also endorse the plan to limit item exposure within any one school. We are not convinced that the complexity of the design for hybrid forms is justified by potential gains in statistical accuracy.

Overall, we believe the plans for pilot testing VNT items to be generally sound. The number of items to be piloted and the proposed sample sizes appear entirely appropriate to the goals of creating six operational forms that meet test specifications and are adequately pre-equated. We have raised several questions about details of the design and the plans for analyzing the resulting data. The issues raised are mostly details and do not affect the recruiting of schools, which must begin soon. We believe there is sufficient time for revisions and clarifications to the plan and consideration of any proposed changes by the National Assessment Governing Board (NAGB) at its November 1998 meeting. Any remaining issues could be resolved with ample time to proceed with implementation of the pilot test as planned in March 1999.

Recommendations

4-1. Back-up options should be planned in case the survival rates of items in the pilot test are lower than currently estimated. If pilot test results lead to elimination of a larger than expected number of items, NAGB will have to consider back-up options for constructing and field testing test forms. Alternatives might include: reducing the number of new test forms to be included in the first operational administration; relaxing the item screening criteria to allow greater use of statistically marginal items; finding additional sources of items (e.g., NAEP or commercial item pools); and delaying the field test until further items can be developed and screened. Although we believe that such options are not likely to be needed, planning for unanticipated outcomes is prudent.

4-2. Prior to any large-scale data collection, the discussion of analysis plans for the pilot test should be expanded to provide a more explicit discussion of: (a) the item-level statistics to be estimated from the pilot test data, (b) decision rules for screening out items based on these statistics, (c) how the statistics on surviving items will be used in assembling operational forms, and (d) the rationale for the level of accuracy that will be achieved through the proposed data collection design.

4-3. The plans for hybrid test forms should be dropped, or the rationale for using them should be specified in much more detail and be subject to review by a broad panel of psychometric experts.
On the basis of our understanding of the hybrid forms, we recommend not using them.

4-4. The procedure that will be used to assign items to pilot test forms should be described in more detail. Questions such as the following should be addressed. Will assignment be random within content and format categories or will there be attempts to balance pilot forms with respect to other factors (e.g., the issue of potential “form bias” identified in the subcontractors' item bias reviews)? What sort of review will be conducted to identify potential problems with duplication or cueing (the text of one item giving away the answer to another)?

Field Test Plans

Our review of the field test plans is, necessarily, less extensive than our review of plans for the pilot test. The plans developed to date are preliminary; the current contract schedule calls for revised field test plans to be developed by the contractor and reviewed by NAGB toward the end of fiscal 1999 (September 30). Furthermore, the determinations that NAGB must make about item quality will be based on pilot test data. The field test is designed to collect and evaluate information about whole test forms; further decisions about individual items will not be made on the basis of data from the field test.

Our review of the field test plans is based on the documents provided by NAGB and AIR for review at our April workshop:

  Designs and Equating Plan for the 2000 Field Test, April 9, 1998

  Linking the Voluntary National Tests with NAEP and TIMSS: Design and Analysis Plans, February 20, 1998

We have not received or reviewed any subsequent versions nor are any scheduled for development and review until summer 1999.

The field test plans that we reviewed were similar in content to the pilot test plans. They describe the number of forms that will be fielded, the size of the sample of schools and students to whom each form will be administered, and, in broad terms, the analyses to be performed.

Forms

Plans call for six forms (designated A through F) of each test to be included in the field test. Four forms would be targeted for operational use: an operational form for each of the next 3 years and an anchor form to be used in all 3 years for equating purposes. In addition, a “research form” would be developed for use in future research studies, including checks on the stability of linkage over time. A sample form would also be field tested and equated and then given out in advance of the first operational testing to provide users with an example of an intact form.

The plan calls for the first two forms, the Year 1 form (A) and the equating form (B), to be treated differently from the other forms. They will be administered to larger samples of students (see below) in a separate equating cluster. No reason is given why the Year 1 operational form needs greater precision in equating than the Year 2 and Year 3 operational forms. Plans for the use of the anchor form (B) are sketchy at best and do not include adequate rationale.

Plans also call for field testing hybrid versions of the operational forms. These would be combinations of the first half (testing session) of one form and the second half of a different form.
The use of hybrid forms assumes that appropriate IRT scoring methods will be used to calculate scores from which the achievement-level classifications will be made (with estimates derived using IRT pattern matching).² The hybrid form approach attempts to maximize accuracy in calculating individual item statistics by controlling for differences in the samples of students receiving different forms. The primary purpose of the field test, however, is to examine test score statistics, not item statistics. NAGB and its contractor have not yet publicly decided whether IRT pattern scoring or observed total correct scores should be used as the basis for achievement-level classifications.

² With IRT pattern scoring, the credit a student receives for a particular correct response depends on the examinee's pattern of responses to other related items (see Lord and Novick, 1968).

Sample Size

The current field test plan calls for sample sizes of 1,000 for each intact and 1,000 for each hybrid version of the research and public release forms and the forms targeted for operational use in Years 2 and 3. Considerably larger samples would be used for the Year 1 and anchor forms. For the reading forms, 4,500 students would complete each of the intact and hybrid versions, and twice as many students would complete the mathematics forms.

The proposed sample sizes for equating alternative operational forms are consistent with common practice. Samples of 2,500 test takers per form are used to develop an initial equating of new forms of the Armed Services Vocational Aptitude Battery (ASVAB), for instance. The ASVAB is a relatively high-stakes test, used to qualify applicants for enlistment, so the initial equating is subsequently verified by a re-equating, based on operational samples of about 10,000 applicants per form.

There are two rationales for the larger sample sizes proposed for Forms A and B. First, they may be used to collect normative information, even though plans for norm-based reporting have neither been proposed nor approved. It seems likely that NAGB will want to use data from NAEP as a basis for providing much more extensive normative information. It is also possible that normative information could be constructed from the data collected for all six forms if forms are adequately equated.

The other rationale for large sample sizes for Forms A and B is that they will be used separately to develop the linkage to NAEP achievement levels. The even larger sample sizes proposed for mathematics may be based on the desire to link mathematics scores to the TIMSS scale, as well as to NAEP. It is not clear why this linkage needs to be based on two forms and not either one or all of the forms, presuming an adequate equating of forms.

Equating Cluster Design

The current plan calls for the use of three separate equating clusters so that no more than half of the forms are administered in any one school. The need for test security is incontrovertible. Indeed, it is so strong that the plans for administering the tests must ensure that not even one form is compromised at any location.
So long as such procedures are in place, the added complexity of the equating cluster design may be unnecessary. A far simpler design would be to randomly (through “spiraling”) assign students within each school to the six forms.

Analysis Plans

Preliminary analysis plans were developed by the contractor and presented to NAGB's design and methodology subcommittee at its March 1998 meeting. These plans, which were also reviewed in our April 1998 workshop, are necessarily preliminary. As noted in Chapter 6, decisions about scoring and scaling procedures have not been made. The draft equating plan describes procedures for putting item parameters on a common scale. This approach is consistent with the use of IRT scoring. If NAGB adopts a simpler approach on the basis of total scores, item calibration will not be needed as a step in equating.

Preliminary plans for linking VNT scores to NAEP and TIMSS are generally consistent with recommendations of the Committee on Equivalency and Test Linkage (National Research Council, 1999c). Plans call for administration of each of the measures to be linked under conditions that are as identical to operational use as possible. Students will take the NAEP assessment in February, the VNT in March, and TIMSS in April, using the administrative procedures associated with each of these assessments. Attempts will be made to account separately for sampling, measurement, and model misspecification errors in assessing the overall accuracy of each linkage. Differences in the linkages across different demographic groups will be analyzed. The initial proposal also includes efforts to monitor stability over time.

In other chapters of this report, we note that accuracy targets have not been set in advance in specifying test length or form assembly procedures. Similarly, current plans do not include targets for equating and linking accuracy. The accuracy with which students are assigned to achievement levels will depend directly and perhaps heavily on equating and linking accuracy, so further accuracy goals are needed.

Summary and Conclusions

The field test plans that we reviewed lacked sufficient rationales for several elements of the proposed design, including the use of hybrid forms, the use of equating clusters, and differential sample sizes for different forms. A key problem in creating greater specificity is that plans for scoring the operational forms have yet to be discussed. In addition, accuracy targets for reporting, and hence for equating, also do not yet exist. We find insufficient justification for the disparate treatment of the Year 1 (A) and equating (B) forms and the other forms (C–F). The hybrid forms design is totally incompatible with the use of total correct scores and may not even be a very good idea if IRT pattern scoring is to be used.

Recommendations

4-5. Plans for VNT scoring should be developed, reviewed, and approved prior to completion of revised field test plans. Scoring plans should specify whether total correct or IRT pattern scoring will be used and should indicate accuracy targets that would provide a basis for determining the accuracy with which alternative forms must be equated.

4-6. Only intact forms (i.e., not hybrid forms) should be used. Whether or not IRT scoring is adopted, the rationale for the use of hybrid forms is weak at best.

4-7. Unless a much stronger rationale is presented, the goal should be to equate all forms with the same accuracy, and plans for different sample sizes for different forms should be changed.

4-8. Analyses should be conducted to demonstrate the level of accuracy of equating results for targeted sample sizes.
NAGB should review and approve equating accuracy targets prior to adoption of the final field test plans.

4-9. The final field test plans should include an evaluation of the need for separate equating clusters. Unless a strong need for the separate clusters is demonstrated, the sample design should be simplified.
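
The simpler within-school design discussed under Equating Cluster Design, in which students in each school are assigned to the six forms by spiraling, can be illustrated with a minimal sketch. The report does not specify how spiraling would be carried out, so the details here are assumptions made for illustration: the function name `spiral_assign`, the form labels A through F, the shuffling of the student roster, and the random starting form are all hypothetical.

```python
import random
from collections import Counter

# Hypothetical labels for the six forms discussed in the text.
FORMS = ["A", "B", "C", "D", "E", "F"]


def spiral_assign(student_ids, forms=FORMS, rng=None):
    """Assign forms to students in a repeating cycle (spiraling).

    Assumptions for illustration: the roster is shuffled first and the
    cycle begins at a randomly chosen form, so that no form is tied to
    roster order or always handed out first.
    """
    rng = rng or random.Random()
    order = list(student_ids)
    rng.shuffle(order)                  # randomize who receives which booklet
    start = rng.randrange(len(forms))   # random starting point in the cycle
    return {
        sid: forms[(start + i) % len(forms)]
        for i, sid in enumerate(order)
    }


# Example: one school with 50 tested students (made-up roster).
school = [f"student_{i:03d}" for i in range(50)]
assignment = spiral_assign(school, rng=random.Random(0))
counts = Counter(assignment.values())
```

Because forms are handed out in a fixed rotation, the number of students taking each form within a school differs by at most one, which is what makes the groups taking different forms approximately randomly equivalent without the use of separate equating clusters.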