A Valedictory: Reflections on 60 Years in Educational Testing

APPENDIX: SAMPLING AND STATISTICAL PROCEDURES USED IN THE CALIFORNIA LEARNING ASSESSMENT SYSTEM

Report of the Select Committee

Lee J. Cronbach (chair), Stanford University
Norman M. Bradburn, University of Chicago and National Opinion Research Center
Daniel G. Horvitz, National Institute of Statistical Sciences

July 25, 1994
Lee J. Cronbach
850 Webster St. #623
Palo Alto, CA 94301

July 25, 1994

The Honorable William D. Dawson
Acting State Superintendent of Public Instruction
721 Capitol Mall
Sacramento, CA 95814

Dear Mr. Dawson:

I have the honor to transmit the report of the Select Committee appointed to review the methods of sampling and scoring tests and producing school and district scores in the California Learning Assessment of 1993, and the plans for Spring 1994 assessments.

CLAS and its contractors have embarked on an unprecedented and imaginative project. They have had to accomplish much in a short period of time, and they have done many things well. They are probably as far along the road to satisfying the demands made on State assessments by the new Federal Goals-2000 legislation as any organization.

Problems arose in carrying out the 1993 plan. In this innovative measurement, traditional formulas and modes of thinking break down; other major assessments are encountering similar difficulties. A risk was taken when the State moved so rapidly to official reports on schools. The 1993 assessment pioneered many techniques and operations. It was highly successful as a trial, precisely because it uncovered so many previously unrecognized problems in large-scale performance assessments.

You asked about the plan with which CLAS-1993 went into the field. The test development was praiseworthy, so far as we can judge. A limited budget precluded scoring of all student responses, and the CLAS plan distributed scoring resources intelligently over schools. But those developing the plan made unreasonably optimistic estimates of the accuracy that the resulting school reports would have. Even if perfectly executed, the plan would have produced unreliable reports for a great number of schools. As the plan was carried out, data were lost.
Matching student identification sheets to test booklets was seriously incomplete, and the responsibility is shared all along the chain from printer to school to receiving station to the contractor's sorting line for choosing responses to score. Inadequate review under time pressure led to the release of some reports that were manifestly untrustworthy.

Any assessment result is somewhat uncertain, because testing cannot be exhaustive. The uncertainty arising from using only one or two performance tasks per student, each scored only once, was unsatisfactorily large in CLAS-1993. For example, in a 60-student fourth grade where every student was scored, 30% might be counted in the superior range. Because of measurement error, we can conclude only that the true percentage of superior ability probably was between 19 and 41 -- an imprecise finding. Scoring only 31 papers (70% of the CLAS scoring target) would have added only a fraction to the uncertainty, the band becoming 16-to-44. Sampling shortfalls below the 70% level did increase uncertainty enough that validity and comparability
across schools suffered. Disregarding continuation schools in Grade-10 data, the shortfall reached this level in about 3% of schools.

The operational problems bespeak the difficulty of managing such a complex project. The present CLAS management structure makes quality control difficult; we recommend a new structure centered on a prime contractor. In addition to listing points where quality control is needed, the Committee has suggested many detailed changes in sampling rules, scoring rules, and reporting; we expect these to reduce measurement errors and errors of interpretation.

Our evaluation of plans for CLAS-1994 is limited because major changes were being made week by week as the Committee did its work, and we have not taken into account decisions made after May 13. On some points you asked about, CLAS is still reviewing alternative plans, and indeed in this report we suggest research that should be done before the reporting plans are made final. Major changes in the handling of papers returned from schools in 1994 have been installed. If there is thorough supervision, errors in selecting papers for scoring should be few. Some parts of the system could not be changed after the 1993 problems became clearly defined.

The Committee recommends that CLAS concentrate 1994 scoring and reporting on verifying that, with the changes already under way, it can produce consistently dependable school-level scores. We recommend that reporting of scores for individual pupils be limited to a trial run in a modest number of schools. Extension of the project to individual reporting will no doubt uncover problems not recognized to date, and errors in reports on students can do far more harm than errors in reports on schools.

The Committee applauds CLAS's success in constructing tests that assess reasoning and its success in obtaining cooperation from the State's educators.
All the shortcomings of CLAS-1993 can be remedied, within the inexorable limits of the time available for testing and the costs of scoring. CLAS learned a great deal from its experience to date, and we hope that it can learn further from this report. CLAS, as it matures, should be able to deliver a highly useful product.

Yours truly,

Lee J. Cronbach
Vida Jacks Professor of Education, Emeritus
Stanford University
CONTENTS

Executive Summary 22
The Promise and Challenge of CLAS 29
How the Committee Proceeded 33
The Number Describing School Performance and Its Uncertainty 34
  A recommendation on school reports 34
  Standard errors and confidence bands 35
Nonresponse Bias 37
Sampling of Students as Policy and Practice 40
  The 1993 scoring targets 41
  The shortfall in scoring 44
Operational Problems: Their Nature and Causes 47
  Loss of data 47
  Breakdowns in the management of documents 48
  Recommendations on administration and quality control 49
Analysis and Reporting at the School Level 51
  Validity issues 51
  Scoring 54
  Reliability of school scores 60
Scores for Individual Students 67
  The need for equating 67
  Reliability for individuals 67
A Final Recommendation 71
Addendum 72
End Note 74
TABLES

Table 1. Number of schools at each level of student participation 39
Table 2. Samples considered necessary in Grade-4 Writing under alternative decision rules 42
Table 3. Booklets scored as percentage of target 45
Table 4. Components contributing to the uncertainty of the school score 61
Table 5. Which contributions to the standard error are reduced by possible changes in measurement design? 61
Table 6. Variance contributions that are treated as constant over schools 63
Table 7. Variance contributions that are affected by n and N 64
Table 8. Increase in RD-4 standard error with scoring of fewer students 66
Table 9. Risk of misclassification associated with various SEs 69
Executive Summary

In 1993, the California Learning Assessment System (CLAS) administered an ambitious examination emphasizing performance measures of achievement rather than multiple-choice questions. Tests in Language Arts and Mathematics were administered to a very high proportion of California students in regular Grade 4, 8, and 10 classrooms. The examination exercises represent a major effort, generally successful. The specimens the panel has seen appear to be in harmony with recommendations of National Councils in English and Mathematics and of the Mathematical Sciences Education Board. CLAS-1993 reflects the emphasis of those groups on problem solving and effective communication.

DIFFICULTIES ENCOUNTERED IN 1993

The legislation creating CLAS anticipated that development and implementation would be gradual, and that problem-solving over some years would be required to accomplish all the goals of the legislation. It is remarkable that CLAS has achieved so much by this time. Still, CLAS and similar assessments are in uncharted waters.

CLAS has experienced difficulties of two kinds. Some are operational problems that probably would have been foreseen by a more mature organization, one with more experience in managing complex surveys and giving more thorough attention to technical planning. Other difficulties are inherent in the new types of assessment, which face unprecedented problems related to test construction, sampling, scoring rules, reporting, and statistical analysis. These problems are discovered as CLAS or another assessment encounters them, and that is why each trial run is a major learning experience. Many of the practices we question are of this second type, not to be criticized as deficiencies in CLAS management. CLAS-1994 had made some improvements before our Committee began work and is making changes with each passing week.
The activity itself is evidence that CLAS is moving toward maturity. Scoring performance measures is costly, so CLAS-1993 could score only a sample of responses. The sample was intended to be sufficient to warrant reliable school-level
summaries of student performance. Shortly after school reports appeared, it was noted that the numbers of students scored in some schools were far smaller than called for in the plan. The Committee, appointed late in April 1994, was charged to evaluate the sampling of booklets for scoring in CLAS-1993, to consider its reliability and validity, and, prospectively, to review the planned 1994 methodology and suggest improvements for the future.

NONRESPONSE

In some schools testing was seriously incomplete. Nonresponse can bias the statistics on a school. Sound statistical techniques for reducing bias due to nonresponse are available, and CLAS should use them in 1994 and thereafter. It should also try to reduce the nonresponse rate.

A number of test questions were protested by members of the public. We have no reason to think the protested tasks invalid, but some parents did refuse to let their children take the test in 1994. Wherever defections were frequent, the sample could be unrepresentative, making the 1994 assessment of that school somewhat invalid. The extent and impact of defection in 1994 should be tracked.

THE SAMPLING PLAN AND ITS EXECUTION

A fraction of student responses, chosen roughly at random, were scored and used in calculating school statistics. Sampling is a proven and cost-effective method for surveying achievement. Because samples differ, a summary statistic calculated from a sample leaves some uncertainty about the student body's true performance level. This is “sampling error.” Additional uncertainty comes from “measurement error,” present even when all students are tested and scored. Measurement error arises from the difficulty of the particular tasks assigned, and from transient factors such as how well a student felt on the day of the test. The usual index of uncertainty is the “standard error” (SE).
It is the scientific criterion for choosing a sample size, and it should be calculated in a way that recognizes both types of error.
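The way the two kinds of error combine can be sketched numerically: variances add, and the finite population correction removes sampling error entirely when every student is scored. The figures below are illustrative assumptions on our part, not CLAS estimates.

```python
import math

def school_se(pac: float, n: int, N: int, measurement_se: float) -> float:
    """Overall SE (in percentage points) for a school's percentage
    above cut (PAC): sampling variance, shrunk by the finite
    population correction for n scored out of N students, plus
    measurement variance, which remains even when n == N."""
    sampling_var = (pac * (100.0 - pac) / n) * (1.0 - n / N)
    return math.sqrt(sampling_var + measurement_se ** 2)

# With every student scored, only measurement error remains:
print(round(school_se(30.0, 60, 60, 5.0), 1))  # -> 5.0
# Scoring 31 of 60 students widens the uncertainty only moderately:
print(round(school_se(30.0, 31, 60, 5.0), 1))  # -> 7.6
```

The sketch mirrors the pattern in the transmittal letter: cutting the number of papers scored from 60 to 31 adds much less uncertainty than one might expect, because measurement error dominates.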
The plan for CLAS sampling in 1993 followed tradition in attending to sampling error only; and even as a control on sampling error, the plan rested on judgment rather than a scientific rationale. Though the plan did not hold errors down to a satisfactory degree, it did use the available scoring budget rather efficiently.

Many errors were made in carrying out the plan. Schools lost materials, and contractors made missteps. We consider scoring less than 70% of the target a serious shortfall: shortfall below that level appreciably increased SEs for a school, and it occurred in about 3% of schools (setting aside the higher figures for continuation high schools).

SCORING OF TESTS

Performance tests obtain open-ended responses. In CLAS these are judged on a 6-point scale. To be useful, the scale must mean the same thing from year to year (apart from clearly announced changes). Novel problems arise when, as in some CLAS tests, open-ended and multiple-choice sections are combined to form one score. The Committee suggests studies to verify that the CLAS combining rules are solidly based. Attention to these issues is especially critical because change in a school's scores from year to year will attract attention.

SCHOOL REPORTS

Recommendation: CLAS should focus attention on reducing measurement error in its instruments.

CLAS-1993 provided results for a school alongside those for a set of schools having similar demographic profiles. Encouraging schools to compare themselves with similar schools is an excellent practice.

The CLAS-1993 report for a school stated the percentage of students at each performance level, and attached a standard error to each percentage. Those SEs are not interpretable. The natural way to answer the question “Did our school do better than most?” is to combine percentages above some point, for example at levels 4, 5, and 6
combined. This summary percentage is readily compared across schools, and a meaningful SE can be attached.

Few school-level reports in 1993 had adequate reliability. Our benchmark for adequate precision was a “confidence band” of 8 percentage points (achieved with an SE of 2.5% or below). This promises that if 36% of students scored in a school are in the high-rated group, we are “nearly certain” that the true percentage is in the range 32-40%. (“True” refers to the proportion that would be obtained by complete and accurate measurement. “Nearly certain” means that the true proportion should fall outside the stated range in no more than about 10% of schools.)

What level of uncertainty is acceptable is not a technical decision. It is to be made by the political and educational communities, keeping in mind the tradeoff between cost and precision. But surely those communities would agree that the uncertainty in CLAS school-by-school results should be reduced from the present level.

All large performance assessments with which the Committee is acquainted are struggling to reduce measurement errors. This difficulty has become fully evident only in the past year. Large SEs are the consequence of allowing no more than one or two hours of testing time per area for examining a range of complex intellectual tasks. The Committee adapted available theory to appraise the error arising in the novel CLAS design, and doing so pushed the limits of present statistical and measurement theory. It is not surprising, then, that in 1993 no one anticipated how large the measurement errors would be.
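The benchmark band described above can be reproduced with a simple normal-theory calculation. The multiplier z = 1.6 is our assumption, back-solved from the "8-point band from an SE of 2.5" figure; it corresponds roughly to a two-sided 90% band.

```python
def confidence_band(pac: float, se: float, z: float = 1.6):
    """Band around a school's percentage above cut (PAC).
    z = 1.6 is an assumed multiplier consistent with the report's
    'nearly certain' (roughly 90%) interpretation."""
    return (pac - z * se, pac + z * se)

# The benchmark case quoted in the text: 36% observed, SE of 2.5%.
low, high = confidence_band(36.0, 2.5)
print(low, high)  # -> 32.0 40.0
```

With these assumed inputs the sketch recovers the 32-40% range stated for the 36% example.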
Changes in design that could reduce one or more sources of uncertainty, and so reduce the overall SE, are these: scoring more papers per school, entering more test forms into the spiral design, making test forms more comparable, requiring more time so that more tasks are administered, improving the consistency of scoring, and modifying the scoring rules. Increasing the number of responses scored, whether by scoring more students or more tasks per student, is especially costly. Increasing the number of forms has particular value; the evidence on this argues against publication of tasks suitable for reuse in 1995.

Taking as the summary for a school the percentage of students at high performance levels may be unwise. It appears that the average performance-level score in the school can be estimated with less uncertainty than such a percentage.
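The arithmetic behind several of these design changes is the familiar shrinkage of error variance under averaging. A minimal sketch, under the simplifying assumption of independent, equally precise task scores (the 0.7 figure is illustrative, not a CLAS estimate):

```python
def se_of_mean(task_se: float, n_tasks: int) -> float:
    """SE of the mean of n independent task scores sharing one SE:
    the error variance divides by n, so the SE shrinks only as
    sqrt(n). This is why scoring more responses helps but is
    expensive relative to the gain."""
    return task_se / n_tasks ** 0.5

print(round(se_of_mean(0.7, 1), 2))  # -> 0.7
print(round(se_of_mean(0.7, 2), 2))  # -> 0.49
print(round(se_of_mean(0.7, 4), 2))  # -> 0.35
```

Halving the SE requires quadrupling the number of tasks scored, which is the cost pressure the paragraph above describes.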
OPERATIONAL PROBLEMS AND MANAGEMENT

Recommendation: CLAS should be administered through a prime contractor using subcontractors. The prime contractor should have clear responsibility to ensure quality control. CLAS staff should focus on improving the design of the assessments, their accuracy, and the efficiency with which they are carried out.

Recommendation: CLAS should make all statistical design, sampling, and estimation the responsibility of an expert survey statistician. It should add to its staff a full-time technical coordinator who would work with the statistician and with an expert study group to analyze and interpret evidence on the accuracy of CLAS.

CLAS operations are vast and complex. The necessary work included development and production of tests and other essential documents and forms; shipping of materials to and from each school; scoring and statistical analysis; and finally, reporting. Complexity on this scale makes explicit and complete quality-control procedures vital. These were inadequate in CLAS-1993.

Large projects such as CLAS are usually carried out under a prime contractor, who delegates tasks requiring special resources. The prime contractor holds each subcontractor responsible for completing each prescribed task on time and with specified quality, and becomes responsible for the quality of all the work. Contractors should be required to implement quality control and/or assurance procedures for each of their distinct tasks and to generate periodic reports, based on these procedures, that reflect the extent to which the specified criteria are being met. If such procedures are not in place for 1994, many of the 1993 problems can be expected to reappear.

The management structure of CLAS-1993 was not much like this systematic model. There was no prime contractor; CLAS awarded at least three separate main contracts.
CLAS staff undertook to coordinate the work of the contractors, provide overall management, and monitor the quality of the diverse products. The small CLAS staff could not do all this thoroughly. Operations went awry at several points in 1993. Although fewer than 4% of the schools had data losses so severe that the original reports were officially called into question, partial data losses were widespread. These losses and the release of some undependable reports came about largely because an inadequate management structure provided poor quality control.
The recommendation on technical staffing reflects the Committee's awareness of the difficulties—in this exploratory venture—of designing sampling plans, adjusting for nonresponse, estimating standard errors, interpreting statistical findings, and improving measurement design.

Will the quality of school reports be better in CLAS-1994? Many changes in technology and organization for data processing have been made by the contractor, so mistakes should be greatly reduced. The CLAS staff is now considering the suggestion of reporting school means alongside the percentage distribution. The test has been lengthened in several areas, and improvements in the sampling plan have been installed. More responses are being scored, and we have been told of plans to improve scorer agreement. Accuracy will be better, then, in CLAS-1994, though the experience will no doubt indicate places for further improvement.

REPORTING INDIVIDUAL SCORES IN 1994

Recommendation: Reporting of scores for individual students should be limited in 1994 to experimental trials in a few schools or districts.

CLAS-1994 has planned to report on individual students in Grade 8. The number of exercises was increased so as to reduce student-level SEs. It appears that there will be few “two-step” errors in classifying individuals—of a true “Level 3” student being called “1” or “5”, for example. One-step misclassifications will be frequent. This high rate would be unacceptable wherever significant rewards and penalties are attached to the scores.

The move to report individual scores seems premature. A design superior for assessing schools creates difficulties at the individual-student level, and vice versa. Having different students take different test forms improves school reports, but the luck of the draw determines whether a student gets a comparatively easy test form or a hard one. CLAS will need a way to allow for this inequity.
Allowing for unequal opportunity to learn is another important concern. A further year of preparatory work would permit attention to these issues and to the way students, parents, and teachers react to and use reports on individuals. If delay in individual reporting requires reversal of decisions previously made by the State Board of Education and the Legislature, we recommend such reversal.
Table 9. Risk of misclassification associated with various SEs

  SE                                                    0.8    0.7    0.6    0.5
  Probability of misclassification (at least one-step)  0.56   0.51   0.45   0.39
  Probability of two-step misclassification             0.08   0.05   0.02   0.01

ments; we are not recommending that State decision makers adopt that standard, because the decision is in the best sense a political one. And we would not speak of an SE of 0.7 as “tolerable” when and if the CLAS report affects the student more strongly than seems likely this year, or when important decisions rest on the CLAS score without regard to other information in the school record. One-step misclassifications would then be a matter for concern. Higher stakes will make the reporting of confidence bands imperative.

We warn of an additional pitfall. Any teacher who compares or ranks students on the basis of PL values is likely to go astray. Students who took the same form can fairly be compared. Unless and until an equating system is in place, CLAS will be unable to compare fairly two students who took different forms.

We do not know what CLAS can do to improve 1994 scoring at this late date. But our earlier proposal to allow intermediate marks such as 3.4 or 3.5 is one inexpensive step. When a judge who believes that a response deserves a 3.5 is forced to call it a “3” or “4”, the report is false by one-half point. Such distortions mount up quickly in the error variance. Statistical summaries of scorer disagreement should be performed as 1994 scoring proceeds, with particular attention to whether errors at certain scoring sites are relatively large.18 A letter from a school principal astutely suggests also a study of scorer agreement early and late in a long day.

18. The appropriate computation would obtain the variance between raters within papers, or, in a crossed design, the usual p, r, and pr components.
The correlation coefficient and counts of agreements do not address the question adequately.
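Risks of the kind tabulated in Table 9 can be approximated with a model we assume here purely for illustration: the observed PL equals the true level plus normal error with the stated SE, and a paper is misclassified when the error exceeds half a scale point (scale endpoints ignored). The committee's figures rest on more detailed calculations, so this sketch reproduces only their general magnitude.

```python
import math

def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def p_misclassify(se: float, steps: int = 1) -> float:
    """Probability that a score rounded to the nearest performance
    level lands at least `steps` levels from the true level, under
    the assumed normal error model."""
    return 2.0 * (1.0 - normal_cdf((steps - 0.5) / se))

# Same order of magnitude as the Table 9 entries for SE = 0.8:
print(round(p_misclassify(0.8, 1), 2))  # about 0.53 (table: 0.56)
print(round(p_misclassify(0.8, 2), 2))  # about 0.06 (table: 0.08)
```

The simplified model understates the tabled risks slightly, but it shows the same pattern: one-step errors are common at these SEs while two-step errors are rare.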
Estimated standard errors

Estimating the likely SE for student reports requires assumptions. SEs will be lowered if our proposal to record intermediate ratings is adopted, but we have statistics only from tryouts with whole-number scoring. The Committee does not know how multiple-choice (MC) and open-ended (OE) scores will be combined in Reading, or how the new Constructed Response section will be weighted in Math. Decisions such as these can change the practical meaning of the PL scale and hence the meaning of an SE expressed in those units.

Writing is the simplest area to deal with. The plan is to score two tasks for every eighth grader. We assume that the PL score will simply be the average of two integers in the 1-to-6 range. From 1993 “pilot study” data, we estimate the SE in Grade 8 to be 0.55.19 This appears to be a satisfactory level of accuracy for reporting student scores.

Reading in 1994 has two parts. There is an open-ended (OE) section of two tasks. Again presuming simple averaging, analysis of a data set like that used for Writing gives an SE close to 0.5. This number is again on the 1-to-6 PL scale. The multiple-choice (MC) section is somehow to be merged with OE to get the final PL. The final SE will surely be lower than 0.5 (if MC is placed on the same scale and the final score is a weighted average). Because the SE of 0.5 is in the acceptable range, we have not tried to forecast how much combining the parts improves the SE. Thereby we avoid guesswork about a mapping process not yet developed.

In 1994 Mathematics, there are 2 OE, 7 MC, and 8 Constructed Response (CR) tasks. The squared SE for OE is estimated at 0.23,20 using the 1-to-6 scale. That for MC is 0.38 using the students' scores on a percent-correct scale. In 1993, each 14% correct was counted as roughly equivalent to one-half point on the 1-to-6 scale, implying a squared SE on that scale of 0.48. Now we must speculate.

19. This is based on a calculation made for the panel by CTB on June 2, 1994. The PL scores for two tasks were entered in a pupil-by-task generalizability study. (See Shavelson and Webb, Generalizability Theory: A Primer, pp. 27 ff. [Newbury Park, CA: Sage, 1991].) The scores apparently were in integer form despite being weighted averages of RE and WC. About 4000 students, distributed over 10 tasks, also took one common form. Scores for each group of students taking one variable form were analyzed; then findings were averaged over groups. In all, about 8000 students entered the calculation. An important technical fact: analyses made for the panel and delivered by CTB on June 17, 1994 indicate that the SE is essentially uniform for WR-8 PLs at levels from 2 to 5. The same statement holds for RD-8. This supports an assumption underlying our statements about confidence bands. (There are too few 1's and 6's for a conclusion about SEs at the extremes.)

20. This is based on Tables 4.43 and 4.44 of the Draft Technical Report.

Suppose that CR has the
same SE as MC, and that as a first step those scores are averaged, yielding a squared SE of 0.24 (because a longer test has less error). Then suppose that that composite is averaged with OE. We wind up with a speculative SE of √0.12, or 0.35.

The estimated errors for all three areas are in the same ballpark. In rethinking the mapping rules CLAS will perhaps, in effect, stretch the scales, so as to counteract the shrinkage inevitable when more tasks are averaged. We doubt that the stretching will raise the SE beyond 0.7. Although we have had to engage in patchwork reasoning on one of the most serious questions before the panel, we conclude that the errors in 1994 Grade-8 student scores are tolerable in all three areas. It is of course imperative that CLAS determine the SEs and confidence bands accurately when judging and mapping are complete, and that this information on uncertainty be communicated effectively.

A FINAL RECOMMENDATION

The Committee recommends that reporting of scores on individual students in 1994 be limited to experimental trials in a few schools or districts. These should be volunteers, but they should represent diverse communities and a range of CLAS-1993 performance levels. If this requires reversal of decisions previously made by the State Board of Education and the Legislature, we recommend such reversal. The decision is not unthinkable. At some date in 1994 CLAS did abandon its announced plan to report on individuals in Grade 4 in addition to Grade 8.

The assessment community in California and throughout the nation is being pressed to deliver dependable information when the groundwork has not been laid. A well-intentioned and popular ruling can do harm if it ignores potential hazards from rapid action. CLAS-1993 was not sufficiently accurate.
Even recognizing that improvements have been made and will continue to be made, CLAS-1994 is still a trial run to verify that quality control is adequate. Significant problems remain to be solved at the level of school assessment and reporting. We advise against embarking on large-scale reporting of student scores until CLAS has demonstrated its ability to deliver consistently dependable reports on schools. An inescapable dilemma: An assessment that tries to report at the school level and also at the student level must compromise. What improves the validity of one report (within a fixed amount of testing and scoring time) will impoverish the other. This policy
choice seems not to have been recognized, let alone resolved.

Equating of forms and equal opportunity to learn are vital concerns. A further year of preparatory work would allow for attention to these issues and for studying the way students, parents, and teachers react to and use reports on individuals.

We did find evidence of adequate reliability in pilot runs of the 1994 design, but the analysis was not based on regular field trials and rested in part on speculation about scoring rules yet to be developed.

We applaud the energy and imagination that have gone into CLAS to this point. We would not wish to see confidence in its potential undercut by premature expansion and extension.

ADDENDUM (by Lee J. Cronbach, December 1994)

I comment here on ideas that have surfaced since the Select Committee report was prepared. They emerged in my conversations with various measurement specialists, but they represent work in progress, not documented proposals. I mention them for consideration by persons making analyses similar to ours in other contexts.

We used the finite model for converting estimated variance components into school-level standard errors, but we did not use a finite correction in estimating variance components. David Wiley informs me that the finite model will be used at both stages of analysis for the 1994 CLAS.

It appears that with the finite model students should be identified by the class in which they received the relevant instruction. In the analysis of CLAS 1993, a component such as the class-by-task interaction (possibly reflecting a teacher's emphasis) is included with the pupil-by-task interaction. But if all classes are appropriately represented in the sample, the ct interaction does not contribute to error in the school score.

At one place in the End Note we introduced n*, the average number of pupils per form, to recognize that this varied from form to form within a school.
It now appears
that the finite correction on the pf component is a function not of the simple average but of the harmonic mean. Wiley and I have examined this in a limited way, but a proper algebraic proof remains to be laid out.

I believe that these changes in analysis (not all of which would have been practical with CLAS 1993) would not change the conclusions of our report, although of course specific numbers would be altered.

I mention also that erroneous numbers at two points in the original report are corrected in the version reproduced here. On p. 37, in the second paragraph, “28%-44%” replaces “19%-52%”. And on p. 68, at midpage, “1” replaces “2”.
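The harmonic-mean point is easy to illustrate. In the sketch below the per-form counts are invented; note how the harmonic mean, unlike the simple average, is pulled sharply toward the under-sampled form.

```python
def simple_average(counts):
    """Arithmetic mean of per-form counts of scored students."""
    return sum(counts) / len(counts)

def harmonic_mean(counts):
    """Harmonic mean of per-form counts; it never exceeds the
    simple average, and a single thin form drags it down."""
    return len(counts) / sum(1.0 / c for c in counts)

# Hypothetical school: two well-sampled forms and one thin one.
counts = [12, 12, 4]
print(round(simple_average(counts), 2))  # -> 9.33
print(round(harmonic_mean(counts), 2))   # -> 7.2
```

A finite correction computed from 9.33 rather than 7.2 pupils per form would overstate the correction for the thin form, which is the distinction at issue.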
End Note

Standard error calculations for school scores

At the heart of the panel's analysis of the quality of CLAS information on schools is the standard error (SE), the index of uncertainty for school scores. To estimate this SE, the panel had to go beyond available formulas, so this note must elaborate. With an eye to the broad California audience, however, we write in as nontechnical a style as the content allows. Note especially that we omit the words for “estimates of” and “approximately,” and the mathematician's symbols for them, where they would be required in a technical publication. The persons mainly responsible for the decisions about analysis discussed here were Lee J. Cronbach and David E. Wiley. We acknowledge a key suggestion from Haggai Kupermintz.

It will fix ideas to speak only of the Reading test in Grade 4. Each student devoted a period to answering a single question. Forms (questions) were spiralled over students. Responses were scored on a 1-to-6 scale referring to distinct “performance levels.” Although some papers were scored twice for the sake of quality control, just one of the two scores was used.

Our formula evaluates the score for the school. In Reading, a cut between 3 and 4 separated off students scoring 4 or better; from this came the percentage above cut, or PAC. Student scores 4, 5, and 6 were recoded as 1, all others as 0. The analysis would apply to the original scores if an SE for the school mean is wanted.

The main data for the study of error were counts of 1's and 0's for each form-school combination, in a subset of the CLAS-1993 scores. A set of many schools was analyzed together. We had as supplementary information pilot studies in which each of a school's students in Grade 4 had been scored on two Reading tasks. There were several such pairs of tasks, and we analyzed one pair at a time.
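The recoding that yields the percentage above cut is simple enough to state in a few lines; the scores below are invented for illustration.

```python
def percent_above_cut(scores, cut: int = 4) -> float:
    """Recode 1-to-6 performance levels as 1 (at or above the cut
    between 3 and 4) or 0, then report the percentage of 1's --
    the PAC statistic used for the school report."""
    recoded = [1 if s >= cut else 0 for s in scores]
    return 100.0 * sum(recoded) / len(recoded)

# Hypothetical Reading scores for ten scored papers in one school:
print(percent_above_cut([3, 4, 6, 2, 5, 4, 1, 3, 4, 5]))  # -> 60.0
```

As the text notes, the same error analysis would apply to the raw 1-to-6 scores if a school mean, rather than a PAC, were the quantity reported.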
Features of the data that required novel analysis included the following:

- Multiple forms were used in any school (“spiralling” or “matrixing”).
- Numbers of students scored varied with the school.
- We judged that student bodies should be treated as finite populations, requiring use of a “finite population correction.”
- We judged that forms should be regarded as samples from a large domain of acceptable tasks, and that the SE should recognize that source of variation as well as variation from sampling students. CLAS-1993 had previously reported SEs calculated for each school in turn, and the panel set out to make better estimates from within-school data.1
- In some tests other than Reading, student scores on the 1-to-6 scale had been reached by nonlinear combination of scores from two types of task.
- There were irregularities in spiralling. Within some schools, this or that form might be taken by three times as many students as an alternative form.
- CLAS adjusted the score for the school (“rescaling”), to recognize that otherwise the school report might give excess weight to easy forms, or to difficult forms.2 This adjustment was applied to the final score, not to single forms or students.

The formula we developed is rooted in the statistical literature. Among relevant sources are W. G. Cochran, Sampling Techniques, 3rd ed., esp. pp. 388-391 (New York: Wiley, 1977); and Lee J. Cronbach and others, The Dependability of Behavioral Measurements, esp. pp. 215-220 (New York: Wiley, 1972).

The basic model regards a score as made up of a number of components. One subset describes the person or group being measured; other components are sources of uncertainty or error. Our logic will be clearer if we present two stages of analysis separately. The first stage investigates three quantities:

σ²(p): The variation of student true scores (those that would be obtained, hypothetically, by averaging scores from an extremely large number of forms).

σ²(pf): The variation across forms of a student's form-specific true scores (an average over an extremely large number of trials).
σ²(e): Nonreproducible variation (the difference between the true score on a form and the performance score), arising chiefly from fluctuation in the efficiency of students and scorers.

1 This initial decision appears to have been unwise; the SE calculated for a single school is subject to excessive sampling error. In this report we turned to simultaneous calculations on large files.

2 D. E. Wiley, Scaling performance levels to a common metric with test tasks from separate test forms. Appendix to Draft Technical Report (DTR), CLAS, April 19, 1994 (Monterey, CA: CTB/McGraw-Hill). We believe that the rescaling does not systematically increase or decrease the components that enter our analysis. The basis for the belief is the fact that in any school where the same number of students take every form, the rescaling does not change the scores or the school distribution. We found no way to account for whatever small change in the PAC results from rescaling.
The last two combine to form σ²(pf,e), referred to as “res” (for residual) in the body of the report.

The constants to be used in the formula are n, N, k, and m. N can be thought of as the school enrollment in Grade 4, and n as the number of students scored.3 In Reading and Writing the number of forms spiralled in a school is k. In one place we use the average number of students scored per form: n* = n/k.

A standard analysis of variance produces the so-called “Mean Square within [forms]” (MSw). This estimates the sum of σ²(p) and σ²(pf,e), but these should be weighted differently. Because the two values cannot be separated in the main data, we first went to pilot-study data where it was possible to separate σ²(p) from σ²(pf,e). A ratio m (= σ²(p) divided by the sum of σ²(p) and σ²(pf,e)) was calculated for several sets of pilot data.4 Values of m varied around 0.3 in RD-4, WR-4, RD-8, and WR-8. (There were no Grade-10 pilot data.) The data available suggested using 0.5 for m in MA-4 and MA-10. Then we defined these estimates: for σ²(p), m times MSw; for σ²(pf,e), (1 − m) times MSw.

The first-stage (incomplete) formula then is

    (1/n)[(1 − n/N)σ²(p) + (1 − n*/N)σ²(pf,e)].

The more students sampled, the smaller the contribution of each source to the SE; the multiplier (1/n) in the formula allows for that. The “finite correction” (1 − n/N) recognizes that if all students in a school are tested, no variation arises from sampling of students. The model set forth here suggests another finite correction on the pf term, to recognize that the sample size for any form is (on average) n*. The e term should have

3 In empirical work N was based on a count of Student Information Forms, and might be less than the enrollment.

4 These studies originally produced DTR Table 4.37 and similar tables.
At our request CTB recoded the Reading and Writing data to the 1/0 scale and analyzed them to obtain within-school p and res components for dozens of schools where two forms had been given to the same students and scored. In Mathematics it was impractical to recode to the 1/0 scale because of the division into MC and OE sections. The adjusted p and pf,e components in the pilot analysis were observed to be nearly equal, and that was the basis for setting the value of m in Mathematics at 0.5.
no finite correction. As we have no way to separate e from pf, the multiplier slightly understates the e contribution.5

The components omitted from discussion so far are these:

σ²(s): The variation among true school PACs.

σ²(sf): The variation in the school's true PAC from form to form.

σ²(f): Variation among averages for the forms, considering all schools.

The s variance represents information about schools' true student performance, and does not enter the standard error. If forms in the assessment are regarded as random samples from a domain of suitable tasks—as is customary in present performance assessments—then the selection of forms constitutes a source of random measurement error. If CLAS develops a plan identifying particular subareas of content (“strata”) within a field and specifying the emphasis to be placed on each one, and then creates sets of forms to fit that pattern, this will in subtle ways redefine what is measured. Such test construction (plus suitable pilot-study design) permits analysis that subdivides our f and sf components into stratum differences (which are then “true variance”) and task-within-stratum differences that count as error. Stratification is an important issue that no performance assessment is yet ready to deal with, so we necessarily treat the entire f and sf components as measurement error.

The last step in obtaining the full SE is to calculate

    (1/k)[σ²(f) + σ²(sf)],

which is treated as constant for all schools.6 This is added to the stage-1 sum. The square root of the grand total is the SE.

5 Dan Horvitz has developed an alternative model starting with the concept of a two-stage sample. It suggests applying no finite correction to the pf term. The difference is practically unimportant because this Endnote's correction for that term always exceeds 0.83.

6 The sf component probably varies with the school.
In some circumstances it would be sensible to estimate an average value for schools of a defined type, but values calculated for single schools are unstable.
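The full calculation can be sketched in code. The following is a minimal illustration, not CLAS's implementation: the exact forms of the stage-1 expression and of the form-sampling term are inferred from the surrounding description (the 1/n multiplier, the finite corrections on the p and pf terms, and a form term divided by k), and all function names and numbers are invented.

```python
import math

def school_se(msw, m, n, N, k, var_f, var_sf):
    """Hedged sketch of the full SE for a school's PAC.

    msw    -- Mean Square within forms from the analysis of variance
    m      -- pilot-study ratio sigma^2(p) / (sigma^2(p) + sigma^2(pf,e))
    n, N   -- students scored and school enrollment (finite population)
    k      -- number of forms spiralled in the school
    var_f  -- sigma^2(f), variation among form averages
    var_sf -- sigma^2(sf), school-by-form interaction
    """
    var_p = m * msw                  # estimate of sigma^2(p)
    var_pfe = (1 - m) * msw          # estimate of sigma^2(pf,e)
    n_star = n / k                   # average students scored per form
    # Stage 1: within-school sampling, with the finite corrections described.
    stage1 = (1 / n) * ((1 - n / N) * var_p + (1 - n_star / N) * var_pfe)
    # Form-sampling term, treated as constant for all schools.
    form_term = (var_f + var_sf) / k
    return math.sqrt(stage1 + form_term)

# Illustrative numbers only (m = 0.3 as for Reading Grade 4).
print(round(school_se(msw=0.2, m=0.3, n=60, N=80, k=6,
                      var_f=0.004, var_sf=0.002), 4))
```

Note how the (1 − n/N) correction vanishes the student-sampling contribution when every student is scored (n = N), while the form term remains, as the text requires.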
Analyses adapted to the area

The analysis for Reading was straightforward, once we had found an efficient approach through trial and error. A file of Grade-4 data consisted of 222 schools where 7 or more students had been scored on each of the 6 forms (one form per student). Where there were more than 7 students per school-form combination, the number was cut back at random; then analysis of variance was carried out. (Reported by CTB on July 19, 1994.) This was the basis for estimating the components required in the formula. The Grade-10 analysis was similar save that there were 750 schools. The samples for all our estimates of components tended to consist of larger schools, but variance components probably have little systematic relation to school size.

The Mathematics score in 1993 posed a special problem. There were 8 multiple-choice (MC) tests, and each of these was paired with its own 2 open-end (OE) tasks. For any student, one of the two OE forms was scored. The 16 combinations were not independent. It was necessary to estimate separately the effects associated with MC tasks, OE tasks nested within MC, and their combination. If we label the corresponding simple effects m and om, and the interactions with school sm and som, then the expression for the combined f and sf contributions changes to

    (1/8)[σ²(m) + σ²(sm)] + (1/16)[σ²(om) + σ²(som)].

If the number of OE or MC tasks changes, the multipliers in the formula will change accordingly. The necessary components for Grade 4 were estimated from a file on 184 schools that had all 16 forms spiralled within the school, with at least 2 students per combination; again, larger cells were cut back to size 2. The story in Grade 10 is the same, with 98 schools and 4 students per combination. The analyses of variance were reported by CTB on July 16, 1994.

In the Writing design for Grade 10, forms were again not independent. (See Draft Technical Report, Table 1.1 and Table 1.3.)
Six Reading selections were spiralled over students. Students responding to a given Reading form were then assigned to one of two Writing tasks. We treated the data as if there were 12 forms, which tends to understate the SE. On the other hand, the plan recognized four types of writing (e.g. speculation, reflective essay), and these were equally represented among the 12 tasks. Taking this stratification into account would potentially reduce the SE. It is doubtful that the benefit from stratification can be assessed without a radically new pilot-study design. The
data from Grade 10 came from 351 schools, with 7 students per cell. (Analysis of variance reported July 19, 1994.) Grade 4 presented 6 forms, four calling for expressive and two for persuasive writing. Independence was violated by using the same writing question in two forms, the difference between forms being the associated reading selection. Analyzing as we did with k = 6 tends to understate the SE. Ignoring the stratification may work in the opposite direction. The file of data came from 217 schools with 7 students per cell.
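Across these area-by-area adaptations, what changes is the set of multipliers on the form components. A minimal sketch for the Mathematics case, assuming each component is divided by the number of tasks of its kind (8 MC tests, 16 OE tasks); this decomposition is our inference from the surrounding prose, not a reproduction of the report's formula:

```python
def math_form_term(var_m, var_sm, var_om, var_som, n_mc=8, n_oe=16):
    """Hedged sketch: combined f and sf contribution for Mathematics,
    dividing MC components by the number of MC tests and OE components
    by the number of OE tasks; the multipliers change if those counts do."""
    return (var_m + var_sm) / n_mc + (var_om + var_som) / n_oe

# Illustrative numbers only.
print(math_form_term(0.008, 0.004, 0.016, 0.008))
```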