
Appendix: Sampling and Statistical Procedures Used in the California Learning Assessment System
Pages 17-79

From page 17...
... APPENDIX: SAMPLING AND STATISTICAL PROCEDURES USED IN THE CALIFORNIA LEARNING ASSESSMENT SYSTEM. Report of the Select Committee
From page 18...
... A limited budget precluded scoring of all student responses, and the CLAS plan distributed scoring resources intelligently over schools. Those developing the plan made unreasonably optimistic estimates as to the accuracy that the resulting school reports would have.
From page 19...
... In addition to listing points where quality control is needed, the Committee has suggested many detailed changes in sampling rules, scoring rules, and reporting; we expect these to reduce measurement errors and errors of interpretation. Our evaluation of plans for CLAS-1994 is limited because major changes were being made week by week as the Committee did its work, and we have not taken into account decisions made after May 13.
From page 20...
... CONTENTS
Executive Summary 22
The Promise and Challenge of CLAS 29
How the Committee Proceeded 33
The Number Describing School Performance and Its Uncertainty 34
  A recommendation on school reports 34
  Standard errors and confidence bands 35
Nonresponse Bias 37
Sampling of Students as Policy and Practice 40
  The 1993 scoring targets 41
  The shortfall in scoring 44
Operational Problems: Their Nature and Causes 47
  Loss of data 47
  Breakdowns in the management of documents 48
  Recommendations on administration and quality control 49
Analysis and Reporting at the School Level 51
  Validity issues 51
  Scoring 54
  Reliability of school scores 60
Scores for Individual Students 67
  The need for equating 67
  Reliability for individuals 67
A Final Recommendation 71
Addendum 72
End Note 74
From page 21...
... Components contributing to the uncertainty of the school score 61
Table 5. Which contributions to the standard error are reduced by possible changes in measurement design?
From page 22...
... Some are operational problems that probably would have been foreseen by a more mature organization, having more experience in management of complex surveys and giving more thorough attention to technical planning. Other difficulties are inherent in the new types of assessment, which face unprecedented problems related to test construction, sampling, scoring rules, reporting, and statistical analysis.
From page 23...
... This is "sampling error." Additional uncertainty comes from "measurement error", present even when all students are tested and scored. Measurement error arises from the difficulty of the particular tasks assigned, and from transient factors such as how well a student felt on the day of the test.
From page 24...
... SCHOOL REPORTS
Recommendation: CLAS should focus attention on reducing measurement error in its instruments. CLAS-1993 provided results for a school alongside those for a set of schools having similar demographic profiles.
From page 25...
... But surely those communities would agree that the uncertainty in CLAS school-by-school results should be reduced from the present level. All large performance assessments with which the Committee is acquainted are struggling to reduce measurement errors.
From page 26...
... There was no prime contractor; CLAS awarded at least three separate main contracts. CLAS staff undertook to coordinate the work of the contractors, provide overall management, and monitor the quality of the diverse products.
From page 27...
... Having different students take different test forms improves school reports, but the luck of the draw determines whether a student gets a comparatively easy test form or a hard one. CLAS will need a way to allow for this inequity.
From page 28...
... Within that budgetary limitation, faults in execution of the sampling plan caused additional uncertainty. Measurement errors are the main source of uncertainty; these are reduced only in part by enlarging the scoring sample.
From page 29...
... In 1993 the California Learning Assessment System (CLAS) administered tests in Language Arts and Mathematics to over three-quarters of the students in three grades in California schools.
From page 30...
... First, it was to be based primarily on tasks that require students to answer questions and solve problems in their own words, rather than picking a best answer from the choices offered. Second, the assessments were to provide scores for individual students as well as profiles of achievement for schools and school districts.
From page 31...
... SE Standard error. May refer to the standard error of a PAC, of a student's PL score, or of a school average on the PL scale.
From page 32...
... In any grade and area, several alternative test booklets are prepared. In CLAS-1993 Reading, for example, there were 6 booklets, each presenting a different selection to be read, and questions pertinent to it.
From page 33...
... Without a full-time technical coordinator in Sacramento, drawing on advice of an expert study group, CLAS will be unable to recognize and resolve the problems these innovative assessments encounter.
From page 34...
... We recommend reporting a confidence band or standard error (concepts we shall explain shortly) for the PAC, or for the average if that becomes the summary statistic. These indicate the margin of error in the school report.
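As a rough sketch of what computing such a band involves (illustrative names and figures, not CLAS procedure), the fragment below forms a 90 percent band for a PAC from a school's scored sample using only the binomial sampling term with a finite-population correction; a real CLAS band would also have to fold in the measurement-error components discussed later in this report.

    import math

    def pac_confidence_band(n_at_or_above, n_scored, n_eligible, z=1.645):
        # Illustrative 90% band for a percentage-at-or-above-cut (PAC).
        # Covers sampling error only; measurement error would widen it further.
        p = n_at_or_above / n_scored                   # observed PAC as a proportion
        fpc = max(1.0 - n_scored / n_eligible, 0.0)    # finite-population correction
        se = math.sqrt(fpc * p * (1.0 - p) / n_scored)
        return 100 * (p - z * se), 100 * (p + z * se)  # band in percentage points

    # e.g. 40 of 100 scored students at or above the cut, 400 students eligible
    print(pac_confidence_band(40, 100, 400))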
From page 35...
... Moreover, a student will be handed one or another of the test forms, and, because of its level of difficulty or its fit to his or her particular profile of competence, it will yield a different score than another selection would have. These variations in individual scores lead to unwanted variation in school scores.
From page 36...
... These will not have had formal Committee review and are not part of the report. A target for accuracy in school reports We come now to the Committee's most critical decision: the target we set for accuracy in CLAS reports on schools.
From page 37...
... The ambition to test everybody can never be realized perfectly, but CLAS may have missed too many students. Some eligible students did not fill out a Student Information Form (SIF)
From page 38...
... We recommend that CLAS use 1993 data to find out how much difference such an adjustment can make, verifying whether the adjustment is likely to be worthwhile in 1994 and thereafter.
4 Source: Specimen school reports, p.
From page 39...
... 895 (75.4%) Total 1187 1187 1187
Response rate is defined as (number of students having live booklets) divided by (number of Student Information Forms - number of Limited English students not tested).
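A quick arithmetic check of that definition against the figures visible in this excerpt (assuming, since the fragment does not show it, that no Limited English exclusions apply to the 1,187 total):

    # Response rate = live booklets / (SIF count - Limited English not tested)
    live_booklets = 895
    sif_count = 1187
    limited_english_not_tested = 0   # not shown in the excerpt; assumed zero here
    rate = live_booklets / (sif_count - limited_english_not_tested)
    print(f"{rate:.1%}")             # prints 75.4%, matching the table entry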
From page 40...
... Nor were these students recognized in the school reports, save that a footnote in very small print remarks that if percentages at PLs 1 through 6 add to less than 100, the remaining cases were unscorable (giving some possible reasons)
From page 41...
... That display might have been prepared prior to the decision to sample; but a 1994 specimen press release for use by district superintendents also fails to mention that not all students were scored. The only hint about sampling in the bulky packet is a reproduced page from a school report in which fine print, overshadowed by much forceful information, gives counts of students assessed and students scored.
From page 42...
... Row 2 indicates the sample sizes required to reach an SE of 5.5%, implying a 90-percent confidence band about 18 percentage points wide. CLAS failed to anticipate that the targets in the first row would lead to inaccurate results, primarily because measurement errors loom large in the CLAS ... Table 2.
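The 18-point width follows from the usual normal approximation for a 90 percent band (our arithmetic, shown as a check on the figure quoted above):

    \text{width} \approx 2 \times 1.645 \times SE = 2 \times 1.645 \times 5.5 \approx 18.1 \text{ percentage points.}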
From page 43...
... Both sampling error and measurement error increase the uncertainty of a PAC and ought to be taken into account. The Committee recommends that, to the degree practicable at this late date, the scoring samples for Grades 4, 8, and 10 in CLAS-1994 be distributed so as to minimize the estimated standard errors.
From page 44...
... The Committee recommends against this kind of flat-rate policy because it is not cost-effective. The Committee recommends that CLAS engage a survey statistician who is expert in sampling theory and practice to develop all CLAS sampling designs and methods of estimation, and to assess the precision of all statistics.
From page 45...
... Table 3. Booklets Scored as Percentage of Target (Number of schools and percentage; by grade and school size)
From page 46...
... Grade 8; 70-160 students. Percentage of target scored RD WR MA Less than 70% 2 (1%)
From page 47...
... . The tasks performed from beginning to end included test-booklet development and production, production of other materials essential to the assessment (such as instructional manuals, answer sheets, Student Information Forms)
From page 48...
... which are to be held up for further review and correction, if needed, before release. The plan relied on barcodes as the principal means of collating SIFs, answer sheets, and test booklets from the same student, and linking them with the school.
From page 49...
... The Committee urges that random selection of scoring samples be done by computer, with near-equal representation of all test forms in each school.
The management structure
Large projects under government auspices are usually carried out under the leadership of a prime contractor.
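Returning to the computer selection of scoring samples urged above: a minimal sketch of one way such a draw could be made, with near-equal representation of forms within a school, is given below. The function and field names are illustrative, not the CLAS or contractor file layout.

    import random
    from collections import defaultdict
    from itertools import cycle

    def select_scoring_sample(documents, target, seed=0):
        # documents: list of (student_id, form_id) pairs for one school.
        # Randomly pick `target` documents, cycling over test forms so that
        # every form is represented about equally within the school.
        rng = random.Random(seed)
        by_form = defaultdict(list)
        for doc in documents:
            by_form[doc[1]].append(doc)
        for docs in by_form.values():
            rng.shuffle(docs)                 # random order within each form
        sample = []
        for form in cycle(sorted(by_form)):   # round-robin across forms
            if len(sample) >= target or not any(by_form.values()):
                break
            if by_form[form]:
                sample.append(by_form[form].pop())
        return sample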
From page 50...
... Quality control
We judge that many of the operational problems that occurred would have been avoided if explicit quality control procedures had been in place and their adequacy reviewed periodically by CLAS staff. CTB should have a data-receipt and document-control-and-storage system that accounts for each and every document, where it is stored, and where it stands with respect to each step in the operation, including sampling and scoring.
From page 51...
... if the test is reporting accurately on the competences students should be acquiring, (ii) if irrelevant features of test tasks and the conditions surrounding them are not making scores better or worse than the students' competence justifies, and (iii)
From page 52...
... A minimum safeguard when student scores are reported is to flag students with (say) 30% omissions.
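A sketch of that safeguard (the names are ours, and the 30 percent threshold is the report's "say" figure, not an adopted CLAS rule):

    def flag_high_omission(responses, threshold=0.30):
        # responses: one student's item responses, with None marking an omission.
        omitted = sum(1 for r in responses if r is None)
        return omitted / len(responses) >= threshold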
From page 53...
... Any editor would delete a claim so hollow that 95 out of 100 schools could truthfully make it. The CG device should be supplemented with a report showing how, across schools, performance levels relate to community factors.
From page 54...
... Similar but different nonlinearities appear in maps for other test forms. The judgments should be validated.
From page 55...
... As a by-product of an improved measurement design, CLAS-1994 runs a grave risk of public misunderstanding. We illustrate with Writing.
From page 56...
... The invisible student Earlier we described the presence in some scoring samples of papers to which no numerical grade was assigned. Most of these "X" students evidently failed the test by any usual standard.
From page 57...
... Scoring accuracy in 1993
The accuracy of school scores and student scores is improved by suitable distribution of scorers over responses. We understand that CLAS scoring has followed the desirable practice of "spiralling" in scoring, so that if 6 students in a school respond to the same task their papers are likely to be assigned to 6 different judges.
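An illustration of the effect of spiralling (our sketch, not the contractor's actual procedure): papers from the same school and task are dealt out across the scorer pool in rotation, so that with at least as many judges as papers, no two of those papers reach the same judge.

    from itertools import cycle

    def spiral_assign(papers, scorers):
        # Deal papers out to scorers in rotation; returns (paper, scorer) pairs.
        return list(zip(papers, cycle(scorers)))

    # Six papers from one school on the same task, a larger pool of judges:
    papers = [f"school17-paper{i}" for i in range(1, 7)]
    judges = [f"judge{j}" for j in range(1, 13)]
    print(spiral_assign(papers, judges))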
From page 58...
... 12 The Draft Technical Report (Table 4.13) shows that in RD-4 two scorings of the same paper agree 60% of the time.
From page 59...
... (The final PL scores can be reported as whole numbers; we are not advocating the reporting of refined student scores such as 3.25.) In Writing, the processing ought to retain one decimal place in each of the two part scores on a task that will be weighted to get a task PL.
From page 60...
... The school-level standard error
At each step in assessment, some action adds to or subtracts from the school's net score, bringing it closer to or farther from the true score. The size of each influence is described by a variance component, and these add up to the error variance.
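In symbols (our notation, following the report's description rather than quoting its formula): if the separate influences are roughly independent, each contributes a variance component, and

    \sigma^2_{\text{error}} = \sigma^2_1 + \sigma^2_2 + \cdots + \sigma^2_m,
    \qquad SE_{\text{school}} = \sqrt{\sigma^2_{\text{error}}}.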
From page 61...
... An inattentive scorer may overlook a faulty ... k is the number of forms spiralled in a school, n is the number of students scored, and N is the number of students eligible for testing.
From page 62...
... We recommend that CLAS convene a small group of mathematical statisticians to compare the two models mentioned in the footnote to Table 4 and also to advise on the choice of procedures for estimating CLAS-1994 SEs. For CLAS-1995, we recommend planning for every student in some schools to take at least two test forms and to arrange double scoring for those responses, so as to separate components associated with students, scorers, forms, and their interactions.
From page 63...
... Evidently, variation is substantial in those areas with respect to the match of the demands of various forms to the schools' particular curricula and instructional methods. The contribution arising from the MC item-sets in MA-10 is large, and suggests a fault in test construction.
From page 64...
... Reusing tasks that have been made public invites inflation of 1995 results, because some teachers and parents will encourage practice on anticipated test tasks. Such inflation has occurred in other States.
From page 65...
... Increases in the SE with declining n are gradual. The reader should reflect on the fact that increasing scoring from 100 papers to 200 (a costly operation) would have narrowed the confidence interval by only 15%.
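A hedged illustration of why the gain from doubling the scoring sample is so small (the component sizes below are invented for the example, not the report's estimates): part of the school-level error variance, such as the form- and scorer-related components, is unaffected by how many papers are scored, and only the remainder shrinks in proportion to 1/n.

    import math

    # Invented split of a school's error variance (not CLAS's estimates):
    fixed_part = 1.0        # components untouched by scoring more papers
    per_paper_part = 125.0  # component that scales as per_paper_part / n

    def se(n):
        return math.sqrt(fixed_part + per_paper_part / n)

    print(se(200) / se(100))   # about 0.85, i.e. roughly a 15% narrower interval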
From page 66...
... In Grade 10, 318 reports out of 3,549 were affected, the higher proportion presumably being traceable to continuation schools. Overall, shortfall in meeting sampling targets was a less pervasive and substantial influence on the accuracy of CLAS-1993 school reports than the fact that in middle-sized and large schools the targets were low.
From page 67...
... CLAS is like other similar assessments in not having come to grips with the fact that a design superior for assessing schools creates difficulties at the student level, and vice versa. In a matrix design the luck of the draw determines whether a student gets a comparatively easy test form or a hard one.
From page 68...
... scores should be much concerned about two-step errors, for example where the student at 3.3 is reported as at 1 or 5. For the year of this report, we consider an SE of 0.7 tolerable, although 5% of students will have grossly incorrect reports.17 Our choice of the 0.7 level is a device to simplify our com ...
17 It is pointless to ask whether this standard is more or less severe than the 2.5% proposed for judging school reports.
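A rough normal-approximation reading of the 0.7 and 5 percent figures above (our arithmetic, not the Committee's): with a standard error of 0.7 on the performance-level scale, about 5 percent of students would be reported more than

    1.96 \times 0.7 \approx 1.4

performance-level points away from their true level, which is on the order of the two-step errors the text warns about.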
From page 69...
... Higher stakes will make the reporting of confidence bands imperative. We warn of an additional pitfall.
From page 70...
... 20 This is based on Tables 4.43 and 4.44 of the Draft Technical Report.
From page 71...
... We advise against embarking on large-scale reporting of student scores until CLAS has demonstrated its ability to deliver consistently dependable reports on schools. An inescapable dilemma: An assessment that tries to report at the school level and also at the student level must compromise.
From page 72...
... 1. We used the finite model for converting estimated variance components into school-level standard errors, but we did not use a finite correction in estimating variance components.
From page 73...
... that the finite correction on the ... component is a function not of the simple average but of the harmonic mean. Wiley and I have examined this in a limited way, but a proper algebraic proof remains to be laid out.
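To make the distinction concrete (a sketch with invented scored-sample sizes, not the report's data): the harmonic mean of the within-school counts is pulled down by small schools, so a finite correction built on it differs from one built on the simple average.

    from statistics import harmonic_mean, mean

    # Invented scored-sample sizes for a handful of schools.
    n_scored = [12, 25, 40, 60, 100]
    print(mean(n_scored))            # simple average: 47.4
    print(harmonic_mean(n_scored))   # harmonic mean: about 28.6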
From page 74...
... Student scores 4, 5, and 6 were recoded as 1, all others as 0. The analysis would apply to the original scores if an SE for the school mean is wanted.
From page 75...
... Scaling performance levels to a common metric with test tasks from separate test forms. Appendix to Draft Technical Report (DTR)
From page 76...
... The e term should have ... In empirical work, N was based on a count of Student Information Forms, and might be less than the enrollment.
From page 77...
... The s variance represents information about schools' true student performance, and does not enter the standard error. If forms in the assessment are regarded as random samples from a domain of suitable tasks, as is customary in present performance assessments, then the selection of forms constitutes a source of random measurement error.
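One random-forms model consistent with this description (our notation and an approximation, not the Committee's estimating equations): with n of a school's N eligible students scored over k spiralled forms, the between-school component stays out of the error, and roughly

    SE^2(\bar{x}_{\text{school}}) \approx \frac{\sigma^2_{f} + \sigma^2_{sf}}{k}
    + \left(1 - \frac{n}{N}\right)\frac{\sigma^2_{p:s}}{n} + \frac{\sigma^2_{e}}{n},

where \sigma^2_{f} and \sigma^2_{sf} are the form and school-by-form components, \sigma^2_{p:s} is the student-within-school component, and \sigma^2_{e} is the residual.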
From page 78...
... (See Draft Technical Report, Tables 1.!
From page 79...
... data from Grade 10 came from 351 schools, with 7 students per cell. (Analysis of variance reported July 19, 1994.)

