Suggested Citation:"Appendix: Sampling and Statistical Procedures Used in the California Learning Assessment System." National Research Council. 1995. A Valedictory: Reflections on 60 Years in Educational Testing. Washington, DC: The National Academies Press. doi: 10.17226/9244.

APPENDIX: SAMPLING AND STATISTICAL PROCEDURES USED IN THE CALIFORNIA LEARNING ASSESSMENT SYSTEM

Report of the Select Committee

Lee J. Cronbach (chair)

Stanford University

Norman M. Bradburn

University of Chicago and National Opinion Research Center

Daniel G. Horvitz

National Institute of Statistical Sciences

July 25, 1994

Lee J. Cronbach

850 Webster St. #623 Palo Alto, CA 94301

July 25, 1994

The Honorable William D. Dawson

Acting State Superintendent of Public Instruction

721 Capitol Mall Sacramento, CA 95814

Dear Mr. Dawson:

I have the honor to transmit the report of the Select Committee appointed to review the methods of sampling and scoring tests and producing school and district scores in the California Learning Assessment of 1993 and the plans for Spring 1994 assessments.

CLAS and its contractors have embarked on an unprecedented and imaginative project. They have had to accomplish much in a short period of time, and they have done many things well. They are probably as well along the road to satisfying the demands made on State assessments by the new Federal Goals-2000 legislation as any organization.

Problems arose in carrying out the 1993 plan. In this innovative measurement, traditional formulas and modes of thinking break down; other major assessments are encountering similar difficulties. A risk was taken when the State moved so rapidly to official reports on schools. The 1993 assessment pioneered many techniques and operations. It was highly successful as a trial, precisely because it uncovered so many previously unrecognized problems in large-scale performance assessments.

You asked about the plan with which CLAS-1993 went into the field. The test development was praiseworthy, so far as we can judge. A limited budget precluded scoring of all student responses, and the CLAS plan distributed scoring resources intelligently over schools. Those developing the plan made unreasonably optimistic estimates as to the accuracy that the resulting school reports would have. Even if perfectly executed, the plan would have produced unreliable reports for a great number of schools.

As the plan was carried out, data were lost. Matching student identification sheets to test booklets was seriously incomplete, and the responsibility is shared all along the chain from printer to school to receiving station to the contractor's sorting line for choosing responses to score. Inadequate review under time pressure led to the release of some reports that were manifestly untrustworthy.

Any assessment result is somewhat uncertain, because testing cannot be exhaustive. The uncertainty arising from using only one or two performance tasks per student, and having each scored only once, was unsatisfactorily large in CLAS-1993. For example, in a 60-student fourth grade where every student was scored, 30% might be counted in the superior range. Because of measurement error, we can conclude only that the percentage truly of superior ability probably was between 19 and 41, an imprecise finding. Scoring only 31 papers (70% of the CLAS scoring target) would have added only a fraction to the uncertainty, the band becoming 16-to-44. Sampling shortfalls below the 70% level did increase uncertainty enough that validity and comparability across schools suffered. Disregarding continuation schools in Grade-10 data, the shortfall reached this level in about 3% of schools.

The operational problems bespeak the difficulty of managing such a complex project. The present CLAS management structure makes quality control difficult; we recommend a new structure centered on a prime contractor. In addition to listing points where quality control is needed, the Committee has suggested many detailed changes in sampling rules, scoring rules, and reporting; we expect these to reduce measurement errors and errors of interpretation.

Our evaluation of plans for CLAS-1994 is limited because major changes were being made week by week as the Committee did its work, and we have not taken into account decisions made after May 13. On some points you asked about, CLAS is still reviewing alternative plans, and indeed in this report we suggest research that should be done before the reporting plans are made final. Major changes in the handling of papers returned from schools in 1994 have been installed. If there is thorough supervision, errors in selecting papers for scoring should be few. Some parts of the system could not be changed after the 1993 problems became clearly defined.

The Committee recommends that CLAS concentrate 1994 scoring and reporting on verifying that, with the changes already under way, it can produce consistently dependable school-level scores. We recommend that reporting of scores for individual pupils be limited to a trial run in a modest number of schools. Extension of the project to individual reporting will no doubt uncover problems not recognized to date, and errors in reports on students can do far more harm than errors in reports on schools.

The Committee applauds CLAS's success in constructing tests that assess reasoning and its success in obtaining cooperation from the State's educators. All the shortcomings of CLAS-1993 can be remedied, within the inexorable limits of the time available for testing and the costs of scoring. CLAS learned a great deal from its experience to date, and we hope that it can learn further from this report. CLAS, as it matures, should be able to deliver a highly useful product.

Yours truly,

Lee J. Cronbach

Vida Jacks Professor of Education, Emeritus

Stanford University

Executive Summary
The Promise and Challenge of CLAS
How the Committee Proceeded
The Number Describing School Performance and Its Uncertainty
	A recommendation on school reports
	Standard errors and confidence bands
Nonresponse Bias
Sampling of Students as Policy and Practice
	The 1993 scoring targets
	The shortfall in scoring
Operational Problems: Their Nature and Causes
	Loss of data
	Breakdowns in the management of documents
	Recommendations on administration and quality control
Analysis and Reporting at the School Level
	Validity issues
	Scoring
	Reliability of school scores
Scores for Individual Students
	The need for equating
	Reliability for individuals
A Final Recommendation
Addendum
End Note

TABLES

Table 1.	Number of schools at each level of student participation
Table 2.	Samples considered necessary in Grade-4 Writing under alternative decision rules
Table 3.	Booklets scored as percentage of target
Table 4.	Components contributing to the uncertainty of the school score
Table 5.	Which contributions to the standard error are reduced by possible changes in measurement design?
Table 6.	Variance contributions that are treated as constant over schools
Table 7.	Variance contributions that are affected by n and N
Table 8.	Increase in RD-4 standard error with scoring of fewer students
Table 9.	Risk of misclassification associated with various SEs

Executive Summary

In 1993, the California Learning Assessment System (CLAS) administered an ambitious examination emphasizing performance measures of achievement rather than multiple-choice questions. Tests in Language Arts and Mathematics were given to a very high proportion of California students in regular Grade 4, 8, and 10 classrooms.

The examination exercises represent a major effort, generally successful. The specimens the panel has seen appear to be in harmony with recommendations of National Councils in English and Mathematics and of the Mathematical Sciences Education Board. CLAS-1993 reflects the emphasis of those groups on problem solving and effective communication.

DIFFICULTIES ENCOUNTERED IN 1993

The legislation creating CLAS anticipated that development and implementation would be gradual, and that problem-solving over some years would be required to accomplish all the goals of the legislation. It is remarkable that CLAS has achieved so much by this time. Still, CLAS and similar assessments are in uncharted waters.

CLAS has experienced difficulties of two kinds. Some are operational problems that probably would have been foreseen by a more mature organization, having more experience in management of complex surveys and giving more thorough attention to technical planning. Other difficulties are inherent in the new types of assessment, which face unprecedented problems related to test construction, sampling, scoring rules, reporting, and statistical analysis. These problems are discovered as CLAS or another assessment encounters them, and that is why each trial run is a major learning experience.

Many of the practices we question are of this second type, not to be criticized as deficiencies in CLAS management. CLAS-1994 had made some improvements before our Committee began work and is making changes with each passing week. The activity itself is evidence that CLAS is moving toward maturity.

Scoring performance measures is costly, so CLAS-1993 could score only a sample of responses. The sample was intended to be sufficient to warrant reliable school-level summaries of student performance. Shortly after school reports appeared, it was noted that the numbers of students scored in some schools were far smaller than called for in the plan. The Committee, appointed late in April 1994, was charged to evaluate the sampling of booklets for scoring in CLAS-1993, to consider the reliability and validity of that assessment, and, prospectively, to consider the planned 1994 methodology and suggest improvements for the future.

NONRESPONSE

In some schools testing was seriously incomplete. Nonresponse can bias the statistics on a school. Sound statistical techniques for reducing bias due to nonresponse are available and CLAS should use them in 1994 and thereafter. It should also try to reduce the nonresponse rate.

A number of test questions were protested by members of the public. We have no reason to think the protested tasks invalid, but some parents did refuse to let their children take the test in 1994. Wherever defections were frequent, the sample could be unrepresentative, making the 1994 assessment of that school somewhat invalid. The extent and impact of defection in 1994 should be tracked.

THE SAMPLING PLAN AND ITS EXECUTION

A fraction of student responses, chosen roughly at random, were scored and used in calculating school statistics. Sampling is a proven and cost-effective method for surveying achievement. Because samples differ, a summary statistic calculated from a sample leaves some uncertainty about the student body's true performance level. This is “sampling error.” Additional uncertainty comes from “measurement error”, present even when all students are tested and scored. Measurement error arises from the difficulty of the particular tasks assigned, and from transient factors such as how well a student felt on the day of the test.

The usual index of uncertainty is the “standard error” (SE). It is the scientific criterion for choosing a sample size, and it should be calculated in a way that recognizes both types of error.
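One common way to recognize both types of error in a single SE is to add the two variance contributions, on the assumption that sampling error and measurement error are independent. The sketch below illustrates the idea for a percentage. The function names, the finite-population correction, and every numerical input are illustrative assumptions, not the Committee's formulas or CLAS data.

```python
import math

def sampling_se(p, n, N):
    """SE (in percentage points) of a percentage p estimated from n papers
    drawn at random, without replacement, from the N students tested.
    Assumes simple random sampling with a finite-population correction."""
    if not (0 < n <= N):
        raise ValueError("need 0 < n <= N")
    fpc = (N - n) / (N - 1) if N > 1 else 0.0
    return math.sqrt(p * (100 - p) / n * fpc)

def total_se(se_sampling, se_measurement):
    """Combine two independent error sources by adding their variances."""
    return math.sqrt(se_sampling ** 2 + se_measurement ** 2)

# Hypothetical school: 120 students tested, 60 papers scored, 30% above the
# cut, and an assumed measurement SE of 5 percentage points.
se_s = sampling_se(p=30.0, n=60, N=120)
se_t = total_se(se_s, se_measurement=5.0)
print(round(se_s, 2), round(se_t, 2))
```

When every paper is scored (n equal to N) the sampling term vanishes, but the measurement term remains, which is the sense in which measurement error is present even when all students are tested and scored.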

The plan for CLAS sampling in 1993 followed a tradition when it looked at sampling error only; but even as a control on sampling error, the plan rested on judgment and not on a scientific rationale. Though the plan did not hold down errors to a satisfactory degree, it did use the available scoring budget rather efficiently.

Many errors were made in carrying out the plan. Schools lost materials, and contractors made missteps. Shortfall below 70% of target appreciably increased SE's for a school. We consider scoring less than 70% of the target a serious shortfall, and that occurred in about 3% of schools (setting aside the higher figures for continuation high schools).

SCORING OF TESTS

Performance tests obtain open-ended responses. In CLAS these are judged on a 6-point scale. To be useful, the scale must mean the same thing from year to year (apart from clearly announced changes). Novel problems arise when, as in some CLAS tests, open-ended and multiple-choice sections are combined to form one score. The Committee suggests studies to verify that the CLAS combining rules are solidly based. These issues are especially critical because change in a school's scores from year to year will be closely watched.

SCHOOL REPORTS

Recommendation: CLAS should focus attention on reducing measurement error in its instruments.

CLAS-1993 provided results for a school alongside those for a set of schools having similar demographic profiles. Encouraging schools to compare themselves with similar schools is an excellent practice.

The CLAS-1993 report for a school stated the percentage of students at each performance level, and attached a standard error to each percentage. Those SE's are not interpretable. The natural way to answer the question “Did our school do better than most?” is to combine percentages above some point, for example at levels 4, 5, and 6 combined. This summary percentage is readily compared across schools, and a meaningful SE can be attached.

Few school-level reports in 1993 had adequate reliability. Our benchmark for adequate precision was a “confidence band” of 8 percentage points (achieved with an SE of 2.5% or below). This promises that if 36% of students scored in a school are in the high-rated group, we are “nearly certain” that the true percentage is in the range 32-40%. (“True” refers to the proportion that would be obtained by complete and accurate measurement. “Nearly certain” means that the true proportion is unlikely to fall outside the stated range for more than 10% of all schools.)
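The arithmetic behind this benchmark can be checked with a short sketch. It assumes a normal approximation and reads “nearly certain” as a two-sided 90% band, which gives a multiplier of about 1.645. That reading, the multiplier, and the function name are assumptions of the sketch, not values stated in the report.

```python
from statistics import NormalDist

def confidence_band(pct, se, coverage=0.90):
    """Return (low, high) limits of a two-sided band around an observed
    percentage, assuming normally distributed error with the given SE."""
    z = NormalDist().inv_cdf(0.5 + coverage / 2)  # about 1.645 for 90%
    half = z * se
    return pct - half, pct + half

low, high = confidence_band(36.0, 2.5)
print(f"{low:.1f} to {high:.1f}")  # prints 31.9 to 40.1
```

Rounded to whole percentage points, this reproduces the 32-40% range, a band about 8 points wide.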

What level of uncertainty is acceptable is not a technical decision. It is to be made by the political and educational communities, keeping in mind the tradeoff between cost and precision. But surely those communities would agree that the uncertainty in CLAS school-by-school results should be reduced from the present level.

All large performance assessments with which the Committee is acquainted are struggling to reduce measurement errors. This difficulty has become fully evident only in the past year. Large SEs are the consequence of allowing no more than one or two hours of testing time per area for examining on a range of complex intellectual tasks. The Committee adapted available theory to appraise the error arising in the novel CLAS design, and to do so pushed the limits of present statistical and measurement theory. It is not surprising, then, that in 1993 no one anticipated how large the measurement errors would be.

Changes in design that could reduce one or more sources of uncertainty, and so reduce the overall SE, are these: scoring more papers per school, entering more test forms into the spiral design, making test forms more comparable, requiring more time so that more tasks are administered, improving the consistency of scoring, and modifying the scoring rules. Increasing the number of responses scored, whether by scoring more students or more tasks per student, is especially costly. Increasing the number of forms has particular value; the evidence on this argues against publication of tasks suitable for reuse in 1995.

Taking as the summary for a school the percentage of students at high performance levels may be unwise. It appears that the average performance-level score in the school can be estimated with less uncertainty than such a percentage.

OPERATIONAL PROBLEMS AND MANAGEMENT

Recommendation: CLAS should be administered through a prime contractor using subcontractors. The prime contractor should have clear responsibility to ensure quality control. CLAS staff should focus on improving the design of the assessments, their accuracy, and the efficiency with which they are carried out.

Recommendation: CLAS should make all statistical design, sampling, and estimation the responsibility of an expert survey statistician. It should add to its staff a full-time technical coordinator who would work with the statistician and with an expert study group, to analyze and interpret evidence on the accuracy of CLAS.

CLAS operations are vast and complex. The necessary work included development and production of tests and other essential documents and forms; shipping of materials to and from each school; scoring and statistical analysis; and finally, reporting. Complexity on this scale makes explicit and complete quality-control procedures vital. These were inadequate in CLAS-1993. Large projects such as CLAS are usually carried out under a prime contractor, who delegates tasks requiring special resources. The prime contractor holds each subcontractor responsible for completing each prescribed task on time and with specified quality. The prime contractor becomes responsible for the quality of all the work. Contractors should be required to implement quality control and/or assurance procedures for each of their distinct tasks and to generate periodic reports, based on these procedures, that reflect the extent to which the specified criteria are being met. If such procedures are not in place for 1994, many of the 1993 problems can be expected to reappear.

The management structure of CLAS-1993 was not much like this systematic model. There was no prime contractor; CLAS awarded at least three separate main contracts. CLAS staff undertook to coordinate the work of the contractors, provide overall management, and monitor the quality of the diverse products. The small CLAS staff could not do all this thoroughly.

Operations went awry at several points in 1993. Although fewer than 4% of the schools had data losses so severe that the original reports were officially called into question, partial data losses were widespread. These losses and the release of some undependable reports came about largely because an inadequate management structure provided poor quality control.

The recommendation on technical staffing reflects the Committee's awareness of the difficulties—in this exploratory venture—of designing sampling plans, adjusting for nonresponse, estimating standard errors, interpreting statistical findings, and improving measurement design.

Will the quality of school reports be better in CLAS-1994? Many changes in technology and organization for data processing have been made by the contractor, so mistakes should be greatly reduced. The CLAS staff now is considering the suggestion of reporting school means alongside the percentage distribution. The test has been lengthened in several areas, and improvements in the sampling plan have been installed. More responses are being scored, and we have been told of plans to improve scorer agreement. Accuracy will be better, then, in CLAS-1994, though the experience will no doubt indicate places for further improvement.

REPORTING INDIVIDUAL SCORES IN 1994

Recommendation: Reporting of scores for individual students should be limited in 1994 to experimental trials in a few schools or districts.

CLAS-1994 has planned to report on individual students in Grade 8. The number of exercises was increased so as to reduce student-level SE's. It appears that there will be few “two step” errors in classifying individuals—of a true “Level 3” student being called “1” or “5”, for example. One-step misclassifications will be frequent. This high rate would be unacceptable wherever significant rewards and penalties are attached to the scores. The move to report individual scores seems premature.
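Why one-step errors can be frequent while two-step errors stay rare is easy to see in a small sketch. It assumes normally distributed error, performance levels one unit wide, and a student whose true score sits at the center of a level (the most favorable case). The SE values are hypothetical, chosen only for illustration, and the function name is an assumption of the sketch.

```python
from statistics import NormalDist

def misclassification_risk(se, steps=1, half_width=0.5):
    """Probability that the reported level is at least `steps` levels away
    from the true level, for a student at the center of a level.
    Assumes normal error with standard deviation `se` on the PL scale."""
    threshold = (steps - 1) + half_width
    return 2 * (1 - NormalDist(0, se).cdf(threshold))

for se in (0.4, 0.6, 0.8):  # hypothetical student-level SEs
    one = misclassification_risk(se, steps=1)
    two = misclassification_risk(se, steps=2)
    print(f"SE={se}: off by 1+ levels {one:.2f}, off by 2+ levels {two:.3f}")
```

A student whose true score lies near a boundary between levels would face a higher one-step risk than these figures suggest.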

A design superior for assessing schools creates difficulties at the individual-student level, and vice versa. Having different students take different test forms improves school reports, but the luck of the draw determines whether a student gets a comparatively easy test form or a hard one. CLAS will need a way to allow for this inequity. Allowing for unequal opportunity to learn is another important concern. A further year of preparatory work would permit attention to these issues and to the way students, parents, and teachers react to and use reports on individuals.

If delay in individual reporting requires reversal of decisions previously made by the State Board of Education and the Legislature, we recommend such reversal.

CONCLUSION

The assessment community in California and throughout the nation is being pressed to deliver dependable information when the groundwork for accurate performance assessment has not been laid. A well-intentioned and popular mandate can do harm if it ignores potential hazards from too-rapid action. The Committee found that CLAS-1993 fell far short of its intended quality; to the extent that we know the facts, all States using performance assessment are facing similar difficulties with unreliability.

Limiting the number of responses scored necessarily increased the uncertainty of school scores in CLAS-1993. Within that budgetary limitation, faults in execution of the sampling plan caused additional uncertainty. Measurement errors are the main source of uncertainty; these are reduced only in part by enlarging the scoring sample. Even recognizing that improvements have been made and will continue to be made, significant problems remain to be solved at the level of school reporting. We advise against embarking on large-scale student-by-student reporting until CLAS has demonstrated its ability to deliver consistently dependable reports on schools.

We applaud the energy and imagination that have gone into CLAS to this point. But public confidence in CLAS was severely tested by the experience in 1993. We do not wish to see confidence in its potential further undercut by premature expansion and extension.

Report of the Select Committee on Sampling and Statistical Procedures in the California Learning Assessment System

The drive to reform America's schools has gained steadily in momentum, starting with publication of A Nation at Risk in 1983, building up to the Governors' Conference of 1989, and reaching in 1994 the landmark passage of the Goals 2000 legislation with bipartisan support. A great number of States rely on improved assessment as a strategy for change.

Supporters of intensified assessment have various expectations. Some hope to stimulate teachers and students to work harder. Some hope that assessment exercises geared to educational goals neglected in past teaching and testing will bring those high-level goals to prominence, and so change instruction. Some hope primarily to improve the quality of information used in decisions about education made by administrators, public representatives, and parents.

California has been a pioneer in this movement. No other large State has come so far toward practical realization of the ideals expressed in the Goals 2000 legislation and in numerous reports of leadership groups. In 1993 the California Learning Assessment System (CLAS) administered tests in Language Arts and Mathematics to over three-quarters of the students in three grades in California schools.

THE PROMISE AND CHALLENGE OF CLAS

California began a statewide system of testing public school students in 1961 and has continued to innovate in assessment systems. In 1972, the State's testing system became the California Assessment Program (CAP), which focused primarily on the effectiveness of instructional programs rather than on the progress of individual students. CAP operated at the frontier of testing technology by implementing such procedures as matrix sampling, a method in which each student receives a different set of items, and test content developed using item response theory, a psychometric theory that provides a means of scaling test items. CAP used machine-scored multiple-choice questions.
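Matrix sampling is often put into practice by rotating ("spiraling") test forms through the students in a classroom, so that each form reaches a near-equal share of them. The sketch below shows that rotation only. The form labels, student numbers, and function name are hypothetical, and nothing here is claimed to be the procedure CAP or CLAS actually used.

```python
from collections import Counter

def spiral_assign(student_ids, forms):
    """Hand out forms in rotation so that each form goes to a near-equal
    share of students (round-robin assignment)."""
    if not forms:
        raise ValueError("need at least one form")
    return {sid: forms[i % len(forms)] for i, sid in enumerate(student_ids)}

# Hypothetical class of 10 students and three forms.
assignment = spiral_assign(range(1, 11), ["A", "B", "C"])
print(Counter(assignment.values()))
```

Because each student sees only one form, the school as a whole is measured on more tasks than any one student could complete in the time available.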

Multiple-choice questions have come under severe criticism for not assessing students' ability to perform complex cognitive tasks. Beginning in the mid-1980's, CAP implemented a performance assessment in Writing and began the developmental work necessary to move toward a system based to a greater extent on performance measures in Language Arts and Mathematics. In 1991 the Legislature and the Governor, through Senate Bill 662, mandated development of CLAS. This system was to differ in two main ways from CAP. First, it was to be based primarily on tasks that require students to answer questions and solve problems in their own words, rather than picking a best answer from the choices offered. Second, the assessments were to provide scores for individual students as well as profiles of achievement for schools and school districts. CLAS continued to use matrix sampling and item response theory.

The examination tasks developed for CLAS represent a major and generally successful effort. The specimens the panel has seen appear to be in harmony with the recommendations of the National Councils in English and Mathematics and of the Mathematical Sciences Education Board. A key recommendation of those groups of educators is to place greater emphasis on problem solving and effective communication. CLAS-1993 reflects this emphasis.

The legislation creating CLAS anticipated that development and implementation would be gradual, and that problem-solving over a period of some years would be required to accomplish all the goals of the legislation. Having a faultless system in place during the first year of operation was not to be expected.

Risks and costs in innovation

Improvements in measuring important academic goals come at increased cost. In particular, performance measures are costly to score. They require judgments by specially trained scorers, usually teachers in the areas being assessed. Because of budget limitations, it was necessary in CLAS-1993 to score only a sample of the response booklets from each school. The sample was to be sufficiently large to enable CLAS to report a reliable score in each area for each school in the State.

CLAS has experienced difficulties of two kinds. The operational problems have precedents of sorts, and many could have been foreseen in 1993 by an organization having more experience in management of complex surveys, and giving thorough consideration to technical planning. The CAP of earlier years tested similar numbers of students; but short machine-scored tests provided no foretaste of the logistical difficulties entailed by a radically new approach.

The second kind of difficulty is inherent in the novelty of the assessment; difficulties of this kind were and are unforeseeable. Every large performance assessment is having to think through unprecedented problems related to sampling, scoring rules, reporting, and statistical analysis. Such problems are discovered as CLAS or some similar assessment stumbles upon them, and that is why each trial run is a major learning experience. Many of the shortcomings of CLAS-1993 are of this second type, not to be criticized as deficiencies in management. CLAS-1994 had made some improvements before our Committee came on the scene, and accepted some suggestions from us at our first meeting. The Committee could not review changes being made with each passing week, but the activity is evidence that CLAS is moving toward maturity.

ABBREVIATIONS AND CONVENTIONS

	A number of short expressions and acronyms will be used in order to shorten the report. This is a reference list.
CLAS	Refers to the California Learning Assessment System as an organization, or to the data-collection survey of any year.
CLAS-1993	Refers to the assessment conducted in 1993 and reported in 1994.
CLAS-1994	Refers to the assessment for which data have been collected in 1994.
GA	“Grade and area” such as Grade-4 Reading. RD-4 (for example) refers specifically to that grade and area.
MC	Multiple-choice.
CR	Constructed-response.
OE	Open-ended (tasks calling for written responses or observable performance).
Task	Refers to a subdivision of CLAS that is assigned a score. It may be an MC or CR item, or an OE exercise.
Scoring	Refers to the judgmental rating of OE and CR responses. (MC tasks are scored by machine and thus require no discussion.)
PL	Performance Level, the 1-to-6 scale used in reports on schools and individuals. An intermediate value such as 3.6 is used in describing the hypothetical average performance (“true PL”) that extremely thorough measurement would report.
PAC	Percentage of students in a school who are Above a Cut level (e.g., 4+ refers to students scoring at 4, 5, and 6 combined, thus, “above 3.5”).
SE	Standard error. May refer to the standard error of a PAC, of a student's PL score, or of a school average on the PL scale.
SIF	Student Information Form.

The Legislature and the Department of Education took a risk when they agreed that the 1993 practice run would be the basis for a public report on performance in each school. The time schedule and budget, combined with the profession's lack of experience with performance assessments, made errors inevitable. As an offsetting benefit, carrying out a “real” assessment undoubtedly elicited more cooperation from teachers than a “pilot study” would have. Also, it served to expose the public reports to instructive criticism. With CLAS's 1993 resources and experience, it was a major accomplishment to put in place a first approximation to new-style assessment of schools, districts, and the State.

To put our analysis of technical matters into perspective, let us note some other features of the program. Tasks presenting thought problems require more class time than old-style many-to-the-page questions. For example, to test understanding of a short story takes longer than testing comprehension of isolated paragraphs. The testing would not have been carried out as successfully as it was without a substantial effort to develop understanding and support among administrators and teachers.

Especially significant for our analyses is the matrix or spiral design, a practice that was used in CAP and is adopted in many current performance assessments. In any grade and area, several alternative test booklets are prepared. In CLAS-1993 Reading, for example, there were 6 booklets, each presenting a different selection to be read, and questions pertinent to it. Within a classroom, about one-sixth of the students worked on each booklet. This has the virtue of bringing a greater variety of tasks into the school-level assessment, thus better representing the curriculum.

It is reasonable to think that CLAS has been improving instruction. One aim is to make test exercises as nearly as possible like a segment of good instruction. California teachers developed the new exercises cooperatively, and in the process had a chance to learn about the new emphases and about activities they could use in teaching. Also, teachers who took part in scoring regarded it, we are told, as high-grade in-service training. They were able to see concretely what qualities they should be demanding from students' responses in their own classrooms.

HOW THE COMMITTEE PROCEEDED

The Committee was appointed by the Superintendent of Public Instruction late in April 1994. It was given a broad charge: to evaluate in particular the scoring system that led to reports on performance of schools in 1993, but also to comment on CLAS-1993 reliability and validity and then evaluate the planned 1994 methodology and suggest improvements for the future.

In the first weeks following the release of 1993 school data, the reports for some schools had been found unsatisfactory. When the Committee met for the first time, additional scoring was in progress for schools that had received moot reports. The tests for 1994 had already been distributed to schools, and some fraction were being given each day. We proposed immediate alterations of 1994 plans to the CLAS leadership wherever we saw a chance for timely improvement. In particular, the Committee outlined a major change in the selection of responses to be scored.

We requested statistical analyses to obtain sounder estimates for facts that had been reported, and to probe into additional questions. We have been pleased by the readiness of the contractor, CTB, to do this work rapidly. Our assistant Haggai Kupermintz turned out complex analyses with great skill as deadlines approached, and provided valuable suggestions. We acknowledge also the help of David Wiley, a CLAS consultant, with analyses and their interpretation. We are grateful to the staffs of CLAS and CTB for the frankness with which they shared the insiders' view with us.

Although we found fault with some previously completed analyses, we do not disparage the quality of CLAS's technical advisers. The statistical-psychometric experts who make up part of the CLAS Technical Advisory Committee are highly able, and they have made sound and creative recommendations. Still, too many problems escaped attention, and some analyses, once made, were not sufficiently reflected upon. Part of the trouble comes from over-reliance on hastily called meetings with crowded agenda. Beyond that, the Technical Advisory Committee is asked to deal with broad issues and therefore properly has a broad membership. Concentrated and continuous technical study is needed. Without a full-time technical coordinator in Sacramento, drawing on advice of an expert study group, CLAS will be unable to recognize and resolve the problems these innovative assessments encounter.

THE NUMBER DESCRIBING SCHOOL PERFORMANCE AND ITS UNCERTAINTY

A recommendation on school reports

For any school, grade, and content area, CLAS-1993 reported the percentage of students at each of six Performance Levels (PL). But when Golden Poppy Elementary School, for example, wishes to judge its performance against that of comparable schools, it should not ask whether the percentage of its students at PL 4 is reliably different from the percentage at 4 in the other schools. The natural way to answer “Did we do better?” is to compare across schools the combined percentage at levels 4, 5, and 6. This is symbolized hereafter by “4+” and designated PAC for “Percent Above Cut.” For technical reasons, 4+ implies a cut at 3.5. The cut for a PAC could be placed anywhere along the scale.
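The arithmetic behind a PAC can be sketched in a few lines. The function name `percent_above_cut` and the sample data below are hypothetical. Unscorable responses are represented by `None`, kept in the denominator, and counted as below the cut, which is the convention the Committee describes for its own studies later in this report.

```python
from typing import Optional, Sequence


def percent_above_cut(levels: Sequence[Optional[int]], cut: float = 3.5) -> float:
    """Return the percentage of scored students whose PL exceeds `cut`.

    `None` stands for an unscorable response; it stays in the denominator
    and is counted as below the cut.
    """
    if not levels:
        raise ValueError("no scored students")
    above = sum(1 for pl in levels if pl is not None and pl > cut)
    return 100.0 * above / len(levels)


# Hypothetical scored sample for one school, grade, and area.
sample = [2, 3, 3, 4, 4, 5, 1, None, 3, 6]
pac_4_plus = percent_above_cut(sample, cut=3.5)  # levels 4, 5, and 6
pac_3_plus = percent_above_cut(sample, cut=2.5)  # levels 3 through 6
print(f"4+: {pac_4_plus:.0f}%  3+: {pac_3_plus:.0f}%")
```

Moving the cut changes only the `cut` argument, which is the sense in which the cut for a PAC could be placed anywhere along the scale.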

The PAC serves better as a summary than the list of percentages. It allows more compact reports that are more likely to be read attentively. The reader does not have to do mental arithmetic to see what the list of six numbers “adds up to.” If CLAS wishes to report in percentage terms, we recommend that future school reports feature one or two PAC values in an area. In view of the distribution of PLs at this time, cuts at 2.5 and 3.5 (i.e., 3+ and 4+) seem suitable. We understand the reluctance of CLAS to pick any one cut such as 3.5 as a seeming “standard”. Although more than two cuts could be used in reporting, such complexity seems inadvisable.

An alternative to reporting the PAC is to report the average of student scores. The philosophy of distinct levels dominates the text of the school report, so the Committee focused on PAC. However, it also did small studies of the average. There is reason to think that the average is more accurately determined than the percentage. CLAS should study further the possibility of emphasizing means in school reports.

We recommend reporting a confidence band or standard error—concepts we shall explain shortly—for the PAC, or for the average if that becomes the summary statistic. These indicate the margin of error in the school report. We recommend reporting for a school the SE or confidence band typical of schools of its size that administered as many test forms as it did. Correspondingly, we recommend against calculating a school-by-school SE based entirely on the limited information from the particular school (as CLAS did in 1993).

We recommend that in the future the set of percentages for separate PLs be made supplementary. They are inaccurate and distracting in news releases, but possibly useful to a school. SEs should not be attached to them.

A single cut-point is sufficient for the analyses in this Committee's report; adding a second cut would not change the story appreciably. After examining the PL distributions, we requested analyses at 4+ in Reading and Writing, a level reached by 30-40% of students Statewide. Because scores ran lower in Mathematics, we analyzed at 3+ (about 25%).¹ We shall review later the features of the Mathematics scale that held scores down.

We recommend considerably greater attention to helping the public understand the uncertainty of CLAS reports. In the voluminous packet of materials intended for public-information functions in 1993—including a full script with 30 displays for projection, prepared for use by school officials in meetings for parents and community groups—just one hint was given that any uncertainty attaches to CLAS results. That hint? In a facsimile of a school report (transparency 26), “{Std. error}” columns appear, unexplained, alongside the “%” columns.

Standard errors and confidence bands

Types of error

All our comments on sampling and reliability rely on “standard errors” (SEs). Laymen encounter this concept whenever a public opinion poll says that the percentage of voters favoring a candidate “is accurate within 4 per cent.” Uncertainty attaches to any measurement or survey; and the SE or the related confidence band describes the uncertainty.

Student scores are subject to measurement error. As a consequence, a retest on the very same test booklet is unlikely to report exactly the same score for a student. Moreover, a student will be handed one or another of the test forms, and, because of its level of difficulty or its fit to his or her particular profile of competence, it will yield a different score than another selection would have. These variations in individual scores lead to unwanted variation in school scores. Beyond that, school PACs are imprecise because of the sampling of students, when not all students are scored.

¹	Source: Specimen 1993 school reports.

School scores are subject to a measurement error distinct from variation among its pupils. This error, the “school-form interaction”, is sometimes much larger than the other contributions to the SE. We must return to this topic, but let us explain the concept briefly. School curricula differ; scores in a school will be higher if topics and skills a test form assesses have been covered by the school. Science offers clear examples. The tasks in any year are only a sample from the topics within the California framework. If ecology is emphasized more in some schools than in others, a question in that subarea will make the former schools look better than they would have had a task on, say, the solar system been used instead.

Properly calculated, the SE of the PAC in any grade and area will estimate the combined effects of sampling error and measurement error. The formulas hitherto used to obtain SEs in educational measurement either do not attend to both kinds of errors simultaneously, or combine them in a manner unsuited to the CLAS design. We therefore developed and applied a new procedure. For our reasoning, see the Endnote.

A most important side remark: This is one of many places where we stumbled upon a problem at the very frontier of psychometrics. Solving these one by one should improve future CLAS reports considerably. Although not all problems will be solved, something is gained just by recognizing stumbling blocks. Members of the Committee plan to file with CLAS, as time permits, statements pointing out unresolved technical questions. These will not have had formal Committee review and are not part of the report.

A target for accuracy in school reports

We come now to the Committee's most critical decision: the target we set for accuracy in CLAS reports on schools. This decision was reached unanimously after thorough discussion at the first meeting of the Committee. The reader can judge, from the last sentence of the following paragraph, whether the confidence band we call for is too narrow, too broad, or about right. In the long run, this is a decision for schools and public officials to make; narrower bands are achieved primarily by increasing testing time and scoring effort.

We set an SE of 2.5% as a reasonable target for the accuracy of a PAC. The 2.5% guideline is not hard and fast. The State should be comfortable with SEs near that level, but this report would change only in detail if we had chosen 3% or 3.5% as a benchmark. An SE of 2.5% implies a “90-per-cent confidence band” of ±4.1% for the true percentage P at and above 4 in, say, Writing in Grade 4 in any school.² That means, for example, that in a school where the observed p is 36%, the true P can plausibly be as large as 36 + 4.1 = 40.1%. If P = 40.1%, repeated measurement using the identical design would be expected to yield an observed PAC of 36% or less no more than 5 times out of 100. Similarly, the smallest plausible P is 36 − 4.1 = 31.9%. If that is the true P, p's at or above 36% will occur less than 5% of the time. Thus one can have 90% confidence that the reported percentage will not deviate from the true PAC by more than 4.1%. A case can be made for reporting, instead of the uncertain 36%, the near-certain band: 32%-40%.

To illustrate this key concept further, suppose SE = 5% in a school where PAC is reported as 36%. The report is saying only that there is good reason to believe that this school's PAC this year truly lies in the range 28%–44%. Is so imprecise a finding worth the CLAS effort?
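The two bands just quoted follow from one multiplication. The sketch below is only an illustration of that arithmetic: it uses the normal multiplier 1.64 given in footnote 2, and the function name `confidence_band` is hypothetical.

```python
Z_90 = 1.64  # normal multiplier for a 90-per-cent band, as given in footnote 2


def confidence_band(observed_pct: float, se_pct: float, z: float = Z_90):
    """Return the (low, high) limits of the band around an observed PAC."""
    half_width = z * se_pct
    return observed_pct - half_width, observed_pct + half_width


low, high = confidence_band(36.0, 2.5)
print(f"SE 2.5%: {low:.1f}% to {high:.1f}%")

wide_low, wide_high = confidence_band(36.0, 5.0)
print(f"SE 5.0%: {wide_low:.1f}% to {wide_high:.1f}%")
```

With an observed PAC of 36%, an SE of 2.5% gives limits of 31.9% and 40.1%, and an SE of 5% gives 27.8% and 44.2%, which round to the 32%-40% and 28%–44% bands cited in the text.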

Before turning to larger issues, we comment on a CLAS-1993 practice. In the report of 6 PL percentages for a school in any grade and area, 6 corresponding SEs were also reported. These SEs are next to uninterpretable³ because the percentages are not independent. When one goes up, another must go down.

NONRESPONSE BIAS

The Legislature asks CLAS to assess schools by testing all students, or to come as close as possible to that goal (after exempting, for example, recent immigrants whose native language is not English). The concern of lawmakers is heightened by tales of schools that try to enhance their image by encouraging less able students to stay home on State-assessment days.

The ambition to test everybody can never be realized perfectly, but CLAS may have missed too many students. Some eligible students did not fill out a Student Information Form (SIF) and took no tests; no one knows the number of these. Some students were not administered a test because a teacher found the schedule too crowded. The booklets or answer sheets of others were separated irretrievably from the corresponding SIF somewhere between the classroom and the scoring table. There were recognizable nonparticipants, who filled out the SIF on the first day of assessment and then were absent from one or more tests in the following days. Finally come the “invisible” test takers to whom we return at the end of this section. Our figures on nonresponse are far from precise. They cannot take into account students who filled out no SIF. The figures have to count as nonrespondents students who failed to attach a barcode label so that the response could be matched to the SIF, and even students whose teacher failed to give a test.

²	The figure 4.1 = 2.5 × 1.64, the multiplier 1.64 coming from the normal probability distribution. In scientific reports a higher multiplier, corresponding to “95-per-cent confidence” is generally used.

³	To speak more precisely, an interpretation would have to be so rooted in higher mathematics as to be useless to educators and lay persons.

According to the best data available, what proportion of eligible students completed the 1993 tests? Although there was a high Statewide rate of participation in Grade 8 (89%-92%), the rate in Grade 10 was significantly lower (81%-84%).⁴ Table 1 gives a detailed breakdown. The worst result is in WR-10, where the participation rate was below 81% in nearly one-fifth of the schools. The problem arose largely in continuation schools, which made up 75% of the small high schools (N less than 80). Irregular attendance, and hence missed tests, is to be expected in those schools.

Nonresponse can bias the statistics on a school and is therefore a matter for concern. Nonrespondents often differ from respondents with respect to what the survey measures. It is likely that tenth graders who did not participate tend to come from homes that value education less than the average family. Nonparticipants, if tested, would probably have scored lower on average than tenth graders who did participate.

CLAS should obviously continue to try to reduce nonresponse. Beyond that, sound statistical techniques for reducing bias due to nonresponse are available. The essential idea is to group students by characteristics reported in the Student Information Form, calculate a score distribution for participants from each group whose papers were scored, and then combine the group distributions into one for the school using weights proportional to the number of students in each group who supplied an SIF.
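A minimal sketch of this weighting idea follows, under simplifying assumptions: it adjusts a single PAC rather than a full score distribution, it takes the groups as already defined from SIF characteristics, and all names and data are hypothetical. It is not the procedure CLAS or CTB would necessarily adopt.

```python
from collections import defaultdict


def adjusted_pac(scored, sif_counts, cut=3.5):
    """Weight each group's PAC by its share of Student Information Forms.

    scored: list of (group, performance_level) pairs for scored papers.
    sif_counts: dict mapping group -> number of students who supplied an SIF.
    """
    levels_by_group = defaultdict(list)
    for group, level in scored:
        levels_by_group[group].append(level)

    total_sif = sum(sif_counts.values())
    pac = 0.0
    for group, n_sif in sif_counts.items():
        levels = levels_by_group.get(group)
        if not levels:
            raise ValueError(f"no scored papers for group {group!r}")
        group_pac = 100.0 * sum(pl > cut for pl in levels) / len(levels)
        pac += group_pac * n_sif / total_sif
    return pac


# Hypothetical school: groups A and B each supplied 50 SIFs, but group B
# is under-represented among the scored papers.
scored = [("A", 4), ("A", 5), ("A", 3), ("A", 4), ("B", 2), ("B", 4)]
sif_counts = {"A": 50, "B": 50}

unadjusted = 100.0 * sum(pl > 3.5 for _, pl in scored) / len(scored)
adjusted = adjusted_pac(scored, sif_counts)
print(f"unadjusted {unadjusted:.1f}%  adjusted {adjusted:.1f}%")
```

In this invented case the lower-scoring group is under-represented among scored papers, so the weighted figure falls below the unweighted one. Whether real adjustments are large enough to matter is an empirical question.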

We recommend that CLAS use 1993 data to find out how much difference such an adjustment can make, verifying whether the adjustment is likely to be worthwhile in 1994 and thereafter.

⁴	Source: Specimen school reports, p. 1. Information on Grade 4 is unsatisfactory. The count reported for the State, in this source and in Table 1, is the number of “booklets” identifiable with students. In Grade 4 a single booklet included tests in three areas. A student who took only one of the three tests, because of absence, was counted as a participant in all three columns. Table 1 gives a count for Reading which is probably only slightly inflated, as the first day of testing was used for Reading.

Table 1. Number of Schools at Each Level of Student Participation

Grade 4	Response rate	Reading^a
	< 50%	123	( 2.6%)
	51-60%	52	( 1.1%)
	61-70%	82	( 1.7%)
	71-80%	158	( 3.3%)
	81-90%	354	( 7.5%)
	91-100%	3974	(83.8%)
Total		4743
Grade 8	Response rate	Reading		Writing		Mathematics
	< 50%	20	( 1.3%)	33	( 2.1%)	20	( 1.3%)
	51-60%	6	( 0.4%)	8	( 0.5%)	6	( 0.4%)
	61-70%	8	( 0.5%)	11	( 0.7%)	8	( 0.5%)
	71-80%	31	( 2.0%)	52	( 3.3%)	31	( 2.0%)
	81-90%	151	( 9.5%)	254	(16.0%)	151	( 9.5%)
	91-100%	1367	(86.4%)	1225	(77.4%)	1367	(86.4%)
Total		1583		1583		1583
Grade 10	Response rate	Reading		Writing		Mathematics
	< 50%	28	( 2.4%)	41	( 3.5%)	28	( 2.4%)
	51-60%	16	( 1.3%)	19	( 1.6%)	16	( 1.3%)
	61-70%	19	( 1.6%)	46	( 3.9%)	19	( 1.6%)
	71-80%	53	( 4.5%)	109	( 9.2%)	53	( 4.5%)
	81-90%	176	(14.8%)	290	(24.4%)	176	(14.8%)
	91-100%	895	(75.4%)	682	(57.5%)	895	(75.4%)
Total		1187		1187		1187

Response rate is defined as Number of students having live booklets divided by (Number of Student Information Forms – Number of Limited English not tested).

^a	See text footnote 4.

Source: Printout from CTB, June 14, 1994.
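As a check on how Table 1 is read, the sketch below applies the response-rate definition to invented counts and recomputes the percentages for the Grade 10 Writing column from the school counts printed in the table. The function and variable names are hypothetical; only the six counts come from the table.

```python
def response_rate(live_booklets, sif_forms, limited_english_not_tested):
    """Response rate in percent, following the definition under Table 1."""
    eligible = sif_forms - limited_english_not_tested
    if eligible <= 0:
        raise ValueError("no eligible students")
    return 100.0 * live_booklets / eligible


# Invented school: 200 SIFs, 10 Limited English students not tested,
# 171 live booklets.
example_rate = response_rate(171, 200, 10)

# Grade 10 Writing column of Table 1: schools per response-rate bracket.
brackets = ["< 50%", "51-60%", "61-70%", "71-80%", "81-90%", "91-100%"]
wr10_counts = [41, 19, 46, 109, 290, 682]

total_schools = sum(wr10_counts)
shares = [round(100.0 * c / total_schools, 1) for c in wr10_counts]
# Schools in the four lowest brackets, i.e. a response rate of 80% or less.
below_81 = 100.0 * sum(wr10_counts[:4]) / total_schools

for label, count, share in zip(brackets, wr10_counts, shares):
    print(f"{label:>8}  {count:5d}  ({share:4.1f}%)")
print(f"below 81%: {below_81:.1f}% of {total_schools} schools")
```

The four lowest brackets together hold about 18% of the 1187 schools, which is the "nearly one-fifth" cited in the text for WR-10.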

Some fraction of students took the tests, were chosen for scoring, and then were assigned the mark ‘X’ in lieu of a performance level.⁵ The reason for the X might be a blank response booklet, a paper filled with doodling, a response that missed the point of the problem and said nothing relevant, or use of a language other than English. The reader of a newspaper listing of the CLAS score distribution would never realize that there was such an additional category. Nor were these students recognized in the school reports, save that a footnote in very small print remarks that if percentages at PLs 1 through 6 add to less than 100, the remaining cases were unscorable (giving some possible reasons). The cases were counted in the denominator for the school percentages. In calculating PAC for our studies, all the unscorable responses were counted as below the cut. These cases are in a sense nonparticipants, but they are not counted as such in Table 1.

If the students who failed utterly to answer the question had been placed at the lowest PL, some school distributions would properly have been shifted downward. We say a bit more about X cases when we come to scoring rules.

SAMPLING OF STUDENTS AS POLICY AND PRACTICE

Although the effort to test virtually all eligible students in the chosen grades was at least 80% successful in most grades and areas, only a fraction of the responses were scored and used in school statistics. The reason was simply the cost of scoring. Questioning of the sampling practiced in CLAS-1993 has taken three forms:

(i) Some critics doubt that reports based on samples can be valid.
(ii) There are those who consider the target samples, near 25% in some schools, too small to be dependable.
(iii) There are those who note that actual scoring fell short of the target number in some schools, rendering those particular reports suspect.

Criticism (i) we can dismiss outright. Sampling is a proven, well-recognized, and cost-effective method for surveying achievement. Most of the now-popular learning assessments in the United States and other countries do not attempt to test every student in a chosen grade in every school. For national or regional statistics, sampling does a fully adequate job at acceptable cost. At the school level also, sampling can provide sufficiently accurate information. (Naturally, when CLAS plans to report on every student in a grade, there is no place for sampling.)

⁵	Draft Technical Report, p. 2-7. We have no count of these cases, but in most schools X's accounted for 2% or less of the scored sample. This figure was inferred from a listing of counts for forms supplied by CTB on June 16, 1994.

CLAS laid a trap for itself when, in the public-information packet intended to develop understanding of the system, it omitted mention of its intent to score only a fraction of booklets. Thus a transparency on “1993 Scoring Process”, for use at parent meetings, makes no mention of sampling. That display might have been prepared prior to the decision to sample; but a 1994 specimen press release for use by district superintendents also fails to mention that not all students were scored. The only hint about sampling in the bulky packet is a reproduced page from a school report in which fine print, overshadowed by much forceful information, gives counts of students assessed and students scored. We noted earlier the failure to speak frankly about uncertainty. Obviously, the Committee recommends that CLAS tell the whole truth in its briefing materials.

It is reasonable to expect CLAS to obtain the information wanted by the public, teachers, educational administrators, and legislators without spending more than is necessary. The CLAS mandate in 1993 required testing all students in Grades 4, 8, and 10, but did not require scoring all responses. The mandate also required use of “performance-based” tasks, which call for written (open-ended, or OE) responses that can be scored only judgmentally. As CLAS-1993 developed, it became clear that the budget was insufficient for rating all the responses that had been collected. Except where the budget and the accuracy of the test allow reporting on every student, it will be necessary and proper to score only a fraction of OE responses.

The 1993 scoring targets

The total amount of scoring in 1993 was determined by the available budget and the anticipated per-paper cost of scoring. The 1993 sample was intended to be large enough to provide trustworthy estimates of the PL distribution that would have been obtained, had responses of all students in that grade and area of the school been scored. Criticism (ii) requires review of that plan. The sample size for a school should be guided primarily by the expected SE of key results. The sample size required to achieve a specified SE for an estimate of PAC depends on how much student PLs vary and on the number of students providing responses.
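For orientation only, the sampling part of that dependence can be written down with the textbook formula for simple random sampling without replacement from a school of N students. This is a sketch under that assumption, with hypothetical names and inputs. It deliberately ignores measurement error, so it is not the Committee's procedure and will understate the sample needed whenever measurement error is large.

```python
import math


def srs_se(p_true, n_scored, n_enrolled):
    """SE of a sample proportion under simple random sampling.

    Uses the approximation SE^2 = p(1 - p)/n * (1 - n/N), with no
    allowance for measurement error.
    """
    fpc = 1.0 - n_scored / n_enrolled
    return math.sqrt(p_true * (1.0 - p_true) / n_scored * fpc)


def srs_sample_size(p_true, se_target, n_enrolled):
    """Smallest n for which srs_se(p_true, n, N) does not exceed se_target."""
    n_infinite = p_true * (1.0 - p_true) / se_target ** 2
    n = n_infinite / (1.0 + n_infinite / n_enrolled)
    return min(n_enrolled, math.ceil(n))


# Hypothetical school of 200 students with a true PAC of 36%.
needed = srs_sample_size(0.36, 0.025, 200)
print(needed, round(100 * srs_se(0.36, needed, 200), 2))
```

The required n grows with p(1 − p), the student-to-student variation in being above the cut, and shrinks as the sample approaches the whole school, where the sampling error vanishes.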

Criticism (ii) turns out to be valid with respect to CLAS-1993, even though a 25% sample could be highly satisfactory in many surveys. This is not to say that CLAS made a mistake in choosing scoring targets. The amount of scoring CLAS could do was limited by the budget, and evidence to be presented in Table 8 indicates that the chosen targets distributed the affordable scoring in a sensible way.

The first row in Table 2 gives, for schools of a particular size, the sample size chosen as target by CLAS for 1993 scoring. It was guided by the principle of sampling all students in small schools, and sampling not less than 25% of the eligible population in large schools. Intermediate values were taken from a simple trend line stepping down from 100% to 25%.

The Committee attempted to estimate what targets would have held SEs for WR-4 near or below 2.5%, and found that this value could not be reached even with scoring of all students in a large school. The difficulty arises primarily from the attempt to measure a broad domain with only a few tasks. Row 2 indicates the sample sizes required to reach an SE of 5.5%, implying a 90-percent confidence band about 18 percentage points wide. CLAS failed to anticipate that the targets in the first row would lead to inaccurate results, primarily because measurement errors loom large in the CLAS data.⁶ Row 2 would give less distressing numbers if we had developed the figures for some other grades and areas; but we could not work out the companion Row 3 for Grade 10 or for MA-4; RD-4 would have given numbers much like those for WR-4. (We shall reach Row 3 shortly.)

Table 2. Samples Considered Necessary in Grade-4 Writing Under Alternative Rules

	Sample size recommended when school size is
	40	80	120	160	200	250	300
CLAS-1993 plan^a	40	46	46	48	50	64	74
Required for PAC SE < 5.5%^b	—	—	—	—	196	215	228
Required for ±0.2 band for average PL	—	69	78	84	88	91	93

— indicates that the specified SE cannot be reached with 100% scoring.

^a	Source: Draft Technical Report, Table 2.2.

^b	Calculations are based on components from Table 6 and Table 7.

Both sampling error and measurement error increase the uncertainty of a PAC and ought to be taken into account. The Committee recommends that to the degree practicable at this late date the scoring samples for Grades 4, 5, and 10 in CLAS-1994 be distributed so as to minimize the estimated standard errors. This may require more sampling in some areas than others. The appropriate numbers will vary from one grade and area to another. It should be emphasized that wherever the number of tasks spiralled in a school is appreciably increased in 1994, an estimate for 1994 sample sizes would be more encouraging than those in Table 2.

Assessments similar to CLAS have been trying to assess both kinds of errors and we have informal knowledge that they too have, very recently, become conscious of large measurement errors.

CLAS is here being judged by the accuracy of the PAC technique that it did not use in 1993. That was the proposal of this Committee, whereas CLAS had reported point-by-point percentages. Both of those kinds of report probably are subject to more uncertainty than a conventional average for the school on the 6-point scale. We have strong but incomplete evidence on that. We estimate a confidence band of ±0.2 on the 1–6 scale, for 1993 WR-4.⁷ When we carry out sample-size calculations analogous to those we made for the SE, we find that this accuracy can be attained with strikingly economical samples. See Row 3 of Table 2. The Committee cannot judge, however, whether the community can make better use of the PAC than the average. Also, the average makes the debatable assumption that steps along the PL scale are equal in importance.

⁶	“It was assumed in reporting the 1993 school-level results that sampling variability, rather than the PL measurement error, would have the greatest effect on the precision of the score reports.” (Draft Technical Report, pp. 4-5.) This assumption was incorrect.

⁷	The pilot study data for 1994 (printout from CTB dated June 2, 1994) provided variance components for studies in approximately 100 schools where subsets of students had been scored on two tasks from various forms. The medians were 0.015 for task, 0.032 for the school-form interaction, 0.296 for student within school, and 0.447 for the residual. Assuming 6 tasks and n = 50, N = 100, the SE² is 0.0025 + 0.0053 + 0.0030 + 0.0089 = 0.0197 and the SE is 0.14.
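
The arithmetic in footnote 7 can be retraced with a short calculation. This is a minimal sketch under stated assumptions: the divisors (number of tasks for the task and school-form components, number of scored students for the student and residual components, and a finite-population factor 1 − n/N on the student component) are inferred from the four reported terms rather than quoted from the CTB printout, and the function and parameter names are hypothetical.

```python
import math

def school_mean_se(var_task, var_school_form, var_student, var_residual,
                   n_tasks, n_scored, n_enrolled):
    """Return (SE, SE squared) of a school mean on the 1-6 scale.

    The way each variance component is divided is an assumption
    reconstructed from the terms reported in footnote 7.
    """
    fpc = 1.0 - n_scored / n_enrolled          # finite-population factor (assumed)
    se2 = (var_task / n_tasks                   # task component
           + var_school_form / n_tasks          # school-by-form interaction
           + var_student / n_scored * fpc       # students sampled within the school
           + var_residual / n_scored)           # residual, including scoring error
    return math.sqrt(se2), se2

se, se2 = school_mean_se(0.015, 0.032, 0.296, 0.447,
                         n_tasks=6, n_scored=50, n_enrolled=100)
print(round(se2, 4), round(se, 2))  # 0.0197 0.14
```

Under these assumptions the sketch reproduces the footnote's SE² of 0.0197 and SE of 0.14.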

Improvements in sampling in 1994 and beyond

Stratified sampling is usually more efficient than simple random sampling. The electronically scannable Student Information Forms provide information for stratifying. A stratified-sampling scheme was sketched by the Committee, and details were worked out by CTB. This plan, if executed well, provides greater accuracy than the 1993 plan offered.

The plan for CLAS-1994 (as of June 23) again calls for scoring at least 25% of booklets completed, regardless of the size of the school. The Committee recommends against this kind of flat-rate policy because it is not cost-effective. The Committee recommends that CLAS engage a survey statistician who is expert in sampling theory and practice to develop all CLAS sampling designs, methods of estimation, and methods of assessing the precision of all statistics. It might hire a consultant or charge its prime contractor to provide such expert service.

The shortfall in scoring

The sampling plan was not implemented exactly as planned for a number of schools. Selecting booklets for scoring ran into a variety of snags. Consistent with criticism (iii), Table 3 shows that scoring in some grades and areas ran much less than the target. Table 8, in a later section, testifies that scoring of 70% of target gave SEs not much worse than those from full-target scoring. SEs increased rapidly as scoring dropped below 70% of target. Table 3 does not subdivide this serious-shortfall category, but (according to a detailed tabulation not shown) those schools were predominantly in the 50-70% bracket. In 18 of 27 cells of the table, the percentage of serious underscoring is under 3.5%. The notably bad set of results in Grade 10 in small schools is another outcropping of the peculiarities of continuation schools.

The procedures for sampling booklets from each school need to be strengthened considerably for CLAS-1994. There has been a substantial change in physical arrangements, equipment, and record-keeping at CTB, and these, plus the fact that CTB is placing selection under computer control, should prevent recurrence of the 1993 shortfall.

Table 3. Booklets Scored as Percentage of Target

(Number of schools and percentage; by grade and school size)

Grade 4; 1-49 students
Percentage of target scored	RD	WR	MA
Less than 70%	27 ( 3%)	38 ( 4%)	24 ( 3%)
70-90%	108 (13%)	168 (19%)	166 (19%)
More than 90%^a	727 (84%)	656 (76%)	672 (78%)
Total^b	862	862	862
Grade 4; 50-99 students
Percentage of target scored	RD	WR	MA
Less than 70%	44 ( 2%)	54 ( 2%)	34 ( 1%)
70-90%	353 (14%)	741 (29%)	207 (8%)
More than 90%	2130 (84%)	1732 (69%)	2286 (91%)
Total	2527	2527	2527
Grade 4; 100+ students
Percentage of target scored	RD	WR	MA
Less than 70%	46 ( 4%)	56 ( 5%)	19 ( 2%)
70-90%	192 (16%)	302 (24%)	98 ( 8%)
More than 90%	1000 (80%)	880 (71%)	1121 (90%)
Total	1238	1238	1238
Grade 8; 1-69 students
Percentage of target scored	RD	WR	MA
Less than 70%	24 ( 5%)	45 ( 9%)	18 ( 4%)
70-90%	46 ( 9%)	141 (29%)	67 (14%)
More than 90%	417 (86%)	301 (62%)	402 (83%)
Total	487	487	487

Grade 8; 70-160 students
Percentage of target scored	RD	WR	MA
Less than 70%	2 ( 1%)	5 ( 2%)	3 ( 1%)
70-90%	18 ( 8%)	47 ( 21%)	20 ( 9%)
More than 90%	203 (91%)	171 ( 77%)	200 ( 90%)
Total	223	223	223
Grade 8; 161+ students
Percentage of target scored	RD	WR	MA
Less than 70%	5 ( 1%)	14 ( 2%)	20 ( 2%)
70-90%	33 ( 4%)	86 ( 10%)	126 (15%)
More than 90%	804 (95%)	742 ( 88%)	696 ( 83%)
Total	842	842	842
Grade 10; 1-79 students
Percentage of target scored	RD	WR	MA
Less than 70%	64 ( 14%)	131 ( 29%)	80 ( 18%)
70-90%	111 ( 25%)	151 ( 34%)	144 (32%)
More than 90%	271 (61%)	164 ( 37%)	222 ( 50%)
Total	446	446	446
Grade 10; 80-349 students
Percentage of target scored	RD	WR	MA
Less than 70%	6 ( 2%)	11 ( 3%)	12 ( 3%)
70-90%	30 ( 8%)	88 ( 25%)	101 (29%)
More than 90%	319 (90%)	256 ( 72%)	242 ( 68%)
Total	355	355	355
Grade 10; 350+ students
Percentage of target scored	RD	WR	MA
Less than 70%	0 ( 0%)	3 ( 1%)	11 ( 3%)
70-90%	8 ( 2%)	49 ( 13%)	164 (43%)
More than 90%	374 (98%)	330 ( 86%)	207 ( 54%)
Total	382	382	382
^aThe target was based on a count of SIFs.
^bThe total for the three sets of fourth grades here is 4656 schools, compared with 4743 in Table 1. Differences also appear in Grades 8 and 10. Schools participating in the pilot study, where all students were tested and scored, are excluded from Table 3.
Source: Printout supplied by CTB July 8, 1994.
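
The count of "18 of 27 cells" cited in the text can be checked directly against the "Less than 70%" rows of Table 3. The sketch below copies the rounded percentages from the table. The dictionary layout and the size labels "small", "medium", and "large" are only illustrative names for the three school-size brackets within each grade.

```python
# Percentage of schools scored at less than 70% of target, from Table 3.
# Keys: (grade, size bracket); values: (RD, WR, MA) rounded percentages.
under_70_pct = {
    (4, "small"): (3, 4, 3),
    (4, "medium"): (2, 2, 1),
    (4, "large"): (4, 5, 2),
    (8, "small"): (5, 9, 4),
    (8, "medium"): (1, 2, 1),
    (8, "large"): (1, 2, 2),
    (10, "small"): (14, 29, 18),
    (10, "medium"): (2, 3, 3),
    (10, "large"): (0, 1, 3),
}

cells = [pct for row in under_70_pct.values() for pct in row]
low = sum(1 for pct in cells if pct < 3.5)          # cells with little serious shortfall
worst = max(under_70_pct, key=lambda k: max(under_70_pct[k]))

print(len(cells), low, worst)  # 27 18 (10, 'small')
```

The tally agrees with the text, and it singles out the small Grade 10 schools as the bracket with the heaviest shortfall.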

OPERATIONAL PROBLEMS: THEIR NATURE AND CAUSES

CLAS operations are vast and complex. In over 8,000 schools, more than a million students completed the 1993 tests in Reading, Writing and Mathematics. Just under 1,400,000 booklets were scored (out of 3,300,000 booklets completed). The tasks performed from beginning to end included test-booklet development and production, production of other materials essential to the assessment (such as instructional manuals, answer sheets, Student Information Forms), packaging and shipping of test booklets and other materials to each school district; administering the tests, then assembling and shipping completed test booklets and other materials to processing centers; scoring, data processing, and statistical analysis; and finally report generation, production, and dissemination. In such a large undertaking, every step in data development is critical and there should be quality control at every point. Operations did go awry in CLAS-1993.

Loss of data

Operational problems caused serious loss of critical documents (test booklets, multiple-choice answer sheets, school or class headers, and some others). A document was not returned by a school, or was lost in shipment, or was misplaced or lost by a contractor. And, among documents received, incorrect or missing barcodes sometimes prevented linking of SIFs to student responses; then the student could not be counted in a school score.

We are told that some 270 schools, which suffered data losses for such reasons, were identified for special review in late stages of the 1993 reporting process or when the reports were questioned after release. Approximately 200 of those schools have received or will soon receive revised reports. The remainder are divided between schools where information losses could not be repaired and schools where it appeared that revision was unnecessary.

For CLAS-1993, the data-processing system at CTB was geared toward producing a report for each of about 8,000 schools. Under the pressure of completing the effort, CTB's automated data processing produced and disseminated some reports based on inadequate samples. These reports should have been issued only after correction (late if necessary) or not at all. Criteria should be established for the computer to determine in advance which school and other reports are ready for printing and release, and which are to be held up for further review and correction, if needed, before release.
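
A release criterion of the kind recommended here might look like the following. This is a sketch only, not a description of any CTB system: the record fields, the function name, and both thresholds are hypothetical. The 70% figure echoes the earlier observation that SEs increased rapidly once scoring fell below 70% of target. The limit on unlinked documents is an arbitrary placeholder.

```python
from dataclasses import dataclass

@dataclass
class SchoolBatch:
    school_id: str
    target_booklets: int      # scoring target for the school
    scored_booklets: int      # booklets actually scored
    unlinked_documents: int   # documents whose barcode could not be matched

def release_decision(batch, min_fraction=0.70, max_unlinked_fraction=0.05):
    """Return ('release' or 'hold', list of reasons). Thresholds are illustrative."""
    reasons = []
    if batch.target_booklets <= 0:
        reasons.append("no scoring target recorded")
    else:
        if batch.scored_booklets / batch.target_booklets < min_fraction:
            reasons.append("scored sample below minimum fraction of target")
        if batch.unlinked_documents / batch.target_booklets > max_unlinked_fraction:
            reasons.append("too many unlinked documents")
    return ("hold" if reasons else "release"), reasons

print(release_decision(SchoolBatch("A", 50, 48, 0)))
print(release_decision(SchoolBatch("B", 50, 30, 4)))
```

The point of the design is that every held report carries its reasons, so that review can be directed to the specific defect before release.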

Breakdowns in the management of documents

The plan relied on barcodes as the principal means of collating SIFs, answer sheets, and test booklets from the same student, and linking them with the school. Without that linkage, the student's responses could not contribute to the school report. The operation was defective from beginning to end. For example, the production line for the labels had to be stopped and restarted on occasion. It was the printer's routine to start the new run with duplicates of the last several barcodes of the previous run. Those duplicates were not systematically weeded out, creating problems at the match-up stage. Barcode numbers were sent to schools unsystematically, so that there was no way to recognize from the barcode what school a stray student paper came from. And, at the end of the process, the technology for scanning information into the computer occasionally created obstacles.

In some schools students attached their own barcode labels, in some the teachers did the job. For whatever reason, some documents lacked proper labels. The school manuals did not demand that labels be checked. The plan for 1994 is somewhat better. Each student attaches the labels to his or her own test materials at a single sitting and enters his or her name on each item. However, this is not a dependable system. Again, the school manual requires no independent check on the accuracy of labeling.

Although the district-coordinator manual described in detail the process for checking in test materials received from the producer and for distributing them to the schools in the district, and for getting completed materials from the classroom to CTB, no procedure was put in place for independent checking. The same is true of the packing and shipping of test materials by each school.

CTB has made a number of changes in its physical arrangements for receiving materials and entering information into the computer. These should solve a significant fraction of the problems in their operation. In the long run, CLAS needs to build identification of school district, school, student, and document category into the barcode. We understand that this may be impractical until a data base for California schools is more fully developed; and we have not thought through the likely cost of such a system.

Another needed control is brought to light by a further example. The 1993 plan called for scoring only one of two open-ended Math problems. Random selection of the problem to score for each student was not realized; the process actually used could have introduced systematic error. The Committee urges that random selection of scoring samples be done by computer, with near-equal representation of all test forms in each school.
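
Computer selection with near-equal representation of forms could be sketched as below. This is one possible scheme under stated assumptions, not the procedure CTB adopted: the function name, the input layout (pairs of booklet identifier and form), and the round-robin rule are all hypothetical. A fixed seed makes the draw reproducible for audit.

```python
import random
from collections import defaultdict

def select_for_scoring(booklets, n_to_score, seed):
    """Randomly pick booklets for one school, balancing the test forms.

    booklets: list of (booklet_id, form) pairs.
    Returns a list of (booklet_id, form) pairs.
    """
    rng = random.Random(seed)
    by_form = defaultdict(list)
    for booklet_id, form in booklets:
        by_form[form].append(booklet_id)
    for ids in by_form.values():
        rng.shuffle(ids)                  # random order within each form
    forms = sorted(by_form)
    rng.shuffle(forms)                    # random order decides which forms get an extra booklet
    selected = []
    # Round-robin over forms: per-form counts differ by at most one
    # as long as every form still has unselected booklets.
    while len(selected) < n_to_score and any(by_form[f] for f in forms):
        for f in forms:
            if len(selected) == n_to_score:
                break
            if by_form[f]:
                selected.append((by_form[f].pop(), f))
    return selected

booklets = [(i, "ABCD"[i % 4]) for i in range(40)]
sample = select_for_scoring(booklets, 10, seed=1994)
print(len(sample))  # 10
```

The same idea applies to choosing which of the two open-ended Math problems to score for each student: the choice is left to a seeded random generator rather than to any clerical routine.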

Recommendations on administration and quality control

The management structure

Large projects under government auspices are usually carried out under the leadership of a prime contractor. Subcontractors carry out subtasks for which the prime contractor lacks adequate expertise or sufficient staff. The prime contractor has overall responsibility to meet specifications set in the contract and must hold any subcontractor responsible for delegated operations. That is, a clear management structure is laid out, with subcontractors responsible to the prime contractor for completing their respective duties at specified levels of quality, and the prime contractor responsible to the government for quality in all the work.

CLAS-1993 knew that it required wide capabilities, yet its management structure was not much like the usual model. There was no prime contractor; instead, at least three separate contracts were awarded. These were (i) a contract with The Psychological Corporation to print test booklets and other assessment materials and to package and ship them to school districts; (ii) a contract with the Sacramento County Schools to operate scoring centers, and (iii) a contract with CTB to process the data, and to generate and disseminate school and district reports. CLAS staff undertook to coordinate the work of the three contractors and provide overall management. CLAS staff had not only to coordinate, but to monitor the quality of the work and the products of each contractor. Much detailed supervision was needed — a dubious use of valuable CLAS staff time. Since the CLAS staff is small, it should concentrate on those things that only it can do.

We recommend the structure of prime contractor plus subcontractors, with the prime contractor responsible for monitoring subcontractors. This model would permit CLAS staff to focus on efforts to improve the design of the assessments, their quality and the efficiency with which they are carried out.

Quality control

We judge that many of the operational problems that occurred would have been avoided if explicit quality control procedures had been in place and their adequacy reviewed periodically by CLAS staff. CTB should have a data-receipt and document-control-and-storage system that accounts for each and every document, where it is stored, and where it stands with respect to each step in the operation, including sampling and scoring. Future contracts should require each contractor to develop and implement quality-control and/or quality-assurance procedures for each distinct task. CLAS would do well to require the prime contractor to provide periodic reports on project progress vis-a-vis deadlines, and on the extent to which quality is reaching contractually specified standards.

To summarize, the Committee judges that considerable improvements are possible in CLAS-1995, and some of these are moving forward in 1994. Specifically, we advise:

adopting a prime/subcontractor model,
developing an overall survey-control system involving the contractors, CLAS, the school districts, and schools,
developing and implementing an automated data-receipt and document-control system,
installing quality-control procedures to monitor key operational steps carried out by contractors and the educational system,
making all statistical design, sampling, and estimation the responsibility of an expert, PhD level, survey statistician,
addressing the issue of nonresponse bias in reported statistics,
developing procedures responsive to the needs and concerns of those asked to supply the data, in the hope of gaining a higher level of cooperation,
requiring printed barcodes on each CLAS document, where the barcode is used to identify the nature of the document, the school and district, and, for tests and SIFs, the student.
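
The last recommendation can be made concrete with a small sketch of a self-describing barcode. Everything about the layout is an assumption made for illustration: the field widths, the single-letter document codes, and the function names are hypothetical, and no CLAS specification is implied. The duplicate check addresses the 1993 problem of repeated barcodes at the start of a print run.

```python
from collections import Counter
from typing import NamedTuple

class DocumentId(NamedTuple):
    district: str   # 4 digits (width is an assumption)
    school: str     # 4 digits
    doc_type: str   # 1 letter, e.g. "B" booklet, "S" SIF, "A" answer sheet
    student: str    # 5 digits; "00000" for school-level documents

def encode(doc):
    """Concatenate the fields into a 14-character code."""
    return f"{doc.district}{doc.school}{doc.doc_type}{doc.student}"

def decode(code):
    """Split a 14-character code back into its fields, rejecting malformed codes."""
    if len(code) != 14 or not (code[:8] + code[9:]).isdigit() or not code[8].isalpha():
        raise ValueError(f"malformed barcode: {code!r}")
    return DocumentId(code[:4], code[4:8], code[8], code[9:])

def duplicates(codes):
    """Return codes that appear more than once in a print run or shipment."""
    return sorted(c for c, n in Counter(codes).items() if n > 1)

doc = DocumentId("0123", "0456", "B", "00789")
print(encode(doc))  # 01230456B00789
```

With such a code, a stray paper can be traced to its school from the barcode alone, which was not possible with the unsystematic numbering of 1993.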

ANALYSIS AND REPORTING AT THE SCHOOL LEVEL

The panel was asked to consider the validity and reliability of 1993 and 1994 information. In discussing 1993 findings it will often be convenient to discuss aspects of the 1994 plan that is developing. Some aspects of CLAS-1994, especially accuracy of scores at the student level, are reserved for a later section.

Validity issues

This type of assessment would be judged valid (i) if the test is reporting accurately on the competences students should be acquiring, (ii) if irrelevant features of test tasks and the conditions surrounding them are not making scores better or worse than the students' competence justifies, and (iii) if the reports help educators and officials to improve education.

(i) The public through its representatives and the profession through its working groups determine the goals of education. Opinions differ; controversy is inevitable. A technical panel is not the right body to decide whether CLAS content is in line with the established framework for California education. It is our sense, from press accounts, that the complaints about CLAS-1994 have been dissents about curriculum goals rather than about the fidelity of the tests to the frameworks. The extensive participation of teachers in test development should contribute to fidelity.

There was no formal review in 1993 and 1994 to identify questions likely to be protested when their content became public.⁸ A number of questions touched a nerve in one or another community. The 1994 complaints have reminded Californians of the ill-defined border between political sensitivity and censorship. The dilemma is to be resolved by political mechanisms, not by measurement professionals. The issue is not primarily one of CLAS. CLAS tasks are intended to measure the kinds of classroom exercises that leaders of the profession are advising teachers to use, so as to bring students' thinking to a higher level. Any protest regarding the type of exercise, then, expresses an opinion on the curriculum.

⁸	The panel understands that such a review is being put in place for CLAS-1995.

We have no reason to think that the protested tasks were invalid, but some parents did threaten to refuse to let their children take the test. If there were many defections in some school, the sample became unrepresentative and the assessment of the school in 1994 was somewhat invalid. The extent and impact of defection in 1994 should be tracked.

(ii) Validity was improved by a careful review process to weed out cultural biases. Tasks also went through a tryout process which could have detected unintended sources of difficulty.

As an example of “conditions surrounding tests” we mention the pattern being used in Language Arts: a first session in which students read a selection and write briefly to show their understanding, a second “group work” activity on the theme of the selection, and a third class period devoted to an extended written exercise developing the theme. The group activities may enhance readiness for the Writing test or may engender confusion. This is no more than an example of the fact that novelties in assessment should be studied before a claim to validity is pressed.

With regard to multiple-choice tests likewise, we note that sometimes a teacher who abhors “guesswork” trains students never to mark a choice when uncertain; this will hold down their scores, because a considered choice is rarely guesswork. Scores for schools or individuals will be comparatively invalid if the omission rate is much above average. A minimum safeguard when student scores are reported is to flag students with (say) 30% omissions.
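
The omission safeguard is simple to automate. The sketch below assumes a hypothetical layout in which each student's multiple-choice responses are held in a list, with None marking an omitted item. The 30% threshold is the illustrative figure from the text, not an established rule.

```python
def omission_rate(responses):
    """Fraction of items left unmarked (None) in one student's response list."""
    return sum(r is None for r in responses) / len(responses)

def flag_high_omitters(students, threshold=0.30):
    """Return IDs of students whose omission rate reaches the threshold."""
    return [sid for sid, responses in students.items()
            if responses and omission_rate(responses) >= threshold]

students = {
    "s1": ["A", None, "B", None, None, "C", None, "A", "B", "C"],  # 4 of 10 omitted
    "s2": ["A", "B", "C", "D", "A", "B", "C", "D", "A", "B"],      # none omitted
}
print(flag_high_omitters(students))  # ['s1']
```

The same rate, averaged over a school, would show whether a school's omission rate is much above the State average.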

(iii) The reports sent to individual schools were extensive and informative. To learn whether CLAS accomplished its purpose, it will be necessary to go beyond CLAS output and learn what the local consequences were. No doubt the long delay in releasing 1993 findings made them less useful than 1994 findings can be. But, to improve 1994 reports, a sample of districts should be asked what they learned from the 1993 report, what difference that made in their work, and what misunderstandings came to light.

A principal feature of the assessment is the 6-point PL scale which runs across subjects. The public will be wrong if it makes the natural assumption that “6” means much the same thing in all subjects. The descriptions in Language Arts seem to fit what any teacher or parent would think of as A level work, differing from the 5 level (A-?) primarily in demanding consistency. The Mathematics descriptions are much stricter. At 5 “responses fully meet expectations” and are “supported by effective arguments using multiple or unique approaches.” To get a 6 the fourth grader must perform like a mathematician: “precise logical arguments” … “responses often exceed expectations.” Naturally, not many students receive high scores.

We cannot say whether the standard-setters overreached; the National Council of Teachers of Mathematics recommendations are notoriously vague on the subject of “How much is enough?” We do not know how to evaluate the gossip that the CLAS math levels were scaled steeply in order to convey to the public a sense of crisis in mathematics. It does seem that the public will be badly served if the meaning of the PL scale is not roughly constant across areas.

One feature of the reports merits special praise. A school working under difficult circumstances tends to become discouraged when assessments show its students performing far below the State average year after year. This year, CLAS supplied Coastal View School (for example) with performance information on a “Comparison Group” (CG) of schools that roughly matched Coastal View on four background factors such as parental education. If a school is judged, and judges itself, against its CG rather than the State or district as a whole, a large fraction of schools in both elite and disadvantaged communities will be pressed to improve. And the school in a disadvantaged community whose results are superior to those in its CG will be encouraged.

Regrettably, if Coastal View is typical, its local newspaper ignored the newsworthy CG story. A way of sensitizing the press to the relevance of CGs is needed. The inattention of the press becomes understandable when we find, in a suggested press release supplied by CLAS to school districts, this sentence:

“The report for [this district] showed our students achieving better than many other schools in our comparison groups,” Superintendent __ said.

Any editor would delete a claim so hollow that 95 out of 100 schools could truthfully make it.

The CG device should be supplemented with a report showing how, across schools, performance levels relate to community factors. Demographic groups that are doing poorly are dramatic evidence of a policy problem.

Scoring

Scoring is placed here between reliability and validity. It affects both of them.

Problems with mapping rules

We begin by commenting on anomalies associated with a novelty in CLAS. A report is made on a 1-to-6 scale. To be useful the scale must mean the same thing from year to year, apart from clearly announced changes in the rules of the game. Experience may dictate some major changes in the first years of the enterprise; we caution only that users of reports should be led to understand the changes. An example is the addition, in each 1994 Reading form, of 21 multiple-choice (MC) tasks to the single open-ended (OE) task of 1993.

Because of its structure, the Mathematics score presented vexing problems in 1993. Each form comprised 7 MC tasks and 2 OE tasks (of which only one was scored for any student because of cost). We do not question this design, under 1993 conditions. Scorers were to judge OE tasks on a 1-to-4 scale, with the added possibility of recording “4 with distinction.” This is clumsy; a simple 5 would have told the same story. Nor is it evident that Math could not have used a 1-to-6 rating scale as in the other subjects.

More troublesome was the need to combine MC and OE scores. It was decided to “map” the pair of part scores into a single PL. For one illustrative form a committee decided that a PL of 4 would be earned by any of the following OE-MC combinations: 4-3, 4-4, 3-4, 3-5, 3-6, 2-5, 2-6, 2-7, 1-7.⁹ It is proper that educators informed about the subject matter should make these judgments. In this particular map, the subtests usually receive equal weight; 4-4 becomes PL 4; 3-3 becomes PL 3. But there are oddities: At every OE level, students receive no more credit for 86% MC correct than for 71%. This, if noticed, might be hard to justify publicly. Similar but different nonlinearities appear in maps for other test forms.
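
The flat stretch in the map is easy to see once the PL-4 cells are laid out by OE level. The sketch below uses only the nine OE-MC combinations quoted above for the illustrative form. The rest of the map is not reproduced here, so the check covers only the OE levels at which both 5 and 6 items correct appear among the quoted cells.

```python
# (OE score, MC items correct) pairs that the illustrative form maps to PL 4.
PL4_CELLS = {(4, 3), (4, 4), (3, 4), (3, 5), (3, 6), (2, 5), (2, 6), (2, 7), (1, 7)}
MC_ITEMS = 7  # each 1993 Mathematics form had 7 MC tasks

def mc_scores_at_pl4(oe):
    """MC scores (number correct) that earn PL 4 at a given OE score."""
    return sorted(mc for (o, mc) in PL4_CELLS if o == oe)

for oe in (4, 3, 2, 1):
    mcs = mc_scores_at_pl4(oe)
    print(oe, mcs, [round(100 * mc / MC_ITEMS) for mc in mcs])
# 4 [3, 4] [43, 57]
# 3 [4, 5, 6] [57, 71, 86]
# 2 [5, 6, 7] [71, 86, 100]
# 1 [7] [100]
```

At OE levels 2 and 3, five items correct (71%) and six items correct (86%) both land on PL 4, which is the oddity noted in the text.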

⁹	Source: Specimen table in Elementary School report, p. 7.

The judgments should be validated. Such weightings are known in the measurement trade as “clinical judgments,” and the term is pejorative. Experience with other instruments indicates that two committees are unlikely to create the same configural map, and unlikely to agree whenever they go beyond “more is better.” The problem had no large importance for CLAS-1993. But it will have serious implications when maps reflecting sincere but to some degree unreproducible judgments of a committee affect the reported PL for a student. That becomes especially dubious when students taking different forms are judged on the basis of different maps. Unstable mapping will also matter when 1994 PACs are compared with those of 1993. 1994 forms differ from those of 1993, so their maps will differ unreproducibly if the present procedure for mapping is continued.

An experiment is called for in which entirely independent committees—working only from the PL descriptions—prepare maps for the same form(s). This might show that irregular maps truly reflect professional wisdom. If it suggests instead the caprice of committee processes, the system needs an overhaul. As our Committee has not inquired into the rationale for the nonlinear mappings, we cannot suggest how to produce a more stable system. But the thought of simplification is inescapable.

The problem of merging OE and MC arises in Reading as well as Mathematics in 1994, and checks are needed if a highly judgmental mapping is proposed.

Even the decision in Writing to weight Rhetorical Effectiveness (RE) and Writing Conventions (WC) the same way (85% vs. 15%) in all forms and all years raises technical problems. In older CAP, when interpretation rested mostly on percentile standings, the range of scores did not matter. Now, with attention to the absolute PL, it does. CLAS might be wise to change the weights from 0.85/0.15 to (say) 0.96/0.17, so that averaging two scores would not cause the shrinkage of range illustrated in what follows.

As a by-product of an improved measurement design, CLAS-1994 runs a grave risk of public misunderstanding. We illustrate with Writing. In Grade 8, two writing specimens will be judged rather than the single one of 1993. The specimens are separately assigned scores on the 1-to-6 scale; for the sake of discussion we presume that these will be averaged to get the PL. (In this paragraph we can ignore the preliminary RE and WC scores.) Doubling the information on a student reduces the student-level SE by roughly 30%, which is all to the good. But because no student is likely to get extreme scores consistently on repeated testings, with more tasks the distribution is pulled toward the middle of the scale. One can see the headlines now: “Performance of top students declines.”
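
Both effects in this paragraph follow from elementary formulas, sketched here under simplifying assumptions: the errors on the two tasks are taken to be independent and equally large, and the two scores are taken to have equal spread. The correlation of 0.5 used in the second call is a hypothetical value chosen only for illustration.

```python
import math

def se_reduction(k):
    """Fractional drop in SE when k independent, equally precise scores are averaged."""
    return 1.0 - 1.0 / math.sqrt(k)

def sd_ratio_of_average(r):
    """SD of the mean of two equally variable scores, relative to the SD of
    a single score, when the two scores correlate r."""
    return math.sqrt((1.0 + r) / 2.0)

print(round(se_reduction(2), 3))           # 0.293, the "roughly 30%" in the text
print(round(sd_ratio_of_average(0.5), 3))  # 0.866 for the hypothetical r = 0.5
```

Unless the two tasks correlate perfectly, the ratio is below 1: the distribution of averaged scores is narrower than that of single scores, so fewer students land at either extreme of the scale.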

Here is an example based on 278 students in WR-8.¹⁰ Two responses from each student were scored. This is the finding:

	Test A	Test B	Average of A and B
Number at PL 5 and 6	45	54	43.5
Number at PL 1 and 2	19	19	13.5

The differences are moderate, but the drama will build when the story appears in nearly every school—and it will.

A general equating problem, not limited to mapping rules, arises when a school (or its local newspaper) wishes to compare scores in 1994 with those from 1993. No plan for equating has been spelled out; however, some data will be available to show how difficulties of the tests from the two years compare. Strict equating is not to be hoped for. Schools should be warned against overinterpreting apparent changes.

The invisible student

Earlier we described the presence in some scoring samples of papers to which no numerical grade was assigned. Most of these “X” students evidently failed the test by any usual standard. (Some papers may have come from competent students whose poor English proficiency should have exempted them, and some blank papers may have been those of Grade-4 students who had been absent on the day of the test scored, but who had “live booklets” and so were selected for scoring. CLAS should eliminate these sources of confusion.)

¹⁰	Source: Printout supplied June 17, 1994 by CTB. The picture is similar for other samples in Reading and Writing. The school PAC, based on one response per student, is an average of the PAC over the two forms. When two scores for the student were averaged, there were borderline cases. We assigned half the 3.5's to 3 and half to 4.

Assigning the lowest PL to the irrelevant responses and the doodles was not allowable; they “did not meet the standard for a PL of 1.” This is literally true. To quote a part of the definition of Level 1 in Reading: “Reading performances at this level demonstrate an understanding of only a word, phrase, or title.” The doodler did not demonstrate such understanding, so a literal-minded scorer had to exclude the student from the 1 category. CLAS has at least two alternatives for bringing these students visibly into the count. It could adopt a 0-to-6 reporting scale. Or it could revise the instructions to say that “1 … demonstrates no more than an understanding of a word, phrase, or title.” We are equally persuaded that a student who misses the point of a verbal question and wanders off into irrelevance should receive a very low mark in Reading. Whether there should be a special place in the scorers' Hell for students with illegible writing, we cannot say; but making them invisible is not the cure.

Scoring accuracy in 1993

The accuracy of school scores and student scores is improved by suitable distribution of scorers over responses. We understand that CLAS scoring has followed the desirable practice of “spiralling” in scoring, so that if 6 students in a school respond to the same task their papers are likely to be assigned to 6 different judges. Then the leniency of one judge has no appreciable effect on the school report.

The studies of scorer agreement in the 1993 DTR do not display scorer variability clearly. The panel recommends that for internal purposes CLAS calculate SEs arising from scorer differences alone. We reorganized information from one type of table in the Draft Technical Report in this way.¹¹

Considering scoring alone as a source of error, we obtained the following SEs for the student PL from a single open-ended task.

RD 4	RD 8	RD 10	WR 4	WR 8	WR 10	MA 4	MA 8	MA 10
0.49	0.58	0.52	0.55	0.60	0.68	0.38	0.40	0.38

¹¹

As a check on quality during operational scoring, two readers occasionally were assigned to score the same response without knowing the judgment of the other reader. The tabulations from which we worked make a confusing and unnecessary distinction by arbitrarily distinguishing between Reader 1 and Reader 2 in each pair. A recalculation from the original data would give an answer of about the same magnitude as ours, however.

Our first calculation, for Reading, is based on Table 4.1 of the Draft Technical Report, considering the three grades separately. Medians were calculated for the within-reader standard deviation (SD) and the interrater correlation (r). The formula (1 – r)(SD²) produced an error variance of 0.34 in Grade 8. For comparison, the residual-error variance component (called Vpi,e in Table 4.36) has a median of 0.42 for RD-8. Other calculations were based on Tables 4.2 and 4.4.

The values for Reading are unsatisfactory.¹² They imply a 90-percent confidence interval on the 6-point scale of about ±0.8 points, leaving out of account all other errors. To some extent, scorer errors average out in the school PAC. But the student-level residual variance contribution of 0.35 in RD-4 includes a scorer component of 0.24.¹³ If 68% of the residual variance at the individual level arose from scorers, it necessarily makes up the same fraction of the corresponding school-level error on the PL scale.
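The arithmetic behind these scorer-only figures can be sketched in a few lines of Python. This is an illustrative sketch, not the Committee's computation: the function names are ours, the 90-percent interval assumes normally distributed error, and the inputs r = 0.5 and SD = 0.8 are made-up placeholders, since the medians from Table 4.1 are not reproduced here. The error variance of 0.34 and the SE of 0.49 are the values quoted in the text.

```python
import math

def scorer_error_variance(r, sd):
    """Scorer-only error variance from the formula (1 - r) * SD^2."""
    return (1.0 - r) * sd ** 2

def ci90_half_width(se):
    """Half-width of a 90% confidence interval, assuming normal error."""
    return 1.645 * se

# Values quoted in the report.
se_rd8 = math.sqrt(0.34)             # scorer-only SE for RD 8
half_width_rd4 = ci90_half_width(0.49)  # interval half-width for RD 4

# Placeholder inputs (NOT taken from Table 4.1), only to show the formula in use.
example_variance = scorer_error_variance(r=0.5, sd=0.8)

print(round(se_rd8, 2), round(half_width_rd4, 2), round(example_variance, 2))
```

With the quoted values, the square root of 0.34 rounds to the 0.58 tabled for RD 8, and an SE near 0.5 gives a half-width of about 0.8 points.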

Scoring in Writing was equally unsatisfactory, but there is a twist in the interpretation. The weighting process already discussed shrinks the scorer-only SE in WR 4 from 0.55 to 0.47. After shrinkage, the three values are still large.

As for Mathematics, scoring was comparatively accurate; and the OE score made up only about half of the final PL, reducing the importance of scorer errors. An examination of other reports, some of them unpublished, indicates that CLAS scoring in Mathematics is achieving an accuracy level like that in other projects.

In Reading and Writing there is a serious scoring problem. (Our inspection of the data does not give CLAS credit for its “read-behind” quality controls, but these are at best partial.) Many assessments are having difficulty with scorer error on writing tasks, but there should be room for improvement. Because of the irregular form of reports on scoring from all sources, we have been unable to judge whether CLAS is encountering more difficulty than the other projects.

The publicity releases on CLAS claim that scoring is “accurate” and “consistent”, but the CLAS scoring guides and scorer training are not yet what they should be.¹⁴

Our analyses refer to the PL scale and not the PAC scale. If CLAS plans to use PAC reporting in 1994 it should rework earlier data files to examine exactly how scorer disagreements affect the SE of the PAC.

¹²

The Draft Technical Report (Table 4.13) shows that in RD-4 two scorings of the same paper agree 60% of the time. This is unimpressive. If the second scorer were to assign a “3” to every paper without looking at the response, there would be 45% exact agreements with scores assigned by an actual reader. This is true simply because 45% of all responses are scored “3”.

¹³	A caveat: The samples are not the same; and assumptions had to enter the calculations.

¹⁴

The shortcomings of the scoring system should have been recognized prior to the 1993 assessment; the bad news was evident in a December 1992 document. See the pri,e components in studies beginning at p. 17 of CAP Technical Report, Supplemental Section No. 1. CTB, Dec. 1992. It is a mistake to attribute this variation to fluctuation in the student's behavior, as CLAS evidently did. Assessors outside CLAS have been making much the same misinterpretation down to the present date.

A step to improve accuracy

Judges scoring open-ended questions should be encouraged to report intermediate values. Unnecessary error is introduced when a judge is forced to call a borderline response either 4.0 or 3.0. Recording 3.5 as the judgment can only make PAC's more accurate. (In determining PAC 4+, the computer would simply divide the count at 3.5 in half.) As for the student, 3.5 is probably closer to the truth than the forced 3.0 or 4.0.

We can suggest a refinement on the use of 3.5 for a borderline judgment. The judge could say that, if forced, he or she would select 3 rather than 4 by recording 3.4 (or, if leaning the other way, 3.6). This makes no heavy demand on the judge, but it further reduces the role of chance. (The final PL scores can be reported as whole numbers; we are not advocating the reporting of refined student scores such as 3.25.)
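A minimal sketch of the tallying rule just described, assuming judges record values such as 3.4, 3.5, and 3.6. The function name and the sample list of papers are hypothetical; only the rule itself (an exact 3.5 counts half toward PAC 4+, a 3.4 counts below, a 3.6 counts above) comes from the text.

```python
def pac_at_or_above(scores, cut=4):
    """Fraction of papers counted at or above `cut`.

    A paper exactly on the border (cut - 0.5) contributes one half;
    anything above the border counts fully, anything below not at all.
    """
    border = cut - 0.5
    credit = 0.0
    for s in scores:
        if s > border:
            credit += 1.0
        elif s == border:
            credit += 0.5
    return credit / len(scores)

# Hypothetical papers from one school (not CLAS data).
papers = [2, 3, 3.4, 3.5, 3.5, 3.6, 4, 5]
pac = pac_at_or_above(papers)
print(pac)
```

In this made-up sample, three papers lie above the border and two sit exactly on it, so four of the eight papers' worth of credit is counted.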

In Writing, the processing ought to retain one decimal place in each of the two part-scores on a task that will be weighted to get a task PL. The final average PL over tasks would be rounded to a whole number for reporting. Likewise, in Mathematics and Reading where there are two or more open-ended questions, the average of the two scores should be carried to one decimal place before it enters a mapping table or a composite.

The extra decimal in Writing has a special value. The Writing Conventions (WC) score is in the test because many citizens want the school to improve spelling and other “mechanics.” The ostensible 15% weight for it (versus 85% for Rhetorical Effectiveness (RE)) bespeaks the language teachers' consensus that communication is a more significant goal. In fact, with integer scoring WC has almost no influence. A 4 on RE combines with 1 on WC as (0.85) × 4 + (0.15) × 1 = 3.4 + 0.15 = 3.55, which rounds back to the RE score. WC affects the PL of only the exotically rare student whose RE and WC scores are four or five points apart. When two tasks are averaged, retaining the decimal place would allow WC to “count for something”, as the public is promised.¹⁵

¹⁵

The 4-1 combination that averaged to 3.55 would, if averaged with a task scored 3.3, give an unequivocal 3. Under the integer system, the average would be exactly 3.5; as of last report, no one has decided how that would be rounded. After the shift to decimal scoring, the number of students whose WC “would count” will still be tiny.
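The claim that WC rarely matters under integer scoring can be checked by enumerating all 36 combinations of RE and WC on the 1-to-6 scale. The sketch below is ours, not CLAS code; it works in hundredths so that rounding is exact, and it assumes ordinary round-half-up (no combination actually lands on a half).

```python
def composite_pl(rhet, conv):
    """Rounded task PL from integer part-scores, weighted 85% RE and 15% WC."""
    hundredths = 85 * rhet + 15 * conv   # composite times 100, kept as an exact integer
    return (hundredths + 50) // 100      # round half up

# Combinations in which the rounded composite differs from the RE score.
changed = [(rhet, conv)
           for rhet in range(1, 7)
           for conv in range(1, 7)
           if composite_pl(rhet, conv) != rhet]

print(changed)
```

Every pair in `changed` has RE and WC at least four points apart, which is the "exotically rare" case described in the text.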

Reliability of school scores

The reliability question can be put simply: How far is a school's actual success rate likely to depart from its true rate? The “true” level is thought of as the percentage of students at (say) 4 or above that would be obtained by exhaustive testing of all the eligible students. The “How far?” question is handily answered by the 90-percent confidence interval described earlier, and it can be answered without exhaustive testing. From the consistency of students' scores across two or more writing exercises, for example, one can infer the accuracy expected from a test of any length. Where the school success rate is derived from a sample of students, the uncertainty arising from sampling of students is also taken into account.

Despite the long history of reliability theory, the assessment designs of CLAS and kindred projects require new techniques. We have had to adapt established theory to obtain many of the results for this section.

The school-level standard error

At each step in assessment, some action adds or subtracts from the school's net score, bringing it closer to or farther from the true score. The size of each influence is described by a variance component and these add up to the error variance. An analogy to noise will clarify the term “variance.” A radio station sends out a message—a signal—with a certain power. Noise is added by interference in the atmosphere and by disturbance in the listener's vicinity. Each source of noise has its own strength, and these magnitudes accumulate into a total noise power which is like our error variance. The noise may be strong enough to make the message highly uncertain.

Table 4 presents the basic organization of this kind of analysis. A “forms component” arises because forms differ in difficulty. Selection of a few forms from the domain of appropriate forms made the 1993 California average higher or lower than the average over the entire domain would have been. Spiralling more forms in a school tends to neutralize that variation; difficult and easy forms tend to balance out. The theory, recognizing that principle, calls for dividing the component by the number of forms k (third column of Table 4). Table 4 goes on to characterize briefly the other components; more detail is found in the Endnote.

Table 5, derived from the “Multiplier” column of Table 4, gives information of great value in thinking about how to reduce the SE. The most obvious possible moves for CLAS, as it develops, are listed; limited steps along several of these lines were taken in CLAS-1994. Moves affect different components. The magnitudes of the contributions in the 1993 data will suggest where improvement was most needed, and CLAS can use the components to anticipate 1994 SEs, taking all changes in design into account.

Table 4. Components Contributing to the Uncertainty of the School Score

Component	Origin of influence	Multiplier ᵃ	Type of error
Form (f)	Test tasks selected for the year have a particular level of difficulty.	1/k ᵇ	Measurement
School × Form (sf)	Tasks easy in one school are hard in the next because of the good or poor fit to the local instruction.	1/k ᵇ	Measurement
Student (p)	A more able or less able sample of students is drawn for scoring.	(1/n)(1 − n/N)	Sampling
Residual (res)	Positive effects come from luck in drawing a familiar question or an easy one, drawing a lenient scorer, or being alert and attentive. An inattentive scorer may overlook a fault. ᶜ	(1/n)(1 − n*/N) ᵈ	Measurement
ᵃ k is the number of forms spiralled in a school, n is the number of students scored, and N is the number of students eligible for testing. n* is the average number of students scored per form.
ᵇ The rule for multiplication is more complicated when forms overlap as they do in Mathematics. This technicality is spelled out in the Endnote.
ᶜ Each such cause has a downside counterpart.
ᵈ The Committee has considered two alternative mathematical models. The second model would change the multiplier to 1/n. The choice has no practical effect on this report, as the term (1 − n*/N) is always 0.83 or larger in our numerical work.

Table 5. Which Contributions to the Standard Error are Reduced by Possible Changes in Measurement Design?

Possible change	f	sf	p	res
Sample more students ᵃ
Spiral more forms ᵇ
Make forms more comparable
Increase periods for testing ᵃ,ᵇ,ᶜ
Better scoring controls
ᵃ Increases cost of scoring.
ᵇ Improves coverage of curriculum.
ᶜ Increases demand on instructional time.

The panel decided at the outset to estimate SEs only in Grades 4 and 10, believing that analyses in Grade 8 would yield similar results. The calculations for Table 6 employed all schools that had given every form of the test under investigation to at least z students, where z is a number that varied from one analysis to another. In Grade-4 Mathematics, where there were 16 forms, z was set at 2; and in Grade 10 it was set at 4. The value of z was set at 7 in Reading and Writing. The cases analyzed for any school-form combination were chosen at random. The same analysis was used to estimate the combined p and res components for Table 7, but to separate these two parts we drew on the “pilot” schools that in CLAS-1993 had administered two forms of a test to every student in the selected grade, both forms being scored. On this procedure, see the Endnote.

A number of approximations and assumptions were required to obtain estimates from multiple analyses with somewhat irregular designs. In making these technical decisions we tried consistently to avoid choices that would tend to overstate SEs. Our compromises in analyzing Writing scores, however, may have led either to overestimation or underestimation in SEs. The matter is spelled out in the Endnote, both for the sake of normal scientific reporting and because CLAS will need to cope with the dilemma in the future. The Endnote documents sources for Table 6 and Table 7.

Our methods, tailored as they were to the data available from CLAS-1993, are by no means the final word; in the future, improved designs can provide better data to evaluate SEs. We recommend that CLAS convene a small group of mathematical statisticians to compare the two models mentioned in the footnote to Table 4 and also to advise on the choice of procedures for estimating CLAS-1994 SEs. For CLAS-1995, we recommend planning for every student in some schools to take at least two test forms and to arrange double scoring for those responses, so as to separate components associated with students, scorers, forms, and their interactions.

Table 6. Variance Contributions that are Treated as Constant over Schools (PAC scale)

	Form component	School × Form component	Form contribution	School × Form contribution
RD 4	0.0031	0.0107	0.0005	0.0018
WR 4	0.0012	0.0134	0.0002	0.0022
MA 4 ᵃ	0.0022, 0.0027	0.0002, 0.0069	0.0004	0.0005
RD 10	0.0021	0.0033	0.0003	0.0005
WR 10	0.0092	0.0064	0.0008	0.0005
MA 10 ᵃ	0.0138, 0.0048	0, 0.0048	0.0020	0.0003
ᵃ The first value comes from MC item-sets, the second from OE tasks.

The size of components

Table 6 reports estimates for the two sources of error that are treated as constant over all schools.¹⁶ The “contribution” is the amount added into SE². The components are reported also, so that the reader can evaluate the effect of changing the number of forms in the spiral.

In RD-4 and WR-4, the sf contribution is much greater than the f contribution. Evidently, variation is substantial in those areas with respect to the match of the demands of various forms to the schools' particular curricula and instructional methods. The contribution arising from the MC item-sets in MA-10 is large, and suggests a fault in test construction.

Values ranging from 0.0002 to 0.0008 elsewhere in the contributions columns should not be regarded as negligible. The reader who adds the f and sf contributions and takes the square root (e.g. in MA-4, √(0.0004 + 0.0005) = 0.03) will reach a floor for the SE to which other components can only add. It follows that it was impossible in CLAS-1993 to satisfy the Committee's 2.5% criterion in any grade-area combination. The Committee has deliberately retained 2.5% as a recommended goal for the SE, even though the goal will remain out of reach until the number of forms is appreciably increased. (Reducing the diversity of forms could seriously impair coverage of the curriculum.)

¹⁶

The sf component is sure to be greater in some schools than others. Research on this topic is best pursued in comparatively small projects where the statistical information can be interpreted in the light of the test content and the school curriculum. No present knowledge permits an accurate estimate of the sf component for a single school.

Table 7. Variance Contributions that are Affected by n and N (PAC scale)

	p component	res component	p contribution (N = 60, n = 50)	res contribution (N = 60, n = 50)	p contribution (N = 200, n = 50)	res contribution (N = 200, n = 50)
RD 4	0.052	0.123	0.0002	0.0021	0.0008	0.0024
WR 4	0.063	0.146	0.0002	0.0025	0.0009	0.0028
MA 4	0.080	0.080	0.0003	0.0015	0.0012	0.0015
RD 10	0.051	0.120	0.0002	0.0021	0.0008	0.0023
WR 10	0.063	0.146	0.0002	0.0027	0.0009	0.0029
MA 10	0.087	0.087	0.0003	0.0016	0.0013	0.0017

This speaks to a policy decision. There is pressure to make public the tasks used in 1993 and 1994. CLAS has followed the practice of other assessments, releasing only a modest number of illustrative tasks. Careful development of new forms takes time. Although additional tasks for CLAS-1995 are under development, there is no possibility of increasing the number of forms per school without reusing many 1993 and 1994 tasks. Reusing tasks that have been made public invites inflation of 1995 results, because some teachers and parents will encourage practice on anticipated test tasks. Such inflation has occurred in other States. The problem will become severe if, in the name of accountability, a plan to impose sanctions on low-scoring schools emerges, as it did in Kentucky. We therefore urge the State Department of Education to resist the demand for disclosure of tasks.

We turn now to Table 7. The components as such deserve little interpretation; they are a building block for meaningful statistics. The estimate of the components is intended to describe an average school; obviously some schools have more varied student bodies than others and their p components would be especially large. The contributions will vary with both n and N. This table assumes that 50 papers were scored, to keep the example simple. The reader can use Table 4 to work out the contributions for other n and N. The p contribution becomes small when most students in the school are scored.

For small schools where scoring was on target, the res contribution was always a major part of the error. For large schools, the p component had noticeable impact. The f or sf contribution was sometimes a dominant factor. The reader should review Table 5 with these facts in mind.

Now we are ready for a bottom-line SE. For RD-4, with N = 60, the values from Table 6 and Table 7 combine:

SE² = 0.0005 + 0.0018 + 0.0002 + 0.0021 = 0.0023 + 0.0023 = 0.0046.

Then SE = √0.0046 = 0.068, or 6.8%.

The full set of SEs for these illustrative schools is as follows:

	RD-4	WR-4	MA-4	RD-10	WR-10	MA-10
N = 60	6.8%	7.2%	5.0%	5.6%	6.5%	6.5%
N = 200	7.4%	7.8%	5.8%	6.3%	7.1%	7.3%
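The RD-4 calculation can be reproduced from the raw components with the multipliers of Table 4. The sketch below is ours and rests on two assumptions that this section does not state outright: that k = 6 forms were spiralled in RD-4 (inferred from the ratio of component to contribution in Table 6), and that scored students are spread evenly over forms, so that n* = n/k. Function and variable names are hypothetical.

```python
import math

def se_contributions(f, sf, p, res, k, n, N):
    """Variance contributions to SE^2, using the multipliers of Table 4."""
    n_star = n / k                      # assumed: students scored per form
    return {
        "f": f / k,
        "sf": sf / k,
        "p": p * (1 / n) * (1 - n / N),
        "res": res * (1 / n) * (1 - n_star / N),
    }

def school_se(**kw):
    """Square root of the summed contributions."""
    return math.sqrt(sum(se_contributions(**kw).values()))

# RD-4 components from Tables 6 and 7; k = 6 is an inferred assumption.
rd4 = dict(f=0.0031, sf=0.0107, p=0.052, res=0.123, k=6, n=50)

se_small = school_se(N=60, **rd4)
se_large = school_se(N=200, **rd4)
print(f"N=60: {100 * se_small:.1f}%   N=200: {100 * se_large:.1f}%")
```

Under these assumptions the results agree with the RD-4 column above to within rounding. Changing `k`, `n`, or `N` shows how each design choice moves the SE.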

Sample size and the SE

It is evident that SEs are generally far above desirable levels. Now we can clarify the role that sampling of students, and in some places undersampling, contributed to the SE.

Table 8 continues the examination of RD-4. Increases in the SE with declining n are gradual. The reader should reflect on the fact that increasing scoring from 100 papers to 200—a costly operation—would have narrowed the confidence interval by only 15%. In RD-4 the constant f + sf contribution utterly dominates the other contributions when sampling fractions are large. The trends with number scored would be the same in other grades and areas.

In a small school, samples of 70% (the approximate 1993 target) are not much less satisfactory than the 100% sample. In the large school, scoring 45% of students matches the SE from 100% in the small school. Scoring 100% of target in the large school and in the small school gives much the same SE. This is evidence that CLAS-1993 adjusted target sizes sensibly, making accuracy similar in large and small schools (although the budget precluded setting adequately large targets).

Below 50% the SE rises rapidly, and shortfalls in that range had serious consequences especially in small schools.
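The dependence of the RD-4 standard error on the number of papers scored can be explored with a short sweep. This is a sketch under our own assumptions, not the Committee's program: the constant f + sf contribution of 0.0023 and the p and res components are taken from Tables 6 and 7, while K = 6 forms and the even spread of students over forms (n* = n/K) are assumptions.

```python
import math

FLOOR = 0.0023                 # f + sf contribution for RD-4 (Table 6)
P, RES = 0.052, 0.123          # RD-4 variance components (Table 7)
K = 6                          # assumed number of forms spiralled

def rd4_se(n, N):
    """RD-4 school SE when n of N eligible students are scored."""
    p_part = P / n * (1 - n / N)
    res_part = RES / n * (1 - (n / K) / N)
    return math.sqrt(FLOOR + p_part + res_part)

for N in (60, 200):
    for pct in (100, 70, 50, 20):
        n = N * pct // 100
        print(f"N={N:3d}  n={n:3d} ({pct:3d}%)  SE={100 * rd4_se(n, N):.1f}%")
```

The output shows the pattern described in the text: the SE changes slowly while the sampling fraction is large, because the constant f + sf term dominates, and climbs quickly once fewer than about half the students are scored.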

Table 8. Increase in RD-4 Standard Error with Scoring of Fewer Students

N = 60 (target = 44; 70% of target = 31)
n scored	Percent of N	f, sf contribution	p contribution	res contribution	SE
60	100%	0.0023	0.000	0.002	6.3%
54	90%	0.0023	0.000	0.002	6.6%
48	80%	0.0023	0.000	0.002	6.9%
42	70%	0.0023	0.000	0.003	7.3%
36	60%	0.0023	0.001	0.003	7.7%
30	50%	0.0023	0.001	0.004	8.3%
24	40%	0.0023	0.001	0.005	9.2%
18	30%	0.0023	0.002	0.007	10.4%
12	20%	0.0023	0.004	0.010	12.6%

N = 200 (target = 50; 70% of target = 35)
n scored	Percent of N	f, sf contribution	p contribution	res contribution	SE
200	100%	0.0023	0.0000	0.0005	5.3%
180	90%	0.0023	0.0000	0.0006	5.4%
160	80%	0.0023	0.0001	0.0007	5.5%
140	70%	0.0023	0.0001	0.0008	5.7%
120	60%	0.0023	0.0002	0.0009	5.8%
100	50%	0.0023	0.0003	0.0011	6.1%
80	40%	0.0023	0.0004	0.0014	6.4%
60	30%	0.0023	0.0006	0.0020	7.0%
40	20%	0.0023	0.0011	0.0030	8.0%

Among schools where scoring fell below 70% of target, the shortfall amplified the already large error to a more serious level. From the numbers in Table 3, the shortfall seriously affected 342 reports on elementary schools out of nearly 14,000 (counting subject areas separately), a small proportion but a not inconsiderable number. In Grade 8, serious shortfall affected 136 reports out of 4,656. In Grade 10, 318 reports out of 3,549 were affected, the higher proportion presumably being traceable to continuation schools. Overall, shortfall in meeting sampling targets was a less pervasive and substantial influence on the accuracy of CLAS-1993 school reports than the fact that in middle-sized and large schools the targets were low.

Will the situation be better in CLAS-1994? Improvements in the sampling plan have now been programmed. We understand that tolerance levels for scoring error have been much tightened, and the important step of lengthening some tests has been taken. Moreover, an appreciably larger number of papers will be scored than in CLAS-1993. All these will lower the SEs.

SCORES FOR INDIVIDUAL STUDENTS

The need for equating

CLAS-1994 plans to report on individual students in Grade 8. The number of exercises has been increased to improve accuracy.

CLAS is like other similar assessments in not having come to grips with the fact that a design superior for assessing schools creates difficulties at the student level, and vice versa. In a matrix design the luck of the draw determines whether a student gets a comparatively easy test form or a hard one. It is reasonable to suppose that a device can be invented for adjusting student scores upward or downward, depending on the estimated difficulty of forms. CLAS has not developed such a device, but it is not too late to attempt to find a way to adjust 1994 reports on individuals.

We suggest that CLAS's technical advisers adjust open-ended scores separately from multiple-choice scores. Adjustments left to the final PL stage would necessarily be crude because the PL scale is coarse. Adjustment at the part-test level will of course have major implications for the 1994 mapping.

Another problem that all assessments are cognizant of, but none has begun to solve, is “equality of opportunity to learn.” It is agreed that it is unfair to expose a student to criticism if the instruction offered is not a reasonable match to the test. It may be possible to identify schools or classrooms where lack of resources, school disruption, or inadequate teacher preparation has limited the opportunity to learn what CLAS is testing. CLAS has been making worthwhile studies of the possibility. But no proposal is in place for recognizing a student's lack of instructional exposure when CLAS does report scores to parents.

Reliability for individuals

Our analysis in earlier sections was of the school-level PAC. We are now shifting to student-level scores on the 1-6 scale. The numbers to be examined will therefore not be comparable to those reported earlier. And we can set aside the complicated logic of the Endnote. In the pilot studies, students responded to two or more tasks. Student scores from 100 schools (for open-ended portions of the exams and separately for multiple-choice sections) were pooled. A conventional two-way analysis of variance estimated components of variance for pupil, task, and a residual. The SE for a student is the square root of the sum of the task and residual components, both divided by the number of tasks the student takes. Because this analysis ignores school boundaries, the f and sf components are automatically included with proper weights.

Matters become complicated when two sections of the test use different kinds of items; then the estimates for the two kinds have to be combined. Another expansion of the analysis is required if the student's response is rated by more than one reader.
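For the simple case of a single kind of task scored by one reader, the pupil-by-task analysis can be sketched as follows. This is an illustrative sketch only: it assumes a fully crossed design with one score per pupil per task and uses the usual expected-mean-square estimators; the tiny data set is invented, and the function names are ours rather than anything used by CLAS.

```python
import math

def variance_components(scores):
    """Estimate pupil, task, and residual components from scores[pupil][task]."""
    n_p, n_t = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n_p * n_t)
    pupil_means = [sum(row) / n_t for row in scores]
    task_means = [sum(row[t] for row in scores) / n_p for t in range(n_t)]

    ss_p = n_t * sum((m - grand) ** 2 for m in pupil_means)
    ss_t = n_p * sum((m - grand) ** 2 for m in task_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_res = ss_total - ss_p - ss_t

    ms_p = ss_p / (n_p - 1)
    ms_t = ss_t / (n_t - 1)
    ms_res = ss_res / ((n_p - 1) * (n_t - 1))
    return {
        "pupil": max(0.0, (ms_p - ms_res) / n_t),   # negative estimates set to zero
        "task": max(0.0, (ms_t - ms_res) / n_p),
        "residual": ms_res,
    }

def student_se(components, n_tasks):
    """SE of a student's mean score over n_tasks tasks."""
    return math.sqrt((components["task"] + components["residual"]) / n_tasks)

# Invented scores for four pupils on two tasks (not CLAS data).
pilot = [[3, 4], [2, 3], [4, 4], [5, 6]]
comps = variance_components(pilot)
print(comps, round(student_se(comps, n_tasks=2), 2))
```

Raising `n_tasks` in `student_se` shows how lengthening the test would shrink the student-level SE for a given set of components.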

What uncertainty is acceptable?

A standard error of, say, 1.0 for a student score would imply a 90-percent confidence interval of ±1.64. Probably everyone would agree that a 3-point spread of uncertainty is unacceptable. It would mean that a student who is truly at 3.0 has more than 1 chance in 20 of scoring above 4.5 (hence being reported at 5) and an equal chance of being reported at 1. The likelihood of serious error exceeds 10%. This type of reasoning about misclassification will help our readers judge what lower levels of error they can accept.
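The tail probabilities in this argument follow from the normal curve. The sketch below assumes normally distributed error, as the ±1.64 figure implies; the function names are ours.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_above(cut, true_score, se):
    """Chance that the observed score exceeds `cut` for a given true score and SE."""
    return 1.0 - normal_cdf((cut - true_score) / se)

p_high = prob_above(4.5, true_score=3.0, se=1.0)   # reported at 5 or higher
p_low = normal_cdf((1.5 - 3.0) / 1.0)              # reported at 1
print(round(p_high, 3), round(p_high + p_low, 3))
```

Each tail comes to a little under 7 percent, which is "more than 1 chance in 20", and the two together exceed 10 percent.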

Many misclassifications are inevitable among students near a borderline. A small error will throw a student truly at 3.4 over the border into the 3.5-4.5 range. It takes an error of at least 0.5 to misclassify a student truly at 3.0. The error rate almost evens out when we tally one-step misclassifications, because the student at 3.4 has little chance of being misclassified downward whereas the student at 3.0 has equal risks in both directions.

Table 9 is based on commonplace assumptions that should give good approximations for students in mid-scale. The probability of a misclassification of one step or more tells how likely it is that a student truly at 3.3, for example, who “belongs” in the 3 category, will be placed in some other cell. This kind of error is bound to be frequent. Those proposing to use student-level scores should be much concerned about two-step errors, for example where the student at 3.3 is reported as at 1 or 5.
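One set of assumptions that yields figures close to those of Table 9 is sketched below. The assumptions are ours, since the report does not spell out its own: measurement error is normal with the stated SE, a student's true score is equally likely to lie anywhere within a category one point wide, and the ends of the scale are ignored (reasonable for mid-scale students). The function name and the grid size are arbitrary choices.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def misclassification(se, steps=1, grid=2000):
    """Chance that the reported category is off by at least `steps` categories."""
    half = steps - 0.5        # distance from the category centre at which the report is off by `steps`
    total = 0.0
    for i in range(grid):
        u = -0.5 + (i + 0.5) / grid   # true score relative to the category centre
        inside = normal_cdf((half - u) / se) - normal_cdf((-half - u) / se)
        total += 1.0 - inside
    return total / grid               # average over the assumed uniform true scores

for se in (0.8, 0.7, 0.6, 0.5):
    print(se, round(misclassification(se), 2), round(misclassification(se, steps=2), 2))
```

Averaging over the whole category is what makes the one-step risk so high: students near a border are misplaced by very small errors.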

Table 9. Risk of misclassification associated with various SEs

SE	0.8	0.7	0.6	0.5
Probability of misclassification (at least one-step)	0.56	0.51	0.45	0.39
Probability of two-step misclassification	0.08	0.05	0.02	0.01

For the year of this report, we consider an SE of 0.7 tolerable, although 5% of students will have grossly incorrect reports.¹⁷ Our choice of the 0.7 level is a device to simplify our comments; we are not recommending that State decision makers adopt that standard, because the decision is in the best sense a political one. And we would not speak of an SE of 0.7 as “tolerable” when and if the CLAS report affects the student more strongly than seems likely this year, or when important decisions rest on the CLAS score without regard to other information in the school record. One-step misclassifications would then be a matter for concern. Higher stakes will make the reporting of confidence bands imperative.

¹⁷

It is pointless to ask whether this standard is more or less severe than the 2.5% proposed for judging school reports. The two reports are on different scales (and the penalties for erroneous reports are not comparable). The SE of 0.7 for individuals is on the same scale as the SE of 0.14 that we considered attainable for school means.

We warn of an additional pitfall. Any teacher who compares or ranks students on the basis of PL values is likely to go astray. Students who took the same form can fairly be compared. Unless and until an equating system is in place, CLAS will be unable to compare fairly two students who took different forms.

We do not know what CLAS can do to improve 1994 scoring at this late date. But our earlier proposal to allow intermediate marks such as 3.4 or 3.5 is one inexpensive step. When a judge who believes that a response deserves a 3.5 is forced to call it a “3” or “4”, the report is false by one-half point. Such distortions mount up quickly in the error variance. Statistical summaries of scorer disagreement should be performed as 1994 scoring proceeds, with particular attention to whether errors at certain scoring sites are relatively large.¹⁸ A letter from a school principal astutely suggests also a study of scorer agreement early and late in a long day.

¹⁸	The appropriate computation would obtain the variance between raters within papers, or, in a crossed design, the usual p, r, and pr components. The correlation coefficient and counts of agreements do not address the question adequately.

Estimated standard errors

Estimating the likely SE for student reports requires assumptions. SEs will be lowered if our proposal to record intermediate ratings is adopted, but we have statistics only from tryouts with whole-number scoring. The Committee does not know how multiple-choice (MC) and open-ended (OE) scores will be combined in Reading, or how the new Constructed Response section will be weighted in Math. Decisions such as these can change the practical meaning of the PL scale and hence the meaning of an SE expressed in those units.

Writing is the simplest area to deal with. The plan is to score two tasks for every eighth grader. We assume that the PL score will simply be the average of two integers in the 1-to-6 range. From 1993 “pilot study” data, we estimate the SE in Grade 8 to be 0.55.¹⁹ This appears to be a satisfactory level of accuracy for reporting student scores.

Reading in 1994 has two parts. There is an open-ended (OE) section of two tasks. Again presuming simple averaging, analysis of a data set like that used for Writing gives an SE close to 0.5. This number is again on the 1-to-6 PL scale. The multiple-choice (MC) section is somehow to be merged with OE to get the final PL. The final SE will surely be lower than 0.5 (if MC is placed on the same scale and the final score is a weighted average). Because the SE of 0.5 is in the acceptable range, we have not tried to forecast how much combining the parts improves the SE. Thereby we avoid guesswork about a mapping process not yet developed.

¹⁹	This is based on a calculation made for the panel by CTB on June 2, 1994. The PL scores for two tasks were entered in a pupil by task generalizability study. (See Shavelson and Webb, Generalizability theory: A primer, pp. 27 ff. [Newbury Park, CA, Sage, 1991].) The scores apparently were in integer form despite being weighted averages of RE and WC. About 4000 students, distributed over 10 tasks, also took one common form. Scores for each group of students taking one variable form were analyzed; then findings were averaged over groups. In all, about 8000 students entered the calculation.

An important technical fact: Analyses made for the panel and delivered by CTB on June 17, 1994 indicate that the SE is essentially uniform for WR-8 PL's at levels from 2 to 5. The same statement holds for RD-8. This supports an assumption underlying our statements about confidence bands. (There are too few 1's and 6's for a conclusion about SEs at the extreme.)

In 1994 Mathematics, there are 2 OE, 7 MC, and 8 Constructed Response (CR) tasks. The squared SE for OE is estimated at 0.23²⁰ using the 1-to-6 scale. That for MC is 0.38 using the students' scores on a percent-correct scale. In 1993, each 14% correct was counted as roughly equivalent to one-half point on the 1-to-6 scale, implying a squared SE on that scale of 0.48. Now we must speculate. Suppose that CR has the same SE as MC, and that as a first step those scores are averaged, yielding a squared SE of 0.24 (because a longer test has less error). Then suppose that that composite is averaged with OE. We wind up with a speculative SE of $\sqrt{(0.24 + 0.23)/4}$, or roughly 0.35.

²⁰	This is based on Tables 4.43 and 4.44 of the Draft Technical Report.
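The arithmetic behind the speculative Mathematics SE can be retraced in a few lines. The sketch assumes simple equal-weight averaging at both steps and independent errors in the sections being averaged. Both are assumptions made for illustration, since the actual combining rules had not been fixed.

```python
import math

# Squared SEs on the 1-to-6 scale, as quoted in the text.
var_oe = 0.23    # open-ended section
var_mc = 0.48    # multiple-choice section after conversion to the PL scale
var_cr = var_mc  # the text's supposition: CR has the same SE as MC

# Averaging two independent scores: Var((X + Y) / 2) = (Var X + Var Y) / 4.
var_mc_cr = (var_mc + var_cr) / 4     # MC and CR averaged
var_total = (var_mc_cr + var_oe) / 4  # that composite averaged with OE

print(round(var_mc_cr, 2))
print(round(math.sqrt(var_total), 2))
```

With these inputs the first step gives the 0.24 quoted in the text, and the final SE comes out near 0.34, in line with the rounded figure of 0.35.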

The estimated errors for all three areas are in the same ballpark. In rethinking the mapping rules CLAS will perhaps, in effect, stretch the scales, so as to counteract the shrinkage inevitable when more tasks are averaged. We doubt that the stretching will raise the SE beyond 0.7. Although we have had to engage in patchwork reasoning on one of the most serious questions before the panel, we conclude that the errors in 1994 Grade-8 student scores are tolerable in all three areas. It is of course imperative that CLAS determine the SEs and confidence bands accurately when judging and mapping are complete, and that this information on uncertainty be communicated effectively.

A FINAL RECOMMENDATION

The Committee recommends that reporting of scores on individual students in 1994 be limited to experimental trials in a few schools or districts. These should be volunteers but they should represent diverse communities and a range of CLAS-1993 performance levels. If this requires reversal of decisions previously made by the State Board of Education and the legislature, we recommend such reversal. The decision is not unthinkable. At some date in 1994 CLAS did abandon its announced plan to report on individuals in Grade 4 in addition to Grade 8.

The assessment community in California and throughout the nation is being pressed to deliver dependable information when the groundwork has not been laid. A well-intentioned and popular ruling can do harm if it ignores potential hazards from rapid action. CLAS-1993 was not sufficiently accurate. Even recognizing that improvements have been made and will continue to be made, CLAS-1994 is still a trial run to verify that quality control is adequate. Significant problems remain to be solved at the level of school assessment and reporting. We advise against embarking on large-scale reporting of student scores until CLAS has demonstrated its ability to deliver consistently dependable reports on schools.

An inescapable dilemma: An assessment that tries to report at the school level and also at the student level must compromise. What improves the validity of one report (within a fixed amount of testing and scoring time) will impoverish the other. This policy choice seems not to have been recognized, let alone resolved.

Equating of forms and equal opportunity to learn are vital concerns. A further year of preparatory work would allow for attention to these issues and for studying the way students, parents, and teachers react to and use reports on individuals. We did find evidence of adequate reliability in pilot runs of the 1994 design, but the analysis was not based on regular field trials and rested in part on speculation about scoring rules yet to be developed.

We applaud the energy and imagination that have gone into CLAS to this point. We would not wish to see confidence in its potential undercut by premature expansion and extension.

ADDENDUM

(by Lee J. Cronbach, December 1994)

I comment here on ideas that have surfaced since the Select Committee report was prepared. They emerged in my conversations with various measurement specialists, but they represent work in progress, not documented proposals. I mention them for consideration by persons making analyses similar to ours in other contexts.

We used the finite model for converting estimated variance components into school-level standard errors, but we did not use a finite correction in estimating variance components. David Wiley informs me that the finite model will be used at both stages of analysis for the 1994 CLAS.
It appears that with the finite model students should be identified by the class in which they received the relevant instruction. In the analysis of CLAS 1993, a component such as the class-by-task interaction (possibly reflecting a teacher's emphasis) is included with the pupil-by-task interaction. But if all classes are appropriately represented in the sample, the ct interaction does not contribute to error in the school score.
At one place in the End Note we introduced n*, the average number of pupils per form, to recognize that this varied from form to form within a school. It now appears that the finite correction on the pf component is a function not of the simple average but of the harmonic mean. Wiley and I have examined this in a limited way but a proper algebraic proof remains to be laid out.

I believe that these changes in analysis (not all of which would have been practical with CLAS 1993) would not change the conclusions of our report, although of course specific numbers would be altered.

I mention also that erroneous numbers at two points in the original report are corrected in the version reproduced here. On p. 37, in the second paragraph, “28%-44%” replaces “19%-52%”. And on p. 68, at midpage, “1” replaces “2”.

End Note

Standard error calculations for school scores

At the heart of the panel's analysis of the quality of CLAS information on schools is the standard error (SE), the index of uncertainty for school scores. To estimate this SE, the panel had to go beyond available formulas, so this note must elaborate. With an eye to the broad California audience, however, we write in as nontechnical a style as the content allows. Note especially that we omit the words for “estimates of” and “approximately” and the mathematician's symbols for them, where they would be required in a technical publication.

The persons mainly responsible for the decisions about analysis discussed here were Lee J. Cronbach and David E. Wiley. We acknowledge a key suggestion from Haggai Kupermintz.

It will fix ideas to speak only of the Reading test in Grade 4. Each student devoted a period to answering a single question. Forms (questions) were spiralled over students. Responses were scored on a 1-to-6 scale referring to distinct “performance levels.” Although some papers were scored twice for the sake of quality control, just one of the two scores was used.

Our formula evaluates the score for the school. In Reading, a cut between 3 and 4 separated off students scoring 4 or better; from this came the percentage above cut, or PAC. Student scores 4, 5, and 6 were recoded as 1, all others as 0. The analysis would apply to the original scores if an SE for the school mean is wanted.
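The recoding just described is simple enough to show directly. In this sketch the function name and the scores are invented for illustration; only the cut between 3 and 4 comes from the text.

```python
def percent_above_cut(scores, cut=4):
    """Recode 1-to-6 performance levels to 1/0 at the cut and return the PAC."""
    recoded = [1 if s >= cut else 0 for s in scores]
    return 100.0 * sum(recoded) / len(recoded)

# Invented scores for the students of one school.
school_scores = [2, 3, 4, 4, 5, 1, 6, 3, 4, 2]
print(percent_above_cut(school_scores))  # 50.0
```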

The main data for the study of error were counts of 1's and 0's for each form-school combination, in a subset of the CLAS-1993 scores. A set of many schools was analyzed together. We had as supplementary information pilot studies where each of a school's students in Grade 4 had been scored on two Reading tasks. There were several such pairs of tasks, and we analyzed one pair at a time.

Features of the data that required novel analysis included the following:

Multiple forms were used in any school (“spiralling” or “matrixing”).
Numbers of students scored varied with the school.
We judged that student bodies should be treated as finite populations, requiring use of a “finite population correction.”
We judged that forms should be regarded as samples from a large domain of acceptable tasks, and that the SE should recognize that source of variation as well as variation from sampling students.
CLAS-1993 had previously reported SEs calculated for each school in turn, and the panel set out to make better estimates from within-school data.¹
In some tests other than Reading, student scores on the 1-to-6 scale had been reached by nonlinear combination of scores from two types of task.
There were irregularities in spiralling. Within some schools, this or that form might be taken by three times as many students as an alternative form. CLAS adjusted the score for the school (“rescaling”), to recognize that otherwise the school report might give excess weight to easy forms, or to difficult forms.² This adjustment was applied to the final score, not to single forms or students.

The formula we developed is rooted in the statistical literature. Among relevant sources are W. G. Cochran, Sampling techniques, ed. 3, esp. pp. 388-391 (New York, Wiley, 1977); and Lee J. Cronbach and others, The dependability of behavioral measurements, esp. pp. 215-220 (New York, Wiley, 1972). The basic model regards a score as made up of a number of components. One subset describes the person or group being measured; other components are sources of uncertainty or error.

Our logic will be clearer if we present two stages of analysis separately. The first stage investigates three quantities:

σ²(p)	The variation of student true scores (those that would be obtained, hypothetically, by averaging scores from an extremely large number of forms).
σ²(pf)	The variation across forms of a student's form-specific true scores (an average over an extremely large number of trials).
σ²(e)	Nonreproducible variation (difference between the true score on a form and the performance score) arising chiefly from fluctuation in the efficiency of students and scorers.

¹	This initial decision appears to have been unwise; the SE calculated for a single school is subject to excessive sampling error. In this report we turned to simultaneous calculations on large files.

²	D.E. Wiley. Scaling performance levels to a common metric with test tasks from separate test forms. Appendix to Draft Technical Report (DTR), CLAS, April 19, 1994. (Monterey, CA, CTB/McGraw-Hill). We believe that the rescaling does not systematically increase or decrease the components that enter our analysis. The basis for the belief is the fact that in any school where the same number of students take every form, the rescaling does not change the scores or the school distribution. We found no way to account for whatever small change in the PAC results from rescaling.

The last two combine to form σ²(pf,e), referred to as “res” (for residual) in the body of the report.

The constants to be used in the formula are n, N, k, and m. N can be thought of as the school enrollment in Grade 4, and n as the number of students scored.³ In Reading and Writing the number of forms spiralled in a school is k. In one place we use the average number of students scored per form: n* = n/k.

A standard analysis of variance produces the so-called “Mean Square within [forms]” (MSw). This estimates the sum of σ²(p) and σ²(pf,e), but these should be weighted differently.

Because the two values cannot be separated in the main data, we first went to pilot study data where it was possible to separate σ²(p) from σ²(pf,e). A ratio m (= σ²(p) divided by σ²(p) + σ²(pf,e)) was calculated for several sets of pilot data.⁴ Values of m varied around 0.3 in RD-4, WR-4, RD-8, and WR-8. (There were no Grade-10 pilot data.) The data available suggested using 0.5 for m in MA-4 and MA-10. Then we defined these estimates:

for σ²(p), m times MSw;

for σ²(pf,e), (1 − m) times MSw.
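The two estimates can be illustrated with a small computation. The sketch below obtains a pooled within-cell mean square from 1/0 scores and splits it by the ratio m. The data are invented, the function names are hypothetical, and a real analysis would pool over the form-school cells of many schools rather than the three cells shown.

```python
def mean_square_within(cells):
    """Pooled within-cell mean square from a one-way analysis of variance.

    cells: list of lists of 0/1 scores, one inner list per form-school cell.
    """
    ss = 0.0
    df = 0
    for cell in cells:
        mean = sum(cell) / len(cell)
        ss += sum((x - mean) ** 2 for x in cell)
        df += len(cell) - 1
    return ss / df

def split_components(msw, m):
    """Return the estimates of sigma^2(p) and sigma^2(pf,e) defined in the text."""
    return m * msw, (1.0 - m) * msw

# Invented 1/0 scores: one school, three forms, four students per form.
cells = [[1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 0, 1]]
msw = mean_square_within(cells)
var_p, var_res = split_components(msw, m=0.3)
print(msw, var_p, var_res)
```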

³	In empirical work N was based on a count of Student Information Forms, and might be less than the enrollment.

⁴	These studies originally produced DTR Table 4.37 and similar tables. At our request CTB recoded the Reading and Writing data to the 1/0 scale and analyzed to obtain within-school p and res components for dozens of schools where two forms had been given to the same students and scored. In Mathematics it was impractical to recode to the 1/0 scale because of the division into MC and OE sections. The adjusted p and pf,e components in the pilot analysis were observed to be nearly equal and that was the basis for setting the value of m in Mathematics at 0.5.

The first stage (incomplete) formula then is:

$$\frac{1}{n}\left[\left(1 - \frac{n}{N}\right)\sigma^2(p) + \left(1 - \frac{n^*}{N}\right)\sigma^2(pf,e)\right]$$

The more students sampled, the smaller the contribution of each source to the SE; the multiplier (1/n) in the formula allows for that. The “finite correction” (1 − n/N) recognizes that if all students in a school are tested no variation arises from sampling of students. The model set forth here suggests another finite correction on the pf term, to recognize that the sample size for any form is (on average) n*. The e term should have no finite correction. As we have no way to separate e from pf, the multiplier slightly understates the e contribution.⁵
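The first-stage calculation can be sketched as follows. The expression coded here is a reading of the verbal description above: a multiplier 1/n, a correction (1 − n/N) on the p term, and a correction (1 − n*/N) on the combined pf,e term. It should be taken as an assumption, not as the panel's exact formula. The school figures are invented.

```python
def stage1_error_variance(n, N, k, msw, m):
    """First-stage squared SE, as read from the surrounding description.

    n: students scored; N: enrollment; k: forms spiralled;
    msw: mean square within forms; m: ratio sigma^2(p) / MSw.
    """
    n_star = n / k            # average number of students scored per form
    var_p = m * msw           # estimate of sigma^2(p)
    var_res = (1.0 - m) * msw # estimate of sigma^2(pf,e)
    return ((1.0 - n / N) * var_p + (1.0 - n_star / N) * var_res) / n

# Invented school: 90 enrolled, 60 scored, 6 forms, MSw = 0.24, m = 0.3.
print(stage1_error_variance(n=60, N=90, k=6, msw=0.24, m=0.3))

# If every student is scored (n = N), the p term drops out entirely.
print(stage1_error_variance(n=90, N=90, k=6, msw=0.24, m=0.3))
```

With k = 6 the correction on the pf,e term can never fall below 1 − 1/6, or about 0.83, because n* = n/k cannot exceed N/6.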

The components omitted from discussion so far are these:

σ²(s)	The variation among true school PACs.
σ²(sf)	The variation in the school's true PAC from form to form.
σ²(f)	Variation among averages for the forms, considering all schools.

The s variance represents information about schools' true student performance, and does not enter the standard error.

If forms in the assessment are regarded as random samples from a domain of suitable tasks, as is customary in present performance assessments, then the selection of forms constitutes a source of random measurement error. If CLAS develops a plan identifying particular subareas of content (“strata”) within a field and specifying the emphasis to be placed on each one, and then creates sets of forms to fit that pattern, this will in subtle ways redefine what is measured. Such test construction (plus suitable pilot-study design) permits analysis that subdivides our f and sf components into stratum differences (which are then “true variance”) and task-within-stratum differences that count as error. Stratification is an important issue that no performance assessment is yet ready to deal with, so we necessarily treat the entire f and sf components as measurement error.

The last step in obtaining the full SE is to calculate

$$\frac{1}{k}\left[\sigma^2(f) + \sigma^2(sf)\right]$$

which is treated as constant for all schools.⁶ This is added to the stage-1 sum. The square root of the grand total is the SE.
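The final step can be sketched in the same way. The form term used below, (σ²(f) + σ²(sf))/k, is an assumption based on treating the k forms as a random sample of tasks. The component values are invented, and the function name is hypothetical.

```python
import math

def full_standard_error(stage1_variance, var_f, var_sf, k):
    """Add the form-sampling term to the stage-1 sum and take the square root.

    ASSUMPTION: the form term is (sigma^2(f) + sigma^2(sf)) / k, the usual
    contribution when k forms are regarded as sampled from a large domain.
    """
    form_term = (var_f + var_sf) / k
    return math.sqrt(stage1_variance + form_term)

# Invented values on the 1/0 scale.
se = full_standard_error(stage1_variance=0.0029, var_f=0.002, var_sf=0.004, k=6)
print(round(100 * se, 1))  # SE of the PAC, in percentage points
```

Because the form term does not shrink as more students are scored, it sets a floor under the SE of even the largest schools.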

⁵	Dan Horvitz has developed an alternative model starting with the concept of a two-stage sample. It suggests applying no finite correction to the pf term. The difference is practically unimportant because this End Note's correction for that term always exceeds 0.83.

⁶	The sf component probably varies with the school. In some circumstances it would be sensible to estimate an average value for schools of a defined type, but values calculated for single schools are unstable.

Analyses adapted to the area

The analysis for Reading was straightforward, once we had found an efficient approach through trial and error. A file of Grade-4 data consisted of 222 schools where 7 or more students had been scored on each of the 6 forms (one form per student). Where there were more than 7 students per school-form combination the number was cut back at random; then analysis of variance was carried out. (Reported by CTB on July 19, 1994.) This was the basis for estimating components required in the formula. The Grade-10 analysis was similar save that there were 750 schools. The samples for all our estimates of components tended to consist of larger schools, but variance components probably have little systematic relation to school size.
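The data preparation described above (keeping only schools with enough students on every form, then cutting larger cells back at random) can be sketched as follows. The data layout, the function name, and the seed are hypothetical; the CTB procedure itself is not documented here.

```python
import random

def cut_back_cells(cells, forms, size, seed=0):
    """Keep schools with at least `size` students on every form, then trim
    each school-form cell to exactly `size` students chosen at random.

    cells: dict mapping (school, form) -> list of scores.
    """
    rng = random.Random(seed)
    schools = sorted({school for school, _ in cells})
    balanced = {}
    for school in schools:
        if all(len(cells.get((school, f), [])) >= size for f in forms):
            for f in forms:
                balanced[(school, f)] = rng.sample(cells[(school, f)], size)
    return balanced

# Invented 1/0 scores: school "A" qualifies, school "B" is short on form 2.
cells = {
    ("A", 1): [1, 0, 1], ("A", 2): [0, 1],
    ("B", 1): [1, 1, 0], ("B", 2): [0],
}
balanced = cut_back_cells(cells, forms=[1, 2], size=2)
print(sorted(balanced))
```

Equal cell sizes keep the analysis of variance balanced, at the cost of discarding some scores and all of the smaller schools.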

The Mathematics score in 1993 posed a special problem. There were 8 multiple-choice (MC) tests, and each of these was paired with its own 2 open-end (OE) tasks. For any student, one of the two OE forms was scored. The 16 combinations were not independent. It was necessary to estimate separately effects associated with MC tasks, OE tasks nested within MC, and their combination. If we label the corresponding simple effects m, and m,om, and the interactions with school as sm, and so, som, then the expression for the combined f and sf contributions changes to

If the number of OE or MC tasks changes, the multipliers in the formula will change accordingly.

The necessary components for Grade 4 were estimated from a file on 184 schools that had all 16 forms spiralled within the school, with at least 2 students per combination; again, larger cells were cut back to size 2. The story in Grade 10 is the same, with 98 schools and 4 students per combination. The analyses of variance were reported by CTB on July 16, 1994.

In the Writing design for Grade 10, forms were again not independent. (See Draft Technical Report, Table 1.1 and Table 1.3.) Six Reading selections were spiralled over students. Students responding to a given Reading form were then assigned to one of two Writing tasks. We treated the data as if there were 12 forms, which tends to understate the SE. On the other hand, the plan recognized four types of writing (e.g. speculation, reflective essay), and these were equally represented among the 12 tasks. Taking this stratification into account would potentially reduce the SE. It is doubtful that the benefit from stratification can be assessed without a radically new pilot-study design. The data from Grade 10 came from 351 schools, with 7 students per cell. (Analysis of variance reported July 19, 1994.)

Grade 4 presented 6 forms, four calling for expressive and two for persuasive writing. Independence was violated by using the same writing question in two forms, the difference between forms being the associated reading selection. Analyzing as we did with k = 6 tends to understate the SE. Ignoring the stratification may work in the opposite direction. The file of data came from 217 schools with 7 students per cell.