or even takeover of the school system from outside. If a hospital is told that it must gradually bring its mortality rate in surgery down to zero, the obvious ploy is to stop operating on patients who are at risk. A hospital would not elevate its initial death rate in order to have more room for improvement. But in education, when payment on the basis of pretest/posttest differences was tried as a policy, schools became very clever about how to get low pretest scores.

A tale told about Kentucky is too cute to be true, but it makes my point. It is in Grade 4 that poor assessment outcomes trigger Kentucky elementary school sanctions. So, the story goes, one school came up with the clever idea of not promoting into the fourth grade the weakest fraction of the third-grade class. They would be taught in a third-grade classroom for another year and, when the next assessment was safely over, would be boosted directly to Grade 5. Even if that story isn't true, there will be true stories like it.

WRESTLING WITH MEASUREMENT ERROR

If teachers are far short of the insight needed to deliver what the new assessments are asking of them, so are test developers, specialists in psychometrics, and the organizations that run the assessments and analyses. As I am about to refer to error after error in handling psychometric questions, let me put into the record that I make errors in great number. In normal professional activities, I have always protected myself by associating with reflective colleagues, and by using graduate students as bomb sniffers. It takes time for these multiple rounds of collegial review and revision. In the current rush to 2000, we do not have the luxury of getting an analysis right before a design is put in place. A request for proposal for drawing a sample of a certain kind, or writing a test to certain specifications, has to be issued by the project before the technical panel advising on the assessment plan has begun to digest the information from last year's trial run.

In fact, one problem that our report on CLAS highlighted was the substantial variation across the judges scoring 1993 writing exercises; this did not jibe with the program's public claims of accurate scoring. Buried in a thick technical report (December 1992), we found neat tables of statistics on scoring error from pilot studies. These could and should have sent a warning signal. But the technical research was only a gesture, its findings filed without interpretation.



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.

A Valedictory: Reflections on 60 Years in Educational Testing

I have been encouraged by the board to insert, in this printed version of my remarks, some comments about the events surrounding CLAS. CLAS had promised to deliver school reports on 1993 tests when there had been inadequate piloting of the instruments and of managing a great volume of test papers. That promise to report was extorted by the legislature as a condition of funding.

The report of our review panel was able to say that CLAS had been pioneering along profitable lines and had handled some important concerns. We found no fault with its much-criticized decision to score only a fraction of pupil responses to get a school score, but we did find the sampling plan and execution unsatisfactory. The program accepted the criticisms, having already begun on its own to revise plans for 1994 operations.

The least tractable problem was and is error of measurement, and CLAS's difficulties seem to have matched experience in other states. Assessment time is available for only a limited number of cognitive “performance” tasks, and multiple scoring of papers is costly. Enough change had been made in the tests and scoring plans that the panel could be cautiously optimistic about 1994 school-level reports. The panel recommended against the planned reporting of individual scores for eighth-graders because pupil-level errors were likely to be unacceptably large. The state superintendent ruled that CLAS would not release individual scores except in schools in which their reliability had been verified.

The legislature authorized a 1995 budget for CLAS, with changes designed to quiet major public objections of a nontechnical character. The governor vetoed the legislation. Governor Wilson insists on the reporting of individual scores as the primary mission of the program and would rely on a multiple-choice examination because, with more questions per class period, it could achieve acceptable reliability.
That vision places a premium on accountability and pushes against the current educational reform.

I return now to the gap between the demands of assessments and the state of our technical art. When CLAS scores for schools appeared in the local newspaper in March, I was shocked to find not only an absence of standard errors, but also the absence of any hint that findings were subject to error. I expressed this shock to a top-ranking psychometric expert I ran into. He assured me that standard errors were included in the report that went to the schools, though not in the report going to the newspapers. And he rattled off, I think with approval, the obvious formula that had been used. On the basis of my present understanding of CLAS, not my first impression, I have to say that not only was the formula the wrong formula, but also that the entire structure of the error analysis was nonsense. In the first instance, I suspect, some technician reached for a handy textbook formula, and no one was given responsibility for technical oversight and review. A review, if attempted, would probably have been insufficient. Finding a proper formula for the standard error (SE) strained to the limit the competence of the supposedly expert panel. I go into technical detail here, because I want you to understand the layers-within-layers nature of these problems, and the near-helplessness of experts working in snatches of time as consultants or outside observers.

The main CLAS report in, say, reading was a string of percentages corresponding to six score levels. Accompanying each percentage was its SE. Thus, Golden Poppy School had 109 pupils scored: 1.8 percent scored at Level 5, and the SE was given as 1.3 percent; 54 percent scored at 3, and the SE was given as 4.8 percent. The formula used is $\sqrt{pq/n}$, a version of the usual sampling error of the mean ($p$ is the proportion and $q = 1 - p$). The main fault is that the errors are intercorrelated. To make any sense of those numbers, you would have to locate the six percentages as a point in 5-space and surround the point with a hyperellipsoid, locating plausible percentage vectors for true score. (I don't think that mathematical idea could be stated more simply.) You can imagine how useless a sound interpretation of such numbers would be to the California public.

In its first overloaded meeting, our panel was asked to review the scoring plan for CLAS 1994, with an eye to the expected SEs. We were told rather blithely that, for logistical reasons, the final decision on the plan for choosing papers to score had to be made within the next 48 hours. So we tried to wrestle with this while dealing with quite a few other agenda. The first thing we did was to insist that the SE be attached to a cumulative percentage. At Golden Poppy, 27 percent were at or above 4; the SE for that avoids correlated error.
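
As an arithmetic check on the Golden Poppy figures, here is a minimal sketch in Python. It assumes the reported SEs come from the usual sampling-error expression sqrt(pq/n) with n = 109, which reproduces the 1.3 and 4.8 percent quoted above. The function names `binomial_se` and `multinomial_cov` are illustrative only and do not come from any CLAS software. The covariance line uses the standard multinomial result, Cov = -p_i p_j / n, to show in miniature why the six per-level errors are intercorrelated and why a single cumulative percentage sidesteps that problem.

```python
import math


def binomial_se(p: float, n: int) -> float:
    """Standard error of a sample proportion p based on n pupils: sqrt(p*q/n)."""
    q = 1.0 - p
    return math.sqrt(p * q / n)


def multinomial_cov(p_i: float, p_j: float, n: int) -> float:
    """Covariance between two category proportions from one multinomial sample."""
    return -p_i * p_j / n


n = 109  # pupils scored at Golden Poppy School (from the text)

se_level5 = binomial_se(0.018, n)  # 1.8 percent at Level 5
se_level3 = binomial_se(0.54, n)   # 54 percent at Level 3
se_cum4 = binomial_se(0.27, n)     # 27 percent at or above Level 4

print(f"Level 5:        SE = {100 * se_level5:.1f} percent")
print(f"Level 3:        SE = {100 * se_level3:.1f} percent")
print(f"At or above 4:  SE = {100 * se_cum4:.1f} percent")

# A nonzero covariance means the per-level errors cannot be read one at a time.
print(f"Cov(Level 3, Level 5) = {multinomial_cov(0.54, 0.018, n):.6f}")
```

The first two lines of output match the SEs printed in the school report. The cumulative figure is a single proportion, so one SE describes it without reference to the other five levels.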
The panel also agreed rapidly that the student body in Grade 4 at this school this year is finite, so a finite-correction multiplier is needed. The panel therefore proposed to determine 1994 sampling rules by the formula [formula missing in the source], where $n$ is the sample size, $N$ the Grade 4 enrollment, and $p$ is now the cumulative proportion above a cut. That formula, which the panel proposed as a basis for choosing $n$, was kept in the picture for weeks before the panel rejected it (long after the 1994 operations could use that advice).

During this first meeting, under an agenda item remote from scoring plans, the panel had laid down a principle it lost sight of in moving to a new topic, and we were slow to spot the contradiction. The test purports to represent competence in the reading domain, not on the particular tasks used. The SE, the panel said, must recognize the sampling of tasks and other measurement errors, as well as the sampling of pupils. The $pq$ formulas look only at pupil sampling, so they badly understate error. This thesis is not entirely new. But just what to do with school-level percentages in the irregular designs of the CLAS assessment was so uncertain that the panel never did agree on some details of the calculation.
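
To make the finite-correction idea concrete, here is a hedged sketch in Python. The panel's exact expression is not reproduced in this text, so the code assumes one common textbook form, sqrt(pq/n) multiplied by sqrt((N - n)/(N - 1)). The function name `se_with_fpc` and the enrollment figure N = 150 are hypothetical, chosen only for illustration; the text gives the 109 pupils scored but not the enrollment.

```python
import math


def se_with_fpc(p: float, n: int, N: int) -> float:
    """sqrt(p*q/n) times a finite-population multiplier.

    ASSUMPTION: uses the textbook multiplier sqrt((N - n) / (N - 1)).
    The panel's own expression is not given in the text.
    """
    if not 0 < n <= N:
        raise ValueError("need 0 < n <= N")
    if N == 1:
        return 0.0  # a single pupil, fully scored: no pupil-sampling error
    q = 1.0 - p
    return math.sqrt(p * q / n) * math.sqrt((N - n) / (N - 1))


p = 0.27  # cumulative proportion at or above Level 4 (from the text)
N = 150   # HYPOTHETICAL Grade 4 enrollment, for illustration only

for n in (50, 109, 150):
    se = se_with_fpc(p, n, N)
    print(f"n = {n:3d} of N = {N}: SE = {100 * se:.2f} percent")
```

Under this assumed form, the SE falls to zero when every pupil's paper is scored (n = N). That is the weakness the panel later recognized: an expression of this kind counts only the sampling of pupils, so it reports no error at all even though the sampling of tasks and the variation among scorers remain.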