I am not free to cite the particular misinterpretation, but it had to do with scoring and that is worth a paragraph here. As scorer error has been a technical concern for 80 years,4 you might think we would know how to appraise scorer accuracy. We don't. The literature is chaotic with alternative methods of summary that sometimes seem designed to not face the truth. One favorite simple scheme is to report the percentage of times scorers agree when they independently score the same student responses. To people who adopt that routine, a finding of 60 percent agreement is likely to be the basis for an announcement that “our scoring is accurate.” I have seen one such instance, not atypical, in which 45 percent of students scored at 3 on the scale. Thus, if a check scorer were to assign a 3 to every paper, without reading the response, the percentage of agreement would be 45. From that baseline, 60 percent isn't very far along the road to perfection.
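
The arithmetic behind this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a procedure proposed in the text: the 20-paper data set, the function names, and the rescaling of observed agreement against the "assign a 3 to every paper" baseline are all hypothetical choices made here. The rescaling also differs from standard chance-corrected indices, which estimate the baseline from both scorers' score distributions.

```python
from fractions import Fraction


def percent_agreement(first, second):
    """Share of papers on which two scorers gave the identical score."""
    if len(first) != len(second) or not first:
        raise ValueError("need two equal-length, non-empty score lists")
    matches = sum(a == b for a, b in zip(first, second))
    return Fraction(matches, len(first))


def progress_beyond_baseline(observed, baseline):
    """Fraction of the distance from baseline to perfect agreement covered."""
    return (observed - baseline) / (1 - baseline)


# Hypothetical data: 20 papers, 9 of them (45 percent) scored 3 by the first scorer.
first_scorer = [3] * 9 + [2] * 6 + [4] * 5
# A check scorer who assigns 3 to every paper without reading the response.
lazy_scorer = [3] * len(first_scorer)

baseline = percent_agreement(first_scorer, lazy_scorer)  # 9/20
observed = Fraction(60, 100)  # the reported 60 percent agreement
progress = progress_beyond_baseline(observed, baseline)  # 3/11

print(f"baseline agreement: {float(baseline):.0%}")
print(f"progress toward perfection: {float(progress):.0%}")
```

Under these assumptions, the reported 60 percent covers 3/11, a little over a quarter, of the distance between the no-reading baseline and perfect agreement.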

LIMITATIONS ON TESTING

Professional organizations are not prepared to protect the public and the schoolchildren against irresponsible promotion and the advertising of inaccurate results as accurate. No amount of adverse comment by professionals could get the 1980s Department of Education to abandon its totally misleading wall chart. You may recall that it ranked the states according to the mean scores of students who volunteered to take the Scholastic Aptitude Test (SAT). That was as unrepresentative a sample as you could get. But it was impossible to drive out the chart device under the administrations of those years. It has often taken either a press outbreak or a political fight to get major reviews. The CLAS review wouldn't have happened if there had been no Los Angeles Times attack. NRC wouldn't have done the thorough job it did on the General Aptitude Test Battery5 if there hadn't been congressional dispute. Over and over we see that it takes a political event or a public relations disaster before attention zeroes in on the quality of a testing program.

George Madaus and some others have been arguing that the professional test standards should be made enforceable. They were designed to be an educational device, and I

4 Starch, D., and Elliott, E.C. (1913). Reliability of grading work in mathematics. School Review 21: 254-259.

5 Hartigan, J.A., and Wigdor, A.K., eds. (1989). Fairness in Employment Testing: Validity Generalization, Minority Issues, and the General Aptitude Test Battery. Washington, D.C.: National Academy Press.







A Valedictory: Reflections on 60 Years in Educational Testing

would hate to see that change. The committee I chaired, which produced the first version of the standards in 1954, grew out of an American Psychological Association Committee on Ethical Standards. Among other ideas, that committee had suggested setting up a seal of approval for tests that were of high quality. Our group rejected that notion from the start, primarily because any test has many possible uses, and no official stamp of approval could fence off the sound uses of the test from unacceptable uses or uses not yet well researched.

Our committee aimed to set down the questions publishers should answer so that a trained test user could decide how adequate a test would be for the local purpose. We wanted not to discourage trial of new tests and new applications. We did urge test developers to limit claims, but we expected users to pioneer new applications. If we certify Test X for Use 8, there is a strong hint that practitioners shouldn't be trying it for Uses 7 and 9. Of course you should be trying out any reasonable application and checking on the quality of the result.

Standards committees have regarded tests as emerging in an orderly market in which a test would, over several years, find one or more niches. In no way can a code that was designed to promote professional use of documentation, released when the test is marketed at the end of a developmental research period, be applied to programs that are rushed into operation to meet politicians' deadlines. In the new assessments, documentation seems likely to lag two years behind application of a test to shape the fates of its targets.

The National Educational Standards and Improvement Council (NESIC) has been given the awesome task of certifying assessments as meeting professional standards. I doubt that standards can be written that would be definite enough to be enforceable yet general enough to apply over even a limited area of testing. Even in so mature an area as mathematics testing, in which the National Council of Teachers of Mathematics has laid excellent groundwork, one cannot apply stereotyped questions to the instruments educators have devised. Educators' judgments should not be circumscribed by the psychometric specialist's enthusiasm for intertask correlations or linear combinations of test scores. But this means that every assessment requires invention of new methods of psychometric checking.

I am still somewhat numbed by the mismatch between the kinds of score our most sophisticated procedures have dealt with in the past and the structures of test forms and scoring rules I see in some current assessments. I do not say that the teachers who developed the forms were wrong. I say only that I do not expect a priori specification of analyses to define the work that would properly defend a novel assessment. I hope that the board can invent a device for this social need that is right for these times, and not try to resolve the problems within the already overextended test standards.