Even when we phased in the new kind of formula for our post hoc analyses, we asked the contractor to calculate SEs for each school in turn. This continued for weeks, even down to the semifinal report we took to Sacramento for review. When we finally paused to think, it was obvious that single-school estimates are highly unstable. So we shifted to estimates from multischool files. Why did we invest so much computer time in school-by-school analysis? Simply because the CLAS report had done that, and we kept amending the original scheme, instead of starting afresh. This story should give some sense of how lost people get—even people with a good deal of experience—when rushed into a situation that has little precedent. Enough of true confessions. Enough of technical talk.
The public should have been told about the SEs. If you are ready to pay a larger price to live in the Number One school district in Monterey, the newspapers should be telling you whether that ranking is a solid fact. Perhaps four schools prepared students well enough to be in the running for the top position, and only measurement error determined the winner. The public can understand the uncertainty if given information expressed in overlapping confidence bands. And it is not just lay persons who need SEs. In one program, state testing directors were told the SE for a state-level report that would result if they went to the pains of careful stratified sampling. That SE was large enough to kill all interest in the state-level report.
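To make the point concrete, here is a small sketch of how overlapping confidence bands convey that uncertainty. All the numbers (school means and SEs) are invented for illustration; the 1.96 multiplier gives conventional 95% bands.

```python
# Hypothetical example: four schools whose mean scores differ by less
# than measurement error. Means and SEs are invented for this sketch.

schools = {
    "School A": (712.0, 6.0),   # (mean score, standard error)
    "School B": (709.5, 6.5),
    "School C": (707.0, 5.5),
    "School D": (705.0, 7.0),
}

Z = 1.96  # multiplier for a 95% confidence band

# Band for each school: mean plus or minus Z standard errors.
bands = {name: (m - Z * se, m + Z * se) for name, (m, se) in schools.items()}

# The nominal leader is the school with the highest observed mean.
top = max(schools, key=lambda name: schools[name][0])
top_lo, _ = bands[top]

# Any school whose band overlaps the leader's band is "in the running":
contenders = [name for name, (lo, hi) in bands.items() if hi >= top_lo]

for name, (lo, hi) in bands.items():
    print(f"{name}: {lo:.1f} to {hi:.1f}")
print("In the running for Number One:", contenders)
```

With these invented figures, every one of the four bands overlaps the leader's, so the "Number One" ranking is not a solid fact: measurement error alone could have determined the winner.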
NONSCIENTIFIC MOTIVES IN ASSESSMENT ORGANIZATIONS
Organizations running assessments understandably have agendas that run beyond supplying the whole truth. They go to great lengths to hold the enthusiasm of an army of cooperating educators. They keep evidence from trial runs close to the chest because opponents with contrary philosophies will turn to political uses any honestly reported statistics about limitations. Apart from possible personal gains, the developers are building institutions. And they are often driven by organizations unconcerned with testing that are using the test to gain other ends.
A clear example is the high school Armed Services Vocational Aptitude Battery (ASVAB), which was created primarily to give military recruiters a list of promising targets for a sales pitch. In the mid-1960s, ASVAB was disastrously bad, both as an instrument and as a system. When the original plan was sketched, the personnel psychologists in the military services rejected it.
Therefore, the Pentagon turned the project over to a contractor I have never identified. That outfit was so remote from the world of testing and counseling that they offered the high school youngster a printout showing a profile of eight scores, on a page with green, yellow, and red areas. The message was: “Find your scores up in the green area, and use them to enter the occupation catalog to see what jobs you have the talent for.” Apart from all else that was wrong, the scores in that profile were so highly correlated that tens of thousands of examinees had profiles entirely in the red area. What kind of guidance are you giving when all signals say “Stop”?
I was outraged enough and, for once, smart enough to get my review into the Congressional Record, instead of the Buros Yearbook that had commissioned it. And a Quaker group, pursuing its agenda, generated enough publicity that the colored profile, and a few other equally silly features, were eliminated. No fundamental changes were made in that period. But, before long, the recruiters asking schools to sign up for the next year's tests were telling the schools about these changes and claiming that I was now fully satisfied. I was not. I might have devoted 10 years to harrying ASVAB, but I didn't. ASVAB won the war by default. Another principle from James March is that, in the end, it is the people who stay with an office or activity for years who make the policy decisions, not the people who beat a drum or wield a cudgel for a short time and leave the field. (I have no opinion on the ASVAB as it is today.)
The larger point in this section is that the organization of large assessments, just because they are large, requires them to conceal facts that cast a shadow. One project, for which I was a statistical analyst, produced achievement tests that showed profiles for types of competences or errors within a school subject. I found that one score being fed back to teachers had a negative reliability coefficient. I told the substantive specialist that this was shocking, and that the score should be suppressed. I learned, through a back channel, that the project director broke off an important meeting in some agitation because he “had to get back to the office, where a statistician was causing trouble.”
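For readers wondering how a reliability coefficient can be negative at all, here is a sketch (with invented data, not the project's) of Cronbach's alpha computed for a two-item score whose items are negatively correlated, as can happen when items are miskeyed or measure opposed things:

```python
# Invented illustration: Cronbach's alpha goes negative when the items
# making up a score covary negatively, so the "total score" is noise.

import statistics

# Two items answered by six examinees; high on one tends to mean low
# on the other.
item1 = [5, 4, 5, 2, 1, 2]
item2 = [2, 2, 1, 4, 4, 5]

totals = [a + b for a, b in zip(item1, item2)]

k = 2  # number of items
sum_item_var = statistics.variance(item1) + statistics.variance(item2)
total_var = statistics.variance(totals)

# alpha = k/(k-1) * (1 - sum of item variances / variance of total)
alpha = (k / (k - 1)) * (1 - sum_item_var / total_var)
print(f"Cronbach's alpha = {alpha:.2f}")  # negative
```

When the items cancel each other out, the variance of the total shrinks below the sum of the item variances, driving alpha below zero; a score like that carries no consistent signal and, as the text argues, should be suppressed rather than fed back to teachers.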
I understand now that far-flung and diverse collaborators in an educational change are fickle and must not be given cause to doubt. The staff that closed ranks against my unwelcome fact was doing right by the reform movement it served. A whole branch of an organization may be set up to promote faith in the assessment; its ethics are not those of assessors. In one instance, a technical report was based on preliminary data assembled to inform persons within the project. The report was fully honest, including statements as to which statistics could be misinterpreted and what cautions were needed. The promotion branch had a bulletin that went to thousands of educators who were potential collaborators. It put into the bulletin a summary of the most positive numbers, with no mention of the cautions and probable misinterpretations.