can determine from the cross-tabulations provided by NAEP that the agreement rate was in every case only trivially higher than chance.6
The extent to which discrepancies among alternative tests or rubrics matter depends on how the results are used. If, for example, a test is used as one of several indicators of improvement, or only to support inferences about the approximate magnitude of change, modest discrepancies among tests might be of little consequence. On the other hand, if schools are to be rewarded or sanctioned based solely on a fixed numerical criterion for changes in test scores, the underlying limitations matter far more, for they mean that decisions about test construction—often unrelated to decisions about the aspects of performance for which schools are supposed to be held accountable—will influence which schools are rewarded or punished.
Adjusted (partial) estimates of performance are often unstable. In many systems, school-level scores—for example, mean or median scores for students in a given grade—are the basis for accountability. Because school averages are strongly influenced by students' backgrounds, some programs adjust scores to take into account the limited background information available, such as the percentage of students from minority groups or the percentage receiving free or reduced-price school lunches. The purpose of these adjustments is to provide a "fair" index of school effectiveness.
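One common form such an adjustment takes is a regression of school mean scores on the available background variables, with the residual serving as the "adjusted" score. The following minimal sketch illustrates that idea; the data, variable names, and the particular two covariates are hypothetical, and real programs use a variety of models:

```python
import numpy as np

# Hypothetical data: one row per school.
# raw_mean is the school's mean test score; pct_minority and pct_lunch
# stand in for the limited background variables some programs use.
rng = np.random.default_rng(0)
n_schools = 12
pct_minority = rng.uniform(0, 100, n_schools)
pct_lunch = rng.uniform(0, 100, n_schools)
raw_mean = (250 - 0.3 * pct_lunch - 0.1 * pct_minority
            + rng.normal(0, 5, n_schools))

# Regress school means on the background variables (with an intercept).
# The residual is the "adjusted" score: how far each school sits above
# or below the level predicted from its demographics alone.
X = np.column_stack([np.ones(n_schools), pct_minority, pct_lunch])
coef, *_ = np.linalg.lstsq(X, raw_mean, rcond=None)
adjusted = raw_mean - X @ coef

# Schools would then be ranked on the adjusted score, not the raw mean.
ranking = np.argsort(-adjusted)
```

Because the regression includes an intercept, the adjusted scores average to zero by construction: the index measures only deviation from expectation, which is why it is sensitive to the choice of model and covariates discussed below.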
Controversy continues about the adequacy of indices of school effectiveness, but the evidence accumulated over the past two decades suggests the need for wariness. First, such indices have been found in some studies to be inconsistent across grades and subject areas (e.g., Mandeville and Anderson, 1987), raising the prospect that the often limited scores available will measure school outputs inadequately and may lead to erroneous conclusions. Second, there is some evidence that the rankings of schools can be sensitive to the particular statistical model used to control for background information (e.g., Frechtling, 1982), although there is also evidence that they may be reasonably stable across variations within a given class of model (Clotfelter and Ladd, 1995). This model dependence of results may not be surprising, given the severity of the problems of omitted variables and inadequate measurement that confront such efforts. Third, school effectiveness indices are often unstable over time, a critical limitation in accountability systems that depend on measures of change. For example, Rowan and his colleagues (Rowan and Denk, 1983; Rowan et al., 1983) ranked 405 California schools on the basis of sixth-grade test scores after controlling for demographic variables and third-grade test scores. They classified the schools in the top quartile of adjusted scores as effective and then tracked their rankings over two additional years. The
To some extent, inconsistencies between portfolio and on-demand scores could reflect unreliability of scoring rather than substantive differences. This is particularly true of the 1992 assessment (reported in Gentile et al., 1995), in which the agreement rate among portfolio scorers dropped substantially compared with 1990 (reported in Gentile, 1992).