Potential Strategies for Promoting Appropriate Test Use
Throughout this report, the committee has articulated principles of appropriate, nondiscriminatory use of tests for student tracking, promotion, and graduation. We have adopted a three-part framework (National Research Council, 1982) for determining whether a planned or actual test use is appropriate (see Chapter 1). We have also considered issues related to the participation of students with disabilities and English-language learners in large-scale assessments. More generally, we have recognized that high-stakes test use can produce both intended benefits and unintended negative consequences; policymakers need to be sensitive to both individual and collective benefits and costs of the different uses of tests, and they need to explore policy strategies that balance those benefits and costs.
But defining appropriate test uses and a means of identifying them is only a necessary condition: it is not sufficient to ensure that producers and users of tests will understand and follow these guidelines. The Congress recognized this fact when it asked the National Research Council, as part of this report, to consider "appropriate methods, practices and safeguards to ensure that … existing and new tests … are not used in a discriminatory manner or inappropriately …."
This chapter considers such potential methods, practices, and safeguards. The first section deals with the two existing monitoring and enforcement mechanisms: professional standards and legal action. These approaches are important but inadequate. The second section, drawing in part on research and practice in fields other than testing, explores several other approaches that, coupled with existing mechanisms, may help ensure that tests with high stakes for students are used properly. These include deliberative forums, a test-monitoring body, better information about the content and purposes of particular tests, and increased government regulation. We also consider the criteria one might use in evaluating alternative approaches. In this discussion, we again maintain our focus on the uses of tests for high-stakes decisions about individual students, recognizing that other uses of tests can also have important indirect consequences on student learning.
The committee does not recommend any specific course of action or combination of strategies. Public officials have long recognized that achieving a policy goal often requires reliance on a variety of complementary strategies. Over the past decade, policy analysts have tried to make the logic of that conventional wisdom explicit and to help policymakers think more systematically about the range of strategies and tools they have available to address any given problem.
Policy design theory posits that public policies consist of goals or problems to be solved; target populations; agents and implementation structures; rules that specify responsibilities, resource levels, and time frames; tools that provide the motivation for targets and agents; and rationales that legitimate and explain the policy logic (Schneider, 1998; Schneider and Ingram, 1997). Once goals are set, tools are chosen to change people's behavior. The motivation for change may come from the allocation of resources, the threat of sanctions, or an appeal to deeply held values.
Rarely does a policy rely on a single strategy. Most embody multiple tools that reinforce each other. Although analysts can categorize generic policy tools (e.g., Schneider and Ingram, 1990, 1993; McDonnell and Elmore, 1987), the choice of an appropriate strategy depends on a given locale's needs, resources, and political culture, as well as its past experience with similar policies. Therefore, we offer no prescription except to argue that ensuring appropriate test use requires multiple strategies. Further research should yield more detailed evidence of their relative strengths and weaknesses in different settings.
Existing Methods, Practices, and Safeguards
The professional standards that govern testing are embodied in the Standards for Educational and Psychological Testing and the Code of Fair Testing Practices in Education. Joint committees of the American Psychological Association (APA), the American Educational Research Association (AERA), and the National Council on Measurement in Education (NCME) initiated work on the Code; other associations joined the effort later.
Although professional concerns and attempts to improve test quality date back to 1895, the first formal guidelines for test development were the 1954 Technical Recommendations for Psychological Tests and Diagnostic Techniques (American Psychological Association et al., 1954; Novick, 1982). A specific set of guidelines for achievement tests, Technical Recommendations for Achievement Tests, quickly followed and served to reinforce the 1954 guidelines (National Education Association, 1955). The major aim of both documents was to define standards for informing test users in judging the utility of a given test.
The 1954 standards focused on six critical areas: dissemination of information, interpretation, validity, reliability, administration, and scales and norms. The various standards were differentiated into three categories: essential, very desirable, and desirable. The document struck a balance in defining the uses and misuses of tests and called for self-regulation in the testing community: "Almost any test can be useful for some functions and in some situations. But even the best test can have damaging consequences if used inappropriately. Therefore, ultimate responsibility for improvement of testing rests on the shoulders of test users" (American Psychological Association et al., 1954:7). The 1954 standards were well received by both professional test developers and test users, and they exerted a significant influence on textbooks, test manuals, and research (Novick, 1982).
The 1954 standards were revised and superseded by the 1966 Standards for Educational and Psychological Tests and Manuals (American Psychological Association et al., 1966), which were revised again in 1974 and in 1985. Each revision reflected the expansion of testing and a growing concern among professionals about the effects of the uses of tests. The 1985 Joint Standards reflected the influence of testing on educational policy and included chapters on the uses of tests for minorities, students with disabilities, and others. In 1998, as this report is written, the Standards are being revised again.
In an attempt to make the Standards for Educational and Psychological Tests and Manuals accessible to a broader audience of test users, the APA, the AERA, and the NCME jointly developed the Code of Fair Testing Practices in Education. The principles outlined in the code are widely held by professionals as crucial to promoting fair use of tests (Office of Technology Assessment, 1992). All testing companies that were approached explicitly endorsed the Code.
At each stage of their evolution, the Joint Standards and the Code have been written in intentionally broad terms to reflect the array of acceptable professional practices that can be used to accomplish a single goal. But their effectiveness depends on professional judgment and good will. The Office of Technology Assessment (OTA) put it this way: "the test-taker's fate rests on the assumption that good testing practice has been upheld by both the test developer when it constructed the test and the test user (such as the school) when it selected, interpreted, and made a decision on the basis of the test" (1992:68).
Compliance with the Joint Standards is voluntary for members of the AERA and the NCME, who are mainly academics and researchers. These organizations encourage compliance, in accordance with their ethical guidelines, but they lack monitoring or enforcement procedures (American Educational Research Association, 1992; National Council on Measurement in Education, 1995). The APA does have standing policies for monitoring and enforcing its ethical principles. At most, however, violations appear to result in expulsion from the organization (American Psychological Association, 1981).
Perhaps because of the reliance on self-regulation for enforcing the Joint Standards, many critics have questioned whether these documents have in fact improved test use (Haney and Madaus, 1991; Kohn, 1977; Madaus et al., 1993; National Commission on Testing and Public Policy, 1990; Office of Technology Assessment, 1992). One of the problems with professional self-regulation is its inability to inform or influence people outside the testing profession. Many of the principal users of educational test results (school administrators, teachers, and policymakers) are unaware of the Standards and are "untrained in appropriate test use" (Office of Technology Assessment, 1992). Despite the best efforts of numerous professional associations, inappropriate test use continues to be a serious problem. As a result, individuals and groups aggrieved by inappropriate practices have turned to the courts. Some people have questioned whether the Standards have any real enforcement capacity.
The National Commission on Testing and Public Policy found that "the most common way to challenge important tests is through the courts" (1990:21).
This is no accident. Federal constitutional provisions (see Chapter 3), federal legislation, and, in some instances, state laws and regulations provide some norms regarding proper use of educational tests. These norms can be enforced judicially and, in some cases, administratively. For several reasons, however, legal challenges are a highly imperfect mechanism for ensuring proper test use.
First, almost all the provisions of federal law that concern educational testing were designed to protect particular groups of students rather than the entire population. The Individuals with Disabilities Education Act (IDEA), for example, contains rules on testing students with disabilities. Title VI and Title IX regulations include provisions on tests that have disproportionate impact by race, national origin, or sex, but neither they nor the Constitution's equal protection clause covers inappropriate test uses that affect all children equally. Moreover, federal civil rights protections are far less extensive or specific with respect to educational tests (under Titles VI and IX) than they are under Title VII, which covers employment testing. The Buckley Amendment is perhaps the only federal statute that covers the test scores of all students, and it protects only the confidentiality of student records. In short, federal law is a patchwork of rules rather than a coherent set of norms governing proper test use for tracking, promotion, or graduation.
Enforcement is similarly patchy, for several reasons. First, the law is not self-enforcing; students or parents generally must file complaints, either with administrative agencies or with courts, if they believe that school officials have violated legal norms governing test use. This can happen only if students and parents know their rights and how to enforce them. If the complaints lead to lawsuits, the students and parents must have the means to obtain legal representation. These are big if's.
Second, most court decisions are not binding everywhere. For example, in a leading constitutional case on competency testing, Debra P. v. Turlington (1981), a federal appeals court ruled that "[s]tudents should have opportunities to learn the material on the tests in school [and that] [s]tudents should receive adequate notice to prepare for the tests" (Office of Technology Assessment, 1992:74). Although this decision has influenced test policy in many states and continues to do so, it is legally binding in only one region of the country. In some instances, courts in different jurisdictions face identical legal questions but reach opposite results (compare Larry P. v. Riles, 1984, with Parents in Action on Special Education v. Hannon, 1980, both of which are discussed in Chapter 3).
Third, courts vary in the degree of deference they give to the educators who are responsible for test policy and practice. Some courts pay careful attention to the testing standards: "The body of case law reveals some broad themes about how courts view tests, and some general principles about acceptable and unacceptable uses of tests. In general, courts have a great respect for well-constructed, standardized tests that are clearly tied to the curriculum" (Office of Technology Assessment, 1992:73–74).
On other occasions, however, courts have approved test uses inconsistent with the Joint Standards or the policies of the test maker. This was the case when a court sustained the use of fixed cutoff scores on the National Teacher Examination as the basis for certifying new teachers, even though the test developer, the Educational Testing Service, in an amicus brief, claimed that such use was improper (United States v. South Carolina, 1977). In such cases, courts may defer too readily to the judgments of educators who do not know about the Joint Standards or choose to ignore them.
Fourth, when a legal challenge is mounted, "the test questions often can be seen only by expert witnesses, and testimony about their quality is given in secret. Many problems associated with such publicly funded tests thus may not become public, particularly if the court challenge is unsuccessful" (National Commission on Testing and Public Policy, 1990:21).
Last but not least, court challenges are expensive, divisive, and time-consuming for plaintiffs and defendants alike. "[E]ven when court challenges succeed and compensatory damages are awarded, the cases often drag on so long that opportunities for work and learning may be denied claimants for years" (National Commission on Testing and Public Policy, 1990:22).
For all these reasons, it is important to explore alternative mechanisms for regulating test use.
Alternative Policy Mechanisms
The two existing enforcement methods outlined above—professional standards and legal action—represent two ends of a continuum of institutional mechanisms for promoting appropriate test use. In this section, the committee explores possible alternatives that might represent intermediate points on that scale, in order of increasing degree of coerciveness. These options stem from the committee's review of literature from policy sciences and draw from analogies to other policy realms. As noted above, the committee does not recommend any one option or combination of options, but we offer these as possible alternatives that could be included in a mix of strategies. These options are worthy of consideration, in the committee's view, for two main reasons: (1) they are variants of mechanisms that have been applied to other policy problems that share some characteristics with the problem of test use (e.g., information asymmetry between producers and users of tests) and (2) there is an empirical literature on the theoretical and practical implications of these options.
Noting a decline in public trust in government and growing evidence that Americans are becoming disengaged from civic life (Putnam, 1995), some theorists have proposed a politics of deliberation as an alternative to the current interest-based politics (Gutmann and Thompson, 1996; Fishkin, 1991; Bickford, 1996). Deliberative politics assumes the primacy of talk—that is, reasoned argumentation, persuasion, and consensus building—in place of power and bargaining. In a deliberative model, access to the decision-making process is open and relatively cost-free; sufficient information is available so that participants can understand how proposals affect their interests and values. All participants have equal standing in the process, regardless of their resources or social status, and issues are considered on their merits, rather than on the balance of resources available to advocates and how they are bargained.
In a deliberative model, participants' preferences are not fixed, and they are expected to change over the course of the deliberations. Deliberation need not result in consensus or agreement on a particular decision. Rather, participants can reach a mutual understanding about their commonalities and differences. When decisions are reached, however, participants are more likely to accept even outcomes with which they disagree because they feel that they have influenced the outcome.
This approach sounds attractive, but it has obvious problems. First, there are few actual examples of such an approach. The jury system is perhaps the most common deliberative forum in a public institution; New England town meetings are another example. Recently, those advocating deliberative approaches have begun to create forums in which serious public deliberation can occur. For example, as a result of a mandate by the Texas Public Utilities Commission that power companies must consult their customers, the utilities in that state have begun to use a form of deliberative polling, in which random samples of Texans meet for a weekend, learn about the issues related to energy production and conservation, and then discuss a range of options and trade-offs. The result has been that, after the process, participants expressed a greater willingness to pay more for energy efficiency and for renewable sources (The Economist, 1998). Similarly, Oregon used a series of citizen discussion sessions when the state was attempting to expand access to health insurance and had to make difficult decisions about how to balance cost, quality, and access (Marmor, 1998).
A second problem is that a deliberative process can lock in existing political, economic, and social inequalities unless extraordinary efforts are made to ensure that access is open, information is easily available, all participants with a stake in the outcome are represented, and their views are heard respectfully and considered seriously. Meeting these conditions is not impossible, but it is a tall order in a system that assumes political equality but is characterized by enduring inequalities in resources and skills.
Finally, deliberation requires time, patience, and skill. Yet many people are reluctant to invest them and lack the inclination to cultivate deliberative skills. Consequently, a deliberative model would be more likely to be effective if it were used only for those decisions that embody significant values over which there is substantial disagreement. These are contested rather than settled issues, which cannot easily be resolved by bureaucratic or expert authority. In addition, deliberation is unlikely to be successful unless there are institutional supports for maintaining and supporting the process and its outcomes. Deliberation will not be effective if it is conceived as a short-term, one-shot strategy.
Policymakers also need to recognize that deliberation is open-ended, and they may have little control over its direction or its results once the process begins. The advantage for public officials, however, is that, if they take the results of a deliberative process seriously in policy decisions, the public will be more likely to accept the outcome, thus giving policymakers added legitimacy.
How Deliberative Forums Might Work in Testing
Decisions about what constitutes appropriate test use are typically made by test developers and policymakers. But they rarely talk to each other or explain their decisions to parents and the public. One strategy for bridging the gap among technical, policy, and public perspectives on test use would be to create deliberative forums in which all the various parties with a stake in assessments would be represented.1 The purpose would be to consider key questions related to test use from a variety of perspectives. For example, what constitutes "educational quality" and "achievement to high standards"? What is the appropriate role of testing in shaping and measuring progress toward those goals? Under what conditions should test scores be used in making decisions about individual students? How much error is acceptable if test scores are used in those decisions?
Such forums might be convened at the state or local levels; they could be held under governmental auspices or more informally by such organizations as the League of Women Voters or parent and community groups. These forums could be standing groups that are advisory to official policymaking bodies, such as state or local boards of education, or they might be special-purpose groups established when a state or district is considering a major change in its testing program or in how it intends to use test results. Use of a deliberative forum would need to be combined with other strategies that embody the authority to sanction test misuse. Nevertheless, deliberative forums could enhance the design of these other policy tools by ensuring that they reflect thoughtfully constructed public preferences and by giving greater legitimacy to the policies that result. Establishing a deliberative forum acknowledges that testing is a process of political communication (Kettl, 1998), and, as such, debates over its use cannot be settled on technical criteria alone. Widespread interest in testing issues makes this an auspicious time to consider the development of such forums.
Independent Oversight Body
The complexity of consumer products and the demands for information by consumers have led to the creation of independent organizations that provide reputable, sound information about the quality and limitations of consumer options. Most notable among these organizations are Consumers Union, which publishes the widely respected Consumer Reports, and Good Housekeeping Magazine, whose seal of approval buyers look for and manufacturers covet.
George Madaus and his colleagues have proposed the creation of an independent organization to monitor and audit high-stakes testing programs (Madaus et al., 1993, 1997):
Evaluating and monitoring testing programs does mean, however, that the public which pays for such programs and those that use and are directly affected by such tests should have assurances that the programs are technically sound, that the benefits outweigh harms for all groups in society, that negative side effects are minimized, and that misuses are curtailed (Madaus et al., 1993:3).
This proposal, which would reconstitute the National Commission on Testing and Public Policy (1990), is not intended to establish a regulatory body per se, nor is it aimed at awarding a seal of approval to particular programs. Rather, it is intended to improve test use by monitoring test programs.
The proposed commission would include experts from a variety of fields and representatives of test user groups. It would establish a standing technical panel, creating other panels as needed. The commission would conduct public forums, sponsor research, hold workshops for educators and policymakers, and disseminate information through a variety of media. The commission's evaluative judgments would be based on the Joint Standards as well as other criteria, applying them in the context of their use. The goal would be to offer formative assistance, encouraging test makers and users to improve their design and implementation as part of their professional practice.
The proposed commission could supplement the labeling approach described below by providing a forum for educating the profession and the public about testing practice. It could also serve as a deterrent to inappropriate practices by creating the prospect of adverse publicity (House, 1998).
Even in conjunction with other approaches, however, this proposal should be evaluated in terms of both its potential benefits and its potential shortcomings. First, although an oversight body can certainly identify tests and test uses that are seriously flawed, there are other issues on which even testing experts disagree. Studies in which multiple expert panels independently evaluated the appropriateness of certain test practices would help to identify, and possibly to narrow, such differences. Second, there is no way to ensure that test publishers or school administrators would submit their programs for the commission's review. Last but not least, there is no guarantee that test users would abide by the commission's judgments. In many cases, political pressures to adopt high-stakes testing programs could outweigh concerns about improper test use.
These problems could diminish over time, however. If the commission proved itself trustworthy, credible, and impartial, then publishers and administrators might find that the costs of ignoring its judgment were too high. Prospective users might question why a program had not been reviewed, and critics—and potential litigants—could hold up the commission's judgments as a tool in challenging inappropriate programs. We should reiterate that the concept of such an oversight body is not universally accepted or viewed as flawless; nonetheless, it is worthy of consideration in conjunction with other policy tools.
In a number of other domains, information has been used as a policy strategy. A variety of "right-to-know" policies provide information to the public about the health risks and benefits associated with various drugs, food products, and toxins. In the case of food labels, the overwhelming majority of consumers are not in a position to ascertain the nutritional value of the foods they buy. In contrast to other attributes, such as taste and freshness, nutritional content is an area in which sellers have considerably more information than buyers, thus violating the ideal of a perfect market (Caswell and Mojduszka, 1995). Similarly, consumers often lack sufficient information to make informed decisions about complex but infrequent purchases, such as refrigerators and air conditioners (Magat and Viscusi, 1992). In these instances, giving consumers information makes the market work more efficiently. In other cases, there is evidence that provision of information can reduce hazardous behavior and/or mitigate the harmful effects of such behavior, even without sanctions that are tied to the information. For example, in a recent study of workplace safety inspections, researchers found that inspections initiated by workers reduce injuries regardless of penalty, suggesting that information can be a critical factor in effecting changes in behavior (Scholz and Gray, 1997).
The assumption behind these policies is that disclosure will correct the information imbalance between producers and consumers, enabling people to make informed purchases and to participate on a more equitable basis in public decisions. This approach is viewed as a way to give individuals the resources to choose the risks and benefits they will accept, rather than leaving the decision to government regulators (Stenzel, 1991). Labeling is often accepted across the political spectrum because it involves considerably less governmental intervention than a solely regulatory strategy. For this policy model to operate as intended, however, individuals must seek and be able to understand information about potential risks and benefits, and they must have opportunities for choice of action in response to that information (Pease, 1991).
Even though information is the primary mechanism to motivate action, a mandate is often involved in these policies as well. The information required under a particular policy may not be voluntarily offered, because its dissemination runs counter to the interests of those who must produce the information. For example, some manufacturers might not voluntarily release nutritional information, because consumers might be less likely to buy prepared foods knowing they contain a lot of additives or fats. So policies that use information as their primary strategy typically mandate its production and dissemination. Prime examples are food labeling and community right-to-know statutes requiring that the presence of hazardous materials be publicly reported. 2
Data on the effectiveness of these policies are limited, and their track record is mixed. Perhaps the most visible use of this strategy has been the warning labels that cigarette manufacturers are required to place on their products and advertising. These messages have contributed to a reduction in smoking among Americans, although other factors have also played a role. Other strategies, like an outright ban on cigarette sales to minors and prohibitions against smoking in public places, may in the long term be more effective, but labeling has been the most politically feasible and easiest approach to implement.
Some policies, such as nutritional labeling, target both producers and consumers. There is evidence, for example, that in response to labeling requirements, food manufacturers have reformulated their products to enhance nutritional value, and that better information leads consumers to change their buying habits (Caswell, 1992; Ippolito and Mathios, 1990).
Despite its potential as a relatively inexpensive, minimally intrusive policy strategy, labeling also has some significant disadvantages. Users can easily be inundated with data, with little context for interpreting it. Or the data can be presented in a confusing or inaccessible format.3
Other analysts have suggested a potentially more serious problem. In requiring the reporting of something as straightforward as a warning that a substance has been known to cause cancer, a disclosure policy may mislead the public into believing that there are simple answers to complex questions about assessing risk. In fact, most risk assessments are tentative; researchers do not fully understand, for example, the causal relationship between exposure to a substance and the incidence of cancer (Stenzel, 1991).
This same dilemma arises in the case of testing. In reducing student achievement to a single test score, there is always the danger that the public and parents will assume that this score encompasses the full measure of a student's or a school's performance. In essence, there is a tradeoff between making information understandable and accessible and ensuring that it can be validly interpreted.
A labeling strategy may also create incentives for selective disclosure and other attempts at "gaming" the system by those required to report information. Such problems could very well be exacerbated if there is no neutral, expert body to evaluate the accuracy of the information on labels.
Research on policies that rely on information and persuasive communication has focused on whether targets have a sufficient incentive to act. But there is also the question of whether they have the capacity to do what is expected of them. At one level, building the public's capacity to understand and act on the information provided can be accomplished with careful attention to the quality of that information. Using experimental and survey data, Magat and Viscusi (1992) examined various approaches to informing people about hazardous materials. They concluded that, to succeed, such policies need to take into account the specific context in which they are operating, because people are likely either to under- or overreact to information that does not also communicate the size and nature of the risk. As more information about risk is provided, they found, people tend to process and recall less information about other important aspects of a product, such as its proper use. They also found that consumers are more likely to understand and act upon information that is provided in a standardized format so they can make comparative judgments across products.
Consumers may require more information than is provided directly by the labeling policy. Consumers cannot use nutritional profiles to advantage, for example, if they are unfamiliar with the building blocks of a healthy diet or do not know how to prepare nutritious meals. Similarly, parents cannot participate effectively in educational decisions based on their children's test scores if they lack information about the available alternatives and how they fit with their children's abilities, needs, and interests. Building people's capacity and willingness to act depends on more than just disseminating information; it requires a long-term investment in learning and support.
How Labeling Might Work in Testing
It is important to clarify both the party or parties responsible for providing the information (label) and the targets of the labeling strategy. There would be two main targets of a labeling strategy. First, test developers and producers might be required to report to test users, such as public officials and educational administrators at the state and local levels, on the appropriate uses of their tests. Currently, most major test publishers (e.g., Harcourt Brace, Riverside, CTB/McGraw Hill) and nonprofit testing organizations voluntarily publish guides that describe the appropriate uses and limits of their products. But these guides, as well as technical manuals commonly provided by test publishers, are not widely available to the public, and the manuals are quite costly if purchased independently of the full testing package.
For example, the Iowa Tests of Basic Skills (ITBS) guide for administrators lists the following uses of test results as inappropriate when decisions are based solely on a test score: screening children for school enrollment, retaining students at a grade level, and selecting students for special instructional programs. As we have seen with Chicago's promotion policy, however, those with responsibility for test policy may choose to ignore the warnings of test publishers. Not only do the publishers have little recourse in such instances, but they also know that if they refuse to sell their products to those using them inappropriately, test users will simply take their business elsewhere.
As with professional standards, labeling aimed at test users would be designed to appeal to their sense of appropriate teaching and learning for the students in their care. But professional values are not always a clear guide to practice and must often be applied in light of conflicting values—such as the tension between the collective goals of public accountability and individual student needs.
One of the reasons that test users can ignore publishers' warnings is that little information about what constitutes appropriate test use ever gets to their constituents—parents, the public, and the media. Therefore, a second target of a labeling strategy might be test consumers. In such a strategy, policymakers and education officials would be required to report to parents, the public, and the media on whatever tests they chose to administer. The following kinds of information might be required about all high-stakes tests:
- the purpose of the test;
- how individual test results will be used;
- whether they will be the sole basis for a particular decision or if other indicators will be used;
- the immediate consequences of this test use for individual students, such as whether poor performance on a test will automatically result in a particular placement or treatment;
- whether the test has been validated for the purpose(s) for which it is being used and by whom;
- some indication of the degree of consistency between what is being tested and what is taught;
- a brief description of the options available to parents who want more information or who question decisions based on a test score. A toll-free telephone number could also be included.
Because the primary targets of this information would be parents, it would need to be concise, jargon-free, written in the languages that parents read, and easily understandable to noneducators. This information would be provided in a variety of formats, both well before a test is administered and when scores are reported to parents. Other direct users might include the news media, which could include such information when reporting test results.
A requirement that information be reported to parents could be based on federal (as part of the Title I and IDEA testing requirements) and state legislation. Policymakers would probably have a variety of incentives for requiring that this information be reported. It would be a minimally intrusive way to address concerns about appropriate test use and should therefore appeal to those who eschew a strong regulatory role for government. Yet a strategy of informing the public would not preclude—and in fact could trigger—the use of other kinds of enforcement mechanisms, so it would probably also appeal to those who want stronger policy levers but who see the advantages of having a range of options. In addition, this strategy would be relatively inexpensive, although not without administrative and other kinds of transaction costs (e.g., in responding to public and parental concerns).
This kind of strategy could also affect other targets besides parents and the media. For example, local schools would likely be more attentive to opportunity-to-learn issues if they knew that information about the consistency between testing and teaching would be publicly disseminated. Similarly, test developers and publishers would likely be more responsible in promoting their tests, because an information reporting policy would solve a major collective action problem for them: everyone would have to report the same information and be subject to public scrutiny about the veracity of that information, so there would be little incentive to make exaggerated claims about a particular test.
The Joint Standards and the Code of Fair Testing Practices would be used to frame a test labeling policy. Responsibility for implementing such a policy would rest with a publicly accountable institution. That requirement would not, however, preclude an agency operating under governmental authority from delegating responsibility to others. Some tasks, such as designing the reporting form and the measures used and then verifying the information provided by testing agencies and contractors, could be performed by third parties, such as the proposed National Commission on Testing and Public Policy; the National Center for Research on Evaluation, Standards, and Student Testing; the Center for Research on the Education of Students Placed at Risk; the Center for Research on Education, Diversity and Excellence; universities; and nonprofit organizations.
There are certainly differences between the labeling of tests and labeling in other spheres, such as nutrition. For example, some people might argue that information on food ingredients is more factual than information about tests. The committee notes, however, that competing claims surrounding nutrition and its relation to health status are quite numerous. Nutritional labeling has not resolved those competing claims, but it has raised public awareness, pushed advocates toward more evidence-based arguments, and led to more research.
There are at least two major shortcomings that could impede the effectiveness of the labeling strategy in testing. First, it could be an insufficient resource for those who most need it (poor parents with few political resources), and it would be unlikely to curb the most serious cases of inappropriate test use. Second, labeling tests and test results may miss the mark entirely if tests are merely reflecting accurately the fact that students have not acquired skills and knowledge because they have not received an adequate education (Schneider, 1998; Levin, 1998). For these reasons, this strategy would probably not be useful except in combination with others. Aggrieved parties would still have recourse to the courts, and efforts to equalize financial and political resources (e.g., school finance equalization, enforcement of parental rights in special education) would need to continue. Nevertheless, this strategy could significantly redress the information imbalance that now exists in testing, and it could serve as a critical mobilizing resource for those concerned about just treatment for all students.
Perhaps the most powerful tool for promoting and ensuring appropriate test use is federal regulation. Although the federal government provides a small fraction of the $300 billion the United States spends annually for precollege education, and states are constitutionally responsible for providing public education, the federal government, particularly in the past 30 years, has played a significant role in educational practice nationwide. By making federal aid contingent on the adoption of particular practices, the federal government can exert a substantial influence on practice in virtually every school district in the United States.
The use of federal regulation as a means of promoting appropriate practice in other realms is widespread. Consider traffic safety. Like education, highways are primarily a state responsibility, with the federal government providing financial support for interstate roads. That financial support gives the federal government leverage to set rules for practice on all highways, such as the former national speed limit of 55 miles per hour, which was repealed in 1995. The federal government used similar methods to raise the drinking age in all states to 21.
One possible source of regulation of test practice is Title I, the largest federal effort in K-12 education. Created in 1965, when the federal government first agreed to provide aid to elementary and secondary schools, the program was designed to "level the playing field" for disadvantaged students by providing financial assistance that would compensate such students for the advantages their peers from more affluent families enjoyed. With a current annual budget of approximately $8 billion—one-fourth of the U.S. Department of Education's total budget—the program reaches more than 6 million students in three-fourths of all elementary schools and half of all secondary schools.
Despite its relatively modest share of the education budgets in the 50 states, Title I has exerted a powerful influence on schools and school districts throughout the country. This is particularly true in the area of testing. From its inception, Title I required the use of "appropriate objective measures of educational achievement" to ensure that the program was meeting its goal of reducing the gap between low-income and higher-income students. In carrying out this requirement, states and school districts typically used standardized, norm-referenced tests both to determine eligibility and to measure gains. As a result, Title I increased dramatically the number of tests that states and districts administered (Office of Technology Assessment, 1992).
The Congress revamped Title I substantially in 1994, and perhaps the most far-reaching changes concerned assessment. The 1994 law eliminated the requirement for a separate testing program for Title I students. Instead, Title I testing was integrated into state systems aimed at holding all students, including those eligible for the federal aid program, to high standards of performance. To that end, the law required states to develop both challenging standards for student performance and assessments that measure student performance against those standards. Significantly, the current law states that the standards and assessments should be the same for all students, regardless of whether they are eligible for Title I.
Other possible sources of regulation are federal civil rights statutes such as Title VI and Title IX.
How Regulation Might Work in Testing
Can regulations under Title I or other federal statutes serve as regulatory monitors to help ensure appropriate test use? In some respects, the 1994 Title I statute and regulations already serve that objective. For one thing, the law now makes schools and school districts, rather than students, the unit of accountability. As a result, current law removes any incentives that previous versions of Title I may have provided for states to administer tests that have high stakes for individual students.
In addition, the law helps ensure appropriate test use by requiring multiple measures of performance. U.S. Department of Education guidelines recommend that "different approaches and formats be used in the assessment system. Examples include criterion-referenced tests, multiple choice tests, writing samples, completion of graphic representations, standardized tests, observation checklists, performance of exemplary tasks, performance events, and portfolios of student work" (U.S. Department of Education, 1997:25). This provision helps ensure that a single measure is not used to make decisions about individual students, schools, or school districts.
The statute also includes provisions that promote fair test use. The law requires "reasonable adaptations and accommodations for students with diverse learning needs," and the inclusion of English-language learners "to the extent practical, in the language and form most likely to yield accurate and reliable information on what they know and can do, to determine their mastery of skills in subjects other than English." Thus Title I, properly implemented, helps ensure that students with disabilities and English-language learners participate in large-scale assessment programs, and that such students are assessed in ways that are valid.
Despite these important safeguards, Title I is silent with respect to many tests that states and school districts use. They could therefore use those tests inappropriately even while complying with the federal law.
For example, to comply with Title I a state could submit a plan under which students in 4th, 8th, and 10th grades would take standardized tests and a writing assessment and would submit portfolios, to determine school and district progress in enabling students to reach state standards. At the same time, however, a school district in that state could administer a test on the basis of which students would automatically be retained in grade—an inappropriate practice under current standards of the testing profession.
The objective of assessment under Title I, holding all students to challenging standards for performance, can be undermined by improper use of tests for student tracking, promotion, or graduation. To guard against such a situation, Title I regulations could be revised to ensure that all large-scale assessments administered within a state complied with the Standards for Educational and Psychological Testing (American Educational Research Association et al., 1985) and the Code of Fair Testing Practices in Education (Joint Committee on Testing Practices, 1988). State Title I plans could address the extent to which state and local assessment systems met these professional norms.
There are other federal statutes whose regulations could be amended to include—or to reference—standards of appropriate test use. These include federal civil rights statutes, such as Title VI of the Civil Rights Act of 1964 and Title IX of the Education Amendments of 1972.
Title VI and Title IX prohibit recipients of federal funds from discriminating on the basis of race, national origin, or sex, and most disputes about testing have involved tests that carry high stakes for students. Under existing regulations, when an educational test has disproportionate adverse impact by race, national origin, or sex, the federal aid recipient responsible for the test and its use must demonstrate that the test and its use are an educational necessity (see Chapter 3). Federal regulations do not, however, define what an educational necessity is. As a result, there has been uncertainty about how to apply these rules, whether administratively or judicially.
Thus, a possible use of federal regulation to promote proper test use would involve defining educational necessity in terms of compliance with the Joint Standards and the Code of Fair Testing Practices. A high-stakes test use inconsistent with relevant provisions of the Joint Standards and the Code would not be considered educationally necessary.
Using federal regulations in this way would offer certain advantages. Most important, the regulations apply to all 50 states and nearly all school districts, because they receive federal funds. Policymakers and administrators are understandably reluctant to jeopardize this funding. Historically, Title I, Title VI, and Title IX have had an important influence on test policy and practice in the country.
In addition, federal regulations could provide a powerful tool for educating policymakers and the public about appropriate test use. Through conferences, technical assistance centers, and handbooks and newsletters, the U.S. Department of Education provides a wealth of information and support to states and districts about federal law and its requirements. Including the principles of appropriate test use in federal regulations would result in their wide dissemination and would make educators and the public much more aware of the potential risks of inappropriate practices. In conjunction with the labeling and deliberative functions described above, the regulatory approach could significantly enhance the information available to educators and the public about appropriate test use.
Relying on federal regulation—under Title I, Title VI, Title IX, or other statutes—would also make use of existing mechanisms, administrative and judicial, to enforce standards of the testing profession that, for reasons discussed above, often go unenforced. Combining professionally developed norms with existing enforcement mechanisms could help address some of the principal weaknesses of each approach: professional norms would become more enforceable, and federal authorities (administrative agencies and judges) would not need to create their own definitions of appropriate test use.
The regulatory approach also entails significant risks, however. It is a blunt instrument, and the sanctions available for failure to comply—the cutoff of federal aid—often make it unwieldy. This is because the federal government loses its leverage when it applies the penalty. The Office of Juvenile Justice and Delinquency Prevention (OJJDP) discovered this problem when it attempted to develop state mandates regarding the deinstitutionalization of status offenders (minors who commit acts that are not crimes for adults, such as underage drinking). The mandates applied to states that accepted grant funds, but if states rejected the funds, they did not have to comply with the mandates. States then put pressure on the Congress to provide the funds without requiring them to comply with the mandates, and the OJJDP did not object strongly, because it wanted the states to continue to participate in the program (Schneider, 1998).
Moreover, using federal regulation to promote proper test use would be subject to many of the usual disadvantages of administrative and judicial enforcement, as well as to disadvantages stemming from any ambiguities in the Joint Standards and the Code. These issues are likely to become formidable barriers to increased regulation, especially in today's climate that favors "devolution" and a reduced federal role.
Perhaps the most serious risk involved in the regulatory approach is the possibility of a backlash against all federal regulation, which would make it more difficult for the U.S. Department of Education to guide local practice. Disenchantment with regulation generally, as well as specific instances in which test regulation has produced unintended consequences, would need to be considered in developing effective regulatory strategies. Despite the goal of the Congress and the Department of Education to permit maximum local flexibility under federal rules, some members of Congress already consider federal law too prescriptive and want to lighten the hand of Washington over local education policy. Proposals to convert federal education aid to block grants, which states could use as they see fit rather than following federal guidelines, reflect such concerns. A regulatory approach that was seen as infringing on local prerogatives could strengthen support for such plans, and it could ultimately restrict the federal government's capacity to influence state and local practice on testing and many other issues.
Deliberative forums, an independent oversight body, labeling, and federal regulation represent a range of possible options that could supplement professional standards and litigation as means of promoting and enforcing appropriate test use. The committee is not recommending adoption of any particular strategy or combination of strategies, nor does it suggest that these four approaches are the only possibilities. We do think, however, that ensuring appropriate test use will require multiple strategies.
Given the inadequacy of current methods, practices, and safeguards, there should be further research on these and other policy options to illuminate their possible effects on test use. In particular, we encourage empirical research on the effects of these strategies, individually and in combination, on products and practice and an examination of the associated potential benefits and risks.
References
American Educational Research Association 1992 Ethical standards of the American Educational Research Association. Educational Researcher 21(7):23–26.
American Educational Research Association, American Psychological Association, and National Council on Measurement in Education 1985 Standards for Educational and Psychological Testing. Washington, DC: American Psychological Association.
1998 Draft Standards for Educational and Psychological Testing. Washington, DC: American Psychological Association.
American Psychological Association 1981 Ethical principles of psychologists. American Psychologist 36:633–638.
American Psychological Association, American Educational Research Association, and National Council on Measurement in Education 1954 Technical Recommendations for Psychological Tests and Diagnostic Techniques. Washington, DC: American Psychological Association.
1966 Standards for Educational and Psychological Tests and Manuals. Washington, DC: American Psychological Association.
1974 Standards for Educational and Psychological Tests and Manuals. Washington, DC: American Psychological Association.
1985 Standards for Educational and Psychological Tests and Manuals. Washington, DC: American Psychological Association.
Bickford, S. 1996 The Dissonance of Democracy. Ithaca, NY: Cornell University Press.
Black, E.G. 1989 California's community right-to-know. Ecology Law Quarterly 16:1021–1064.
Caswell, J.A. 1992 Current information levels on food labels. American Journal of Agricultural Economics 74(5):1196–1201.
Caswell, J.A., and E.M. Mojduszka 1995 Using informational labeling to influence the market for quality in food products. American Journal of Agricultural Economics 78:1248–1253.
Fishkin, J.S. 1991 Democracy and Deliberation: New Directions for Democratic Reform. New Haven, CT: Yale University Press.
Gutmann, A., and D. Thompson 1996 Democracy and Disagreement. Cambridge, MA: Belknap Press of Harvard University Press.
Haney, W., and G. Madaus 1991 The evolution of ethical and technical standards for testing. In Advances in Educational and Psychological Testing: Theory and Applications, R.K. Hambleton, and J.N. Zaal, eds. Boston: Kluwer.
House, E.R. 1998 Preventing Test Abuse. Memorandum to the Committee on Appropriate Test Use.
Ippolito, P.M., and A.D. Mathios 1990 Information, advertising and health choices: A study of the cereal market. RAND Journal of Economics 21(3):459–480.
Joint Committee on Testing Practices 1988 Code of Fair Testing Practices in Education. Washington, DC: National Council on Measurement in Education.
Kettl, D.F. 1998 Uses of Educational Tests. Memorandum to the Committee on Appropriate Test Use.
Kohn, S.D. 1977 The numbers game: How the testing industry operates. Pp. 158–182 in The Myth of Measurability, P.L. Houts, ed. New York: Hart.
Levin, H.M. 1998 Design and Use of Educational Tests. Memorandum to the Committee on Appropriate Test Use.
Madaus, G.F. 1992 An independent auditing mechanism for testing. Educational Measurement: Issues and Practice 11(1):26–31.
Madaus, G.F., W. Haney, K.B. Newton, and A.E. Kreitzer 1993 A Proposal for a Monitoring Body for Tests Used in Public Policy. Boston: Center for the Study of Testing, Evaluation, and Public Policy.
1997 A Proposal to Reconstitute the National Commission on Testing and Public Policy As An Independent, Monitoring Agency for Educational Testing. Boston: Center for the Study of Testing, Evaluation, and Educational Policy.
Magat, W.A., and W.K. Viscusi 1992 Informational Approaches to Regulation. Cambridge, MA: MIT Press.
Marmor, T. 1998 Policy Instruments for Testing in Schools. Memorandum to the Committee on Appropriate Test Use.
McDonnell, L.M., and R. Elmore 1987 Getting the job done: Alternative policy instruments. Educational Evaluation and Policy Analysis 9(2):133–152.
National Commission on Testing and Public Policy 1990 From GATEKEEPER to GATEWAY: Transforming Testing in America. Boston: National Commission on Testing and Public Policy.
National Council on Measurement in Education 1995 Code of Professional Responsibilities in Educational Measurement. Washington, DC: National Council on Measurement in Education.
National Education Association 1955 Technical Recommendations for Achievement Tests. Washington, DC: National Education Association.
National Research Council 1982 Placing Children in Special Education: A Strategy for Equity, K.A. Heller, W.H. Holtzman, and S. Messick, eds. Committee on Child Development Research and Public Policy. Washington, DC: National Academy Press.
1997 Educating One and All: Students with Disabilities and Standards-Based Reform, L.M. McDonnell, M.J. McLaughlin, and Patricia Morison, eds. Committee on Goals 2000 and the Inclusion of Students with Disabilities, Board on Testing and Assessment. Washington, DC: National Academy Press.
Novick, M. 1982 Ability testing: Federal guidelines and professional standards. In Ability Testing: Uses, Consequences, and Controversies. Part II: Documentation Section, A.K. Wigdor, and W.R. Garner, eds. National Research Council. Washington, DC: National Academy Press.
Office of Technology Assessment 1992 Testing in American Schools: Asking the Right Questions. Washington, DC: U.S. Government Printing Office.
Pease, W.S. 1991 Chemical hazards and the public's right to know: How effective is California's Proposition 65? Environment 33(10):12–20.
Putnam, R.D. 1995 Bowling alone: America's declining social capital. Journal of Democracy 6(1):65–78.
Schneider, A.L. 1998 Policy Tools for Addressing the (Mis)use of Standardized Tests. Memorandum to the Committee on Appropriate Test Use.
Schneider, A.L., and H. Ingram 1990 The behavioral assumptions of policy tools. Journal of Politics 52(2):511–529.
1993 Social constructions and target populations: Implications for politics and policy. American Political Science Review 87:334–347.
1997 Policy Design for Democracy. Lawrence, KS: University Press of Kansas.
Scholz, J.T., and W.B. Gray 1997 Can government facilitate cooperation? An informational model of OSHA enforcement. American Journal of Political Science 41(3):693–717.
Stenzel, P.L. 1991 Right-to-know provisions of California's Proposition 65: The naivete of the Delaney Clause revisited. Harvard Environmental Law Review 15:493–527.
The Economist 1998 Democracy in Texas: The frontier spirit. 31(May 16).
U.S. Department of Education 1997 Guidance on Standards, Assessments, and Accountability. Washington, DC: U.S. Department of Education.
Debra P. v. Turlington, 474 F. Supp. 244 (M.D. Fla. 1979); aff'd in part and rev'd in part, 644 F.2d 397 (5th Cir. 1981); rem'd, 564 F. Supp. 177 (M.D. Fla. 1983); aff'd, 730 F.2d 1405 (11th Cir. 1984).
Individuals with Disabilities Education Act, 20 U.S.C. section 1401 et seq.
Larry P. v. Riles, 495 F. Supp. 926 (N.D. Cal. 1979); aff'd, 793 F.2d 969 (9th Cir. 1984).
Parents in Action on Special Education v. Hannon, 506 F. Supp. 831 (N.D. Ill. 1980).
Title I, Elementary and Secondary Education Act, 20 U.S.C. sections 6301 et seq.
Title VI, Civil Rights Act of 1964, 42 U.S.C. sections 2000d et seq.
Title VII, Civil Rights Act of 1964, 42 U.S.C. sections 2000e et seq.
Title IX, Education Amendments of 1972, 20 U.S.C. sections 1681 et seq.
United States v. South Carolina, 445 F. Supp. 1094 (D.S.C. 1977); aff'd per curiam sub nom. National Education Ass'n v. South Carolina, 434 U.S. 1026 (1978).