Judith Gueron (member, steering committee) began the discussion on advancing high-quality evaluation by listing the six key components: protecting scientific quality, producing useful results, transparency, independence, ethical standards, and funding. The keys to protecting scientific quality, she said, are having a strong evaluation team, a strong design that addresses the appropriate questions, and review procedures that protect against false claims and reinforce credibility.
Rebecca Maynard (member, steering committee) stressed the importance of agencies “taking ownership” of an evaluation and fully understanding its purpose in order to define the appropriate strategy. An evaluation may be descriptive, causal, or a measure of change, and each should be approached from a different point of view. She also noted the importance of weighing the net cost of a study against its overall effectiveness, listing this as another dimension of quality. Demetra Nightingale (Urban Institute) reiterated that the principles and practices need to allow for the flexibility to adapt guidelines and strategies as needed depending on the study.
Naomi Goldstein (Administration for Children and Families) said she was struck by the differences she heard between peer review methods of the Institute of Education Sciences (IES), which screen studies prior to release, and methods of the Millennium Challenge Corporation, which encourage extensive peer review while still promoting release of all studies. She can see value in each approach, as long as they are carried out in a climate in which the need for high quality is valued and understood. Bethanne Barnes (Washington State Institute for Public Policy) agreed that there is a place for postrelease reviews in the discussion of scientific quality, and she noted that the Office of Management and Budget’s clearinghouses conduct quality reviews of both federal and nonfederal documents. If an individual study is incorporated into a larger portfolio of work, she said, a back-end review will allow for consideration of where the study fits into the bigger picture. Clinton Brass (Congressional Research Service) raised the question of whether the definition of “scientific quality” is at all relative to the nature of the policy area, the research question being asked, or the intended use of the results.
Howard Rolston (member, steering committee), referring to Gueron’s comment about the importance of design, said that there is tension between performance management and evaluation: performance management typically pays less attention to design and makes causal claims without any explicit identification of a counterfactual, which is a real issue in the use of administrative data and would not be tolerated in high-quality evaluation. He also mentioned that there are new risks presented by growing access to administrative data, as well as a challenge to find complementary evaluation designs for these kinds of datasets.
Considering that practitioners and politicians are sometimes interested in more than just the bottom line, Gueron asked to what extent evaluations, methods, and reports should include interpretations that go beyond the highest standard of rigor in order to produce even more relevant and useful results—providing insight on resource allocation, areas for program improvement, and so on—while still protecting scientific quality. She suggested that studies could distinguish results that are relatively definitive from the less conclusive results that are suggested by a pattern of findings. Jean Grossman (Princeton University and MDRC) agreed that it is not entirely fair to taxpayers if evaluators only report on the findings that are irrefutable, when the often substantial data collection and analysis efforts have generated other potential insights. Furthermore, she said that in most cases evaluators would appreciate being given the leeway to explore the mechanisms behind results—as long as they can differentiate between which aspects of an evaluation are confirmatory and which are explanatory.
Russ Whitehurst (chair, steering committee) agreed with Grossman about the usefulness of supplemental analyses and interpretations but countered that some stakeholders expect answers to specific questions, and including findings on nonessential questions in a report may expose an agency to unintended political backlash. In the early stages of IES, Whitehurst said, the Secretary of Education and other officials were eager to receive scientifically based evidence and results to justify programs like No Child Left Behind, often requesting the results before the evaluation was completed or before the study had any evidence to provide. As a solution, Whitehurst said he and his staff established practice guides, which made statements about broader topics and graded the evidence in terms of strength, type, and source (such as expert panels).
Gueron then asked the participants how they might handle a report in which the findings vary across outcomes, subgroups, time periods, and program settings (although perhaps not to a point of statistical significance) or deviate greatly from the expected results: Should one exclude a unique finding from a report if it was not part of the initial design? Goldstein responded that one should proceed with caution. It is important to highlight these kinds of differences, give the necessary caveats, and move forward to new research questions.
Miron Straf (Virginia Polytechnic Institute and State University) said that the key often lies in differences in program implementation. He believes agencies should encourage exploration and not constrain evaluators by forcing them to stick too closely to an initial protocol, asserting that moving toward a process of continuous improvement and experimentation will allow the flexibility to really learn what works. Maynard said the best approach would be to deliver two reports: a primary report answering the impact questions and explaining the methodology, and a supplemental (possibly lengthier) report on other noteworthy issues and exploratory work. In this way it would be clear to stakeholders that the focus has shifted to a different level of evaluation or rigor of evidence.
Gueron turned to transparency. She said that clarity, full disclosure, and careful timing are all key to convincing audiences of the credibility of an evaluation and ensuring neutrality in presentation. She asked the participants to weigh in on the pressure of timing balanced against the desire to release complete results, and whether either factor could threaten future funding or lead to undue political interference. Whitehurst emphasized the need for schedules for each component—contractors, peer reviewers, professional staff, and so on—that are appropriate to the context: a project should have neither too tight nor too distant a deadline, while still allowing a cushion. He also said it is important to make evaluation data available for secondary analysis.
Rolston said that the practice of registering studies also helps to enhance transparency. Maynard agreed that registering studies and laying out standards and expectations about evaluation methods and reporting can contribute to a smoother process. Both Rolston and Maynard noted that there have been improvements in the field in this regard. Evan Mayo-Wilson (Johns Hopkins University) wondered if participants thought there could ever be a registration mechanism for health behavior and labor studies similar to that in medical trials. Maynard reported that with support from IES, the Society for Research on Educational Effectiveness is supporting development of a platform for registering causal inference studies, which is scheduled to launch in fall 2017.
Gueron next asked the group how federal agencies can reinforce independence in evaluations and protection from pressure to bias the selection of contractors or the reporting of results while also balancing their responsibility for the study and the need to gain credibility. How much flexibility should contractors have in conducting analyses, quality control, and report dissemination? Can reliance on a technical-only review protect contractors from pressure or inclination to spin the results? Mark Shroder (Department of Housing and Urban Development [HUD]) noted that for HUD, the threat of “bias” is sometimes introduced by a requirement that the agency has to pick a small business contractor over a larger company, which automatically rules out several qualified evaluators. Brass cited Eleanor Chelimsky (2008): in some circumstances there may be a tradeoff between an evaluator’s independence and an agency’s capacity to evaluate and learn: for example, it may be important for a mission-oriented unit to evaluate itself and take ownership of its learning agenda.
Gueron then inquired about how far the concept of independence extends. Is a contractor seen as an extension of the agency? Does it undercut contractors’ credibility if they do work for an agency seen as partisan? With that in mind, how does one attract the best people to do evaluation work? Whitehurst suggested that design competitions—in which the focus is not what work will be done, but how it will be done—can address this issue. Barnes, Rolston, and Ruth Neild (Institute of Education Sciences) all agreed that independence between federal agencies and contractors should not be viewed as an either/or situation; instead, it should be seen as a relationship that has to be managed throughout each project. Neild reiterated the utility of peer review technical working groups to mitigate the risk of bias.
Barnes reminded participants that unlike larger agencies with standalone evaluation offices, smaller agencies may have to handle evaluations within their specific program areas. That difference in structure has important implications for how a program manages independence and works to ensure scientific integrity. Lauren Supplee (Child Trends) commented on how difficult it was to nail down a specific standard for “evaluator independence” during a high-stakes review she coordinated: To whom does it pertain? What if a critical person has multiple roles (a funder also being a program supporter, for example)? Her team’s solution was to report each scenario in full and allow users to come to their own conclusions. Supplee added that these discussions of standards could also be valuable in the academic community.
In considering ethical standards, Gueron said she has learned over time that there does not need to be a tradeoff between rigor and ethics, and she noted the critique that random assignment, while rigorous, is too demanding in certain contexts. She cautioned about ruling out random assignment too quickly and said that it is critical to be able to build a defense against ethical objections that may arise. Maynard commented that if an agency is proposing to use a method other than a randomized controlled trial to answer questions about impact or effectiveness, it should have a compelling argument as to why randomization cannot or should not be used. She said that the first thing one should do is gather information about what stakeholders believe would be challenging or unethical about randomization and what they view as the preferred alternative—which Gueron noted is often the hardest issue to counter—and then to systematically address the concerns, including the evaluation threats associated with the alternative. Christina Yancey (Department of Labor) pointed out that, at times, the most rigorous method (randomized controlled trials, for example) can overlook small or hard-to-sample populations. In these instances, she believes that the ethical approach is to still study these groups to have information on them, even if the data obtained do not meet a certain scientific standard.
Shroder raised an ethical problem: although IES, the Census Bureau, and the Internal Revenue Service have special regulations safeguarding the use of their data, most evaluation agencies do not have similar protections. He added that because the Freedom of Information Act often takes precedence over the Privacy Act, unless a federal judge holds that an agency has shown a probability that the identities of individuals would be disclosed, the information in question must be released. Whitehurst agreed on the importance of this issue.
Turning to funding, Gueron asked: If obtaining adequate funding for evaluation is so critical, are there ways to implement evaluation policies and practices that guard against political pressures tied to funding? Constance Citro (Committee on National Statistics) reiterated the need for qualified staff, explaining that the financial issue often extends to hiring caps. She said it might benefit smaller agencies to learn from larger agencies that have had success with their evaluation policies and practices. Goldstein said that even when interest is high, hiring high-quality staff within the constraints of the federal hiring system can be difficult; however, she noted, the movement of federal employees into contracting roles and vice versa sometimes helps align practices across the two sectors. Neild and Nightingale added that certain staff gravitate to more hands-on work: keeping those staff engaged and encouraging them (particularly those coming from academia) to continue pursuing their research once they become federal employees can help agencies strike a balance when competing with contractors or academia to hire individuals with the needed technical qualifications.