Christine Fortunato (Administration for Children and Families [ACF]) followed Connie Citro’s introductory remarks by emphasizing the importance of the workshop, stating that developing an infrastructure to guide federal evaluations and support high-level principles across offices and administrations has been a long-term goal of several federal agencies. She said that such an infrastructure would also help ensure that evaluations are conducted and their results disseminated without bias or undue influence. Fortunato said that the federal government has taken several steps to foster the credibility of scientific evidence, including Statistical Policy Directive 1 from the U.S. Office of Management and Budget (OMB),1 the Information Quality Act,2 and the creation of evaluation policy statements by several federal agencies. Several of those documents prescribe the core principles of rigor, relevance, transparency, independence, and ethics, which she explained would be focal points of the workshop discussion.
Steering committee chair Grover “Russ” Whitehurst stressed how essential it is for the federal government to have a strong evaluation effort, marked by rigor and independence, enabling agencies to provide accurate
1 Fundamental Responsibilities of Federal Statistical Agencies and Recognized Statistical Units. Available: https://www.federalregister.gov/documents/2014/12/02/2014-28326/statistical-policy-directive-no-1-fundamental-responsibilities-of-federal-statistical-agencies-and [May 2017].
2 Section 515 of the Consolidated Appropriations Act, 2001 (Pub. L. 106–554).
and timely information to decision makers. Using an anecdote about an early federal evaluation in which he was involved, Whitehurst described how the results of federal evaluations are often received with trepidation and even embarrassment when they call into question the effectiveness of a program or contradict a desired outcome or popular choice. In 2005, a randomized controlled trial of the 21st Century Community Learning Centers after-school program found that the program did not improve participants’ academic skills and actually increased their misbehavior (e.g., being suspended from school). This news came much to the chagrin of then-Senator Arlen Specter, then-Governor Arnold Schwarzenegger, and a host of other program grantors, community members, and advocates. They questioned the quality and relevance of the evaluation rather than accepting the importance of its findings for decisions about the future direction of the program.
Whitehurst said that while federal evaluation is a more mature, more secure, and less isolated endeavor than it was in previous years, the field still has a long way to go. He cited a 2013 report from the U.S. Government Accountability Office (GAO): it found that fewer than half of the 24 agencies surveyed conducted any evaluations of their programs at all, and only 7 had a centralized leader with responsibility for overseeing evaluation activities. Another issue, Whitehurst noted, is that evaluations often do not reach their most important audiences: according to that same GAO report, more than half of senior government leaders had no experience with evaluation of the programs for which they were responsible. He encouraged participants to consider the history of federal program evaluation, its current status, and the possible development of more formal principles and practices.
Moderator Howard Rolston (member, steering committee) began the session on the history of federal program evaluation by praising the field for the progress made over the past 50 years. He noted that the continuous growth and improvement in evaluation can lead observers to think these advances are simply a product of “the inevitable march of science”: moving forward, learning more, and progressing through innovation. He cautioned, however, against taking that progress for granted: although the overall trend has been toward more and better evaluations, there have been setbacks and points at which the field has come under threat. Rolston noted that much of the past decades’ progress has resulted from the individual efforts of committed federal staff, philanthropic funders, academics, and advocates for evidence-based policy. He has seen the trend move toward creating institutionalized
structures that can protect the quality and dissemination of evaluation findings—one element of that trend being the formulation of evaluation principles with practices in place to support them.
Larry Orr (Bloomberg School of Public Health, Johns Hopkins University) started his presentation by sharing the two overarching questions the panelists decided were most useful to address in terms of the history of evaluation: What have been the major challenges to the federal government in generating and using rigorous independent research? What circumstances over time have reduced or exacerbated vulnerabilities in evaluation work? He noted three challenges and discussed how they have evolved over time: resources for research and evaluation, resistance to rigorous evaluation, and convincing policy makers to use evaluation results.
In terms of resources, Orr said that in the 1970s, when he was director of the Office of Income Security Policy Research, then a part of the Office of the Assistant Secretary for Planning and Evaluation (ASPE) in the U.S. Department of Health and Human Services, his unconstrained operating budget was $25 million, roughly equivalent to $100 million in 2016 dollars. By comparison, he found that ASPE’s entire 2016 budget was $56 million. (He did note that the decrease may be due in part to a transfer of several ASPE research responsibilities to another office, which had a 2012 operating budget of $107 million, still only slightly more than his inflation-adjusted budget of four decades earlier.) Orr surmised that there has been limited progress in increasing resources for program evaluation and that it is still grossly underfunded, by an “order of magnitude.” By comparison, he noted, medical researchers spend $30 billion a year to conduct 10,000 clinical trials (a form of evaluation); in social policy, by contrast, billions of dollars are spent on programs but significantly less on finding out whether those programs work.
Orr said that he sees resistance to rigorous evaluation both in the government and, surprisingly, in the research community. He cited Fighting for Reliable Evidence (Gueron and Rolston, 2013) as an account of the challenges of incorporating random assignment experimentation into social policy research. He also discussed how Equality of Educational Opportunity (Coleman, 1966) essentially turned the education community against quantitative evaluation for several decades: the prevailing theory at the time was that to understand education one had to look at the micro level at the success of individual school systems, which is not helpful in setting national policy. The establishment of the Institute of Education Sciences in 2002 changed that thinking, and its What Works Clearinghouse has now identified nearly 600 well-conducted randomized trials of education programs and practices. Orr noted that the field of international development also initially resisted evaluation, but since 2000 there have been approximately 1,700 randomized controlled trials in developing countries:
770 of those were conducted by the Poverty Action Lab at the Massachusetts Institute of Technology, whose mission is “to reduce [international] poverty by ensuring that policy is informed by scientific evidence.”3 He sees this as a clear indication of progress for evaluation.
Convincing policy makers to act on research results remains one of the field’s biggest challenges, Orr said. He reminded the group that evaluation is only one of many factors that influence policy, and it currently plays a very small part. In Show Me the Evidence: Obama’s Fight for Rigor and Evidence in Social Policy (Haskins and Margolis, 2015), he noted, there is a pie chart depicting all of the factors that influence social policy: among the categories, which include advocacy groups, committee staff, and news media, the slice for research is one of the smallest, at just 1 percent. Orr believes, however, that the role of evaluation will continue to grow because of an increasing number of congressional mandates for rigorous research and because the Congressional Budget Office has established rigorous analysis as a standard in policy making. He also noted OMB’s efforts toward the increased use of rigorous evidence, namely the Bush administration’s Program Assessment Rating Tool (PART) and the Obama administration’s evidence-based policy initiatives.
Jean Grossman (Princeton University and MDRC) spoke about the challenges she faced both from within the federal government as an evaluation officer and in her role as a federal contractor. She noted three main issues: politics, money, and regulations. Politics, she said, is “the elephant in the room”: evaluators are constantly fighting political pressure and a reluctance to hear or release evaluation results that do not align with a program’s original expectations. Many policy makers and others view evaluation as a way of determining whether a program works, she said, when it is really about determining whether a program works better than something else. Grossman recalled wishing, when she was chief evaluation officer at the U.S. Department of Labor, that there had been a safe way or space in which to conduct evaluations without the looming fear of defunding, an environment centered more on continuous improvement.
As an evaluator, Grossman recalled instances in which results were not released because they did not reinforce expectations, and on rare occasions she even felt pressure from a funder to reword an evaluation to better align with current policy. She believes that the withholding of reports happens less now that evaluation agencies register their evaluations and publicize their reports’ due dates. Political pressure can occasionally prove beneficial, however, when it is used to inquire about an evaluation and to ask for the public release of information.
Political timing is another factor, Grossman said: the typical 4-year time horizon of most policy makers often requires that programs be evaluated within that window, which is frequently too soon after a program’s inception for a meaningful evaluation. With all the iterative changes that occur in the initial months of a program’s launch, the services provided to participants in a randomized sample may differ from those that were originally planned to be evaluated.
With regard to money, Grossman pointed out that only a small subset of federal funds goes directly to program evaluation—sometimes less than 0.5 percent for an agency. It is usually the case that programs do not get evaluated unless money has been earmarked or set aside specifically for that purpose. Seeking approval to use other administrative funding for evaluation can be difficult when, as she noted, the sentiments around evaluation and its uses are often negative.
Grossman described how regulations also constrain program evaluation; the biggest constraint she faced was the clearance process required by the Paperwork Reduction Act,4 which OMB administers. While the target turnaround for OMB approval is 3 months, the average is 7–9 months. Considering that it may take a few more months to start a program and to develop an intake form or a baseline survey, a year could elapse before participants are enrolled and staff are able to collect critical baseline information. Since most evaluation contracts last less than 5 years, this is often time that an evaluation cannot spare. As a result, the Paperwork Reduction Act makes it difficult to obtain the baseline data needed for a thorough comparison and effectively limits the work that can be done. The cost and work hours required to complete an OMB clearance package can also be prohibitive for some contractors.
Ron Haskins (Brookings Institution) talked about his experience with the 21st Century Community Learning Centers: the findings from the Mathematica evaluation showed that the program did not affect student outcomes,5 but those findings were met with much resistance, both from academics and from politicians who strongly advocated for the program. Then a candidate for California governor, Arnold Schwarzenegger cited strong community support and anecdotal evidence to justify his support for the program; to Haskins, however, any claim akin to “everybody knows this program works” is an enemy of evidence-based policy.
4 “The purpose of the Paperwork Reduction Act (PRA), which governs information collections, is to minimize paperwork, ensure public benefit, improve Government programs, [and] improve the quality and use of Federal information to strengthen decision making, accountability, and openness in Government and society”: www.doleta.gov/ombcn/ombcontrolnumber.cfm [May 2017].
5 The $1.2 Billion Afterschool Program That Doesn’t Work. Available: www.brookings.edu/research/the-1-2-billion-afterschool-program-that-doesnt-work [May 2017].
Haskins stressed the importance of having a statute that requires an evaluation when establishing or appropriating money for a program. Going a step further and adding statutory language about random assignment can also prove very useful: evaluation language added to the Welfare Reform Act of 19966 improved the utility of the resulting programs and the quality of the data collected for evaluations. Similar language was also added to the Senate legislation for the World Bank. When he conducted interviews with senior Hill staffers for Show Me the Evidence (Haskins and Margolis, 2015), he found that many of them knew about random assignment.
To show how the conversation on evaluation has evolved in just a short time, Haskins highlighted an excerpt from the Affordable Care Act (“Obamacare,” 42 U.S.C. 711) on early childhood home visiting programs, which specifically calls for evaluation through rigorous randomized controlled research designs. Haskins said that the Brookings Institution will release results from the largest such evaluation to date in 2018, drawing on data from home visiting programs across the country, with findings on program implementation and impacts based on multiple randomized controlled trials. Under the same legislation, the Office of Adolescent Health released 41 evaluations of local teenage pregnancy prevention programs in fall 2016, most of which were random assignment studies.
Rolston circled back to his point about how the campaign for rigorous evaluation was initially spawned by invested individuals and their common interests, but that it has evolved into something more institutional. When he invited comments from participants, Judith Gueron (member, steering committee) emphasized the need to take the potential for future threats to program evaluation very seriously. She talked about how Reagan administration officials viewed most social science researchers as left-wing ideologues whose work was not objective or specific. As a result, they drastically reduced both program and evaluation budgets, which led to reductions in the evaluation workforce and unfinished studies. However, this reduction led to a shift to foundation-funded studies, which, because of their independence from government entities and the use of strong methodology (including randomized controlled trials), became widely accepted within both major political parties. Evaluators used an effective and unbiased communication strategy that educated various audiences and spurred community support for rigorous research. Gueron encouraged the group to keep this in mind as it considered how to fortify the principles and practices of federal program evaluation.
Lauren Supplee (Child Trends) added two points about staffing. First, she noted that human resource challenges and regulations within the federal government can make it difficult for evaluation offices to hire quality staff. Second, she noted the need for capacity at the level of senior leadership in program offices to understand scientific evidence and be able to identify its uses and limitations. She believes that it is critical to focus on these staffing needs.
Sandy Davis (Bipartisan Policy Center) echoed Gueron’s point about the effectiveness of analysis in a political setting. He added that independent evaluations in any field are not going to be successful in an intense political setting unless the evaluation offices have a known history of being objective; that objectivity can prove useful when speaking to politicians and members of Congress. Davis noted that while there often are other political forces at play when it comes to decision making, it is important that evaluations and evidence have a seat at the table. He also stressed how important it is for the evaluation to be conducted independently and in a timely manner.
Haskins agreed with Davis that there will always be people opposed to evaluation and reiterated the need to include evaluation requirements in legislation. He said that appropriations committee members and their staffs do not like to hear that a program they sponsored does not work or does not produce a major impact, yet that is the evaluation result for over 80 percent of social programs. He said the focus should be on solving a problem, not saving a program.
Rebecca Maynard (member, steering committee) spoke about her experience with the evaluation of an abstinence program (Devaney et al., 2002) and how she and her colleagues used a strong technical working group to help them navigate difficult political challenges. She said that there was initial bias in the evaluation community against working on such a seemingly taboo topic, but the study changed her view of research. With regard to abstinence, Maynard said, it is clear that abstinence is a sure way to prevent pregnancy; the question was whether teaching abstinence without information on alternative methods, which is what the policy implied, was more effective than teaching abstinence alongside other contraception measures. Maynard said she and her colleagues designed the study to be a “win-win”: focused neither on abstinence nor on comprehensive sexual education, but on the differences in outcomes under the abstinence policy compared with the status quo. The focus was on the health and welfare of the children, not on the success of the program; consequently, both proponents and opponents of abstinence-only policies would have an interest in the results. She said that that kind of objectivity and respect for the design of the evaluation was imperative.
Whitehurst wrapped up the history session by summarizing what he heard as the three main points coming out of the discussion. First, it is
important to include evaluation practices in legislation. Second, evaluations should be conducted with objectivity, and Congress and other stakeholders should keep their direction with regard to the purpose of an evaluation at a broad level rather than specifying detailed questions that may be impractical or impossible to answer. Third, evaluation agencies should have access to funding that is adequate to carry out high-quality evaluations that are linked programmatically so as to produce knowledge that will be useful in the long term.