Every year, public and private funders spend many billions of dollars on large-scale, complex, multi-national health initiatives. The only way to know whether these initiatives are achieving their objectives is through evaluations that examine the links between program activities and desired outcomes. Investments in such evaluations, which, like the initiatives being evaluated, are carried out in some of the world’s most challenging settings, are a relatively new phenomenon. As such, it is worthwhile to reflect on the evaluations themselves to examine whether they are reaching credible, useful conclusions and how their performance can be improved.
In the last 5 years, evaluations have been conducted to determine the effects of some of the world’s largest and most complex multi-national health initiatives. On January 7–8, 2014, the Institute of Medicine (IOM) held a workshop at the Wellcome Trust in London to explore these recent evaluation experiences and to consider the lessons learned from how these evaluations were designed, carried out, and used. The statement of task for the workshop can be found in Appendix A. The workshop brought together more than 100 evaluators, researchers in the field of evaluation
1 The planning committee’s role was limited to planning the workshop. The workshop summary has been prepared by the workshop rapporteur (with the assistance of Charlee Alexander, Bridget Kelly, Kate Meck, and Kimberly Scott) as a factual summary of what occurred at the workshop. Statements, recommendations, and opinions expressed are those of individual presenters and participants; they are not necessarily endorsed or verified by the Institute of Medicine, and they should not be construed as reflecting any group consensus.
science, staff involved in implementing large-scale health programs, local stakeholders in the countries where the initiatives are carried out, policy makers involved in the initiatives, representatives of donor organizations, and others to derive lessons learned from past large-scale evaluations and to discuss how to apply these lessons to future evaluations. The workshop was sponsored by the Bill & Melinda Gates Foundation, the Doris Duke Charitable Foundation, the Wellcome Trust, and the William and Flora Hewlett Foundation.
This workshop did not attempt to provide a comprehensive review of the rich body of literature on program evaluation theory or practice (Berk and Rossi, 1999; Leeuw and Vaessen, 2009; Rogers, 2008; Rossi et al., 2004; Royse et al., 2009; Stern et al., 2012; White and Phillips, 2012), but the evaluation examples that were examined drew on an expansive array of the available evaluation methodologies and applied them in different ways to the large-scale, complex, multi-national health initiatives. As a result, they have produced a large body of experience and knowledge that can benefit evaluations of health and development initiatives. The workshop looked at transferable insights gained across the spectrum of choosing the evaluator, framing the evaluation, designing the evaluation, gathering and analyzing data, synthesizing findings and recommendations, and communicating key messages. The workshop explored the relative benefits and limitations of different quantitative and qualitative approaches within the mixed methods designs used for these complex and costly evaluations. It was an unprecedented opportunity to capture, examine, and disseminate expert knowledge in applying evaluation science to large-scale, complex programs.
This workshop report summarizes the presentations and discussions at the workshop and is intended to convey what transpired to those involved or affected by large-scale, multi-national health initiatives, including implementers, stakeholders, evaluators, and funders of initiatives and evaluations.
In her opening remarks at the workshop, Ann Kurth, professor of nursing, medicine, and public health at New York University and chair of the planning committee for the workshop, explained how the terms in the workshop’s name were being applied:
- Large-scale—The total cumulative budgets over multiple years amounting to at least hundreds of millions of U.S. dollars
- Multi-national—Implementation on a global scale, including multiple countries and regions or subregions of the world
- Complex—Encompassing multiple components, such as varied types of interventions and programs implemented in varied settings, systems-strengthening efforts, capacity building, and efforts to influence policy change
- Implementation at varied levels within partner countries through a large number of diverse, multisectoral partners, including an emphasis on local governments and nongovernmental institutions
Evaluations of four large-scale, complex, multi-national health initiatives acted as core examples for the workshop:
- Global Fund to Fight AIDS, Tuberculosis, and Malaria (GF) (Sherry et al., 2009)
- U.S. President’s Malaria Initiative (PMI) (Simon et al., 2011)
- Affordable Medicines Facility–malaria (AMFm) (Tougher et al., 2012)
- U.S. President’s Emergency Plan for AIDS Relief (PEPFAR) (IOM, 2013)
Appendix C provides a comparison of the evaluations for these initiatives. In addition, the workshop examined other evaluations of large-scale health and development initiatives along with smaller-scale evaluations that share similar features of complexity.
Evaluations need to be credible, rigorous, feasible, affordable, and matched to the priority evaluation questions, aims, and audiences, Kurth said. No single evaluation design can serve every purpose, and every evaluation must make strategic choices that fit its context and goals. But the process of designing and conducting an evaluation has key strategic decision-making points, and the available choices have different advantages and disadvantages that result in trade-offs for any given design decision. Evaluations of complex initiatives require more complicated strategic design considerations, but many of the issues discussed at the workshop are applicable to evaluations all along the spectrum of complexity.
Though the workshop sought to identify lessons learned, it was not designed to look backward, said Kurth. Rather, the underlying objective was to be “candid, open, and frank” about past experiences to create a foundation for future improvements.
In the final session of the workshop, some of the important messages over the previous 2 days were recapitulated by three experienced evaluators (Chapter 12 provides a full account of their remarks):
- Sanjeev Sridharan, director of the Evaluation Centre for Complex Health Interventions at the Li Ka Shing Knowledge Institute at St. Michael’s Hospital and associate professor in the Department of Health Policy, Management, and Evaluation at the University of Toronto;
- Charlotte Watts, head of the Social and Mathematical Epidemiology Group and founding director of the Gender, Violence, and Health Centre in the Department for Global Health and Development at the London School of Hygiene and Tropical Medicine; and
- Elliot Stern, emeritus professor of evaluation research at Lancaster University and visiting professor at Bristol University.
The workshop then closed with reflections from representatives of the four funders of the workshop—Gina Dallabetta of the Bill & Melinda Gates Foundation, Mary Bassett of the Doris Duke Charitable Foundation, Jimmy Whitworth of the Wellcome Trust, and Ruth Levine of the William and Flora Hewlett Foundation—on the major lessons and messages they were taking away from the event.
The following messages of the workshop are drawn from these speakers’ remarks. These should not be seen as recommendations or conclusions emerging from the workshop, but they provide a useful summary of some of the major topics discussed.
What Evaluations Can Do
Evaluations typically have multiple objectives, said Charlotte Watts, head of the Social and Mathematical Epidemiology Group and founding director of the Gender, Violence, and Health Centre in the Department for Global Health and Development at the London School of Hygiene and Tropical Medicine. Some evaluations are focused specifically on assessing an intervention’s impact and cost-effectiveness, but others have broader public good aspects. An evaluation may also aim to derive lessons about scaling up or replicating effective interventions, build capacity for evaluations, or strengthen networks of researchers and practitioners.
Ruth Levine, director of the Global Development and Population Program at the William and Flora Hewlett Foundation, commented that evaluations are used to hold governments, funders, and other
stakeholders accountable for the use of the resources dedicated to large initiatives, which have proliferated and have high political visibility. Similarly, Jimmy Whitworth, head of population health at the Wellcome Trust, noted that policy makers have been challenging the public health community to learn more about the effects of its interventions as a way to justify and increase investments in large-scale initiatives. To that end, evaluations of public health investments may inform not only program improvements, but also policy and funding decisions.
Sanjeev Sridharan, director of the Evaluation Centre for Complex Health Interventions at the Li Ka Shing Knowledge Institute at St. Michael’s Hospital and associate professor in the Department of Health Policy, Management, and Evaluation at the University of Toronto, observed that very few large-scale, complex, multi-national initiatives are well formed from their earliest stages and noted that evaluations can also contribute to the development of an initiative. This may require a changing relationship with evaluators over time, but it can build capacity in both the evaluation and the initiative that can lead to continual improvements.
Governance and Evaluator Independence
Gina Dallabetta, a program officer at the Bill & Melinda Gates Foundation, appreciated that the workshop focused on the larger view of evaluation, including issues such as governance. In particular, the question of how independent evaluators should be was raised by several of the participants. These initiatives are incentivized to claim success so that they can maintain high levels of resources and political commitment, Levine noted. On the other hand, Sridharan noted that program staff are generally among the most critical observers of their programs. It does not take a faraway researcher to be objective. Sridharan proposed a nuanced position with degrees of independence depending on the phase and intent of the evaluation. For an evaluation early in a project, a close relationship with an evaluator may allow for valuable input to program staff as they design or modify an intervention. A results-focused evaluation may need to achieve more independence from a program to deliver unbiased results. Elliot Stern, emeritus professor of evaluation research at Lancaster University and visiting professor at Bristol University, added that it may be possible to have different people involved in different evaluation phases to obtain the appropriate levels of independence.
Evaluation Framing and Design
Evaluations need to prioritize the questions they are asking, said Levine, which means thinking through the kinds of questions that could change the
minds of decision makers, whether within the program or at a higher level. Watts stated that understanding program effectiveness cannot be reduced to answering a closed-ended question about whether “it worked.” Perhaps the evaluation questions should be more nuanced: Can you do it at scale? Can you do it with this population? Can you sustain it? This provides a greater space for framing the questions, for designing the evaluation, and for partnership between evaluators and program staff.
A wealth of techniques and methodologies are available for evaluation, but the strength of an evaluation lies in careful design. Many of the speakers highlighted that an essential initial step is to articulate a program theory or similar framework to understand the fundamental assumptions that need to be interrogated. Watts emphasized the importance of designing a mixed methods approach to achieve a full understanding of a large-scale, complex, multi-national initiative. Qualitative and quantitative work can be nested in parallel, and qualitative and quantitative data can be triangulated to increase confidence in the evidence base for evaluation findings.
The evaluators emphasized that understanding the relationship between context and desired outcomes is critical to both intervention and evaluation design. Sridharan noted that it is best to bring knowledge of context in at the start, but also to understand how the context is evolving and adapting over time. Mary Bassett of the Doris Duke Charitable Foundation reiterated the need to devote heightened attention to context, referring to how Elliot Stern in his comments “really challenged us to unpack the notion of what context means.” Contextual issues arise on micro-, meso-, and macro-scales, and all can be important. Droughts, economic crises, and political changes are some factors that can affect the outcome of an initiative and should be tracked, but it is also important to think about how to understand the issues of leadership, power, trust, communication, and community engagement that had all been discussed, Bassett said.
Data Availability and Quality
Many types of data collection can be prospectively embedded within a program for evaluation. Gina Dallabetta, a program officer at the Bill & Melinda Gates Foundation, emphasized strengthening capacity for quality data collection in countries, especially as projects become larger and more complex. Dallabetta noted that program evaluations can be hampered by a lack of quality routine data collected within countries, reflecting a need
for management expertise to help countries collect better data, including process data, outcome data, and financial data. These data can be used both for program monitoring by management and for evaluation, but data collection and analysis need to be based on a theory of change before program implementation begins, said Dallabetta.
However, evaluators also need to do original data collection to have ways to validate the data reported by programs, Levine said. She also observed that many more sources of nontraditional data will be available in the future, such as geospatial analyses, citizen feedback, transactional data about what people are purchasing and where they are going, and sensors such as smart pill bottles.
All four of the workshop funders emphasized the importance of open data, so that the information on which conclusions are based is available to others. Levine noted that if data can be made available to others for reanalysis, this can reinforce technical quality. Dallabetta added, however, that data sometimes belong to governments or have multiple owners, which may require that one centralized place exist in a country where people can view data. Maintaining open data also requires work, such as data archiving and documentation, that donors need to build into their funding.
Using the Results of Evaluations
Though the use of an evaluation’s results can be one of the factors least within the control of an evaluation team, evaluators can enhance the use of their results in a variety of ways, said Levine. An especially promising approach is to meaningfully engage a wide range of stakeholders in an ongoing way. Evaluators also can encourage systematic follow-up of recommendations, in part by creating a culture of learning rather than one of punishment. Evaluations need adequate planning, skills, and budgets for fit-for-purpose dissemination, Levine said. Whitworth observed that the public health community also needs to do a better job of celebrating and publicizing its successes as a way of increasing support for large-scale programs. Large-scale, complex, multi-national initiatives have produced some of the biggest success stories of international health and development assistance, and those stories have been backed up by credible evidence, said Levine. Watts similarly observed that strong evaluations require resources, commitment, investments, trust, and strong relationships, but they can be tremendously beneficial for public health.
Final Reflections on Future Large-Scale, Complex Evaluations
As part of the workshop’s final session, Levine shared some thoughts about future evaluations of large-scale, complex, multi-national initiatives
as well as other evaluations that will benefit from the information shared at the workshop.
One important lesson derived from recent large-scale initiatives is how to increase the space for serious evaluations, said Levine. The public health community has a tradition of basing program design on good evidence and then learning as it goes based on additional evidence. “The potential for evaluations to actually make a difference is there,” said Levine, also observing that improving the technical quality of evaluations is a demanding task.
One important trend that will influence future evaluations is a new partnership model with the countries in which programs are being implemented and evaluated. Evaluations need mandates from governments and donors to do rigorous work, Levine observed, and they need the funding to be able to do that work. At the same time, evaluations need governance and advisory structures that are insulated from political influence.
The advocacy community can support evaluations in this regard by praising initiatives that not only do evaluations but then make use of findings to correct shortcomings. “The very same advocates who are so good at pushing for more money for global health programs can, and sometimes have been, very capable advocates for evaluating and using the findings from evaluations for more effective programming,” said Levine.
Another trend that will make itself felt in the future is an increase in “uninvited co-evaluators.” Many people have access to evaluation information who have an interest in challenging not just the program but the evaluation. As Levine said, “There is a lower barrier to entry into the conversation, and that is in some abstract way a positive thing, [but] in the day-to-day reality, it’s very challenging.”
Finally, Levine closed with a list of potential activities or steps for improvement for evaluators and funders:
- Document the stories of evaluations.
- Create greater value in global collaborations.
- Be honest about the feasibility of sponsors’ demands.
- Participate in method development and validation.
- Connect with the evaluation community outside of the health sector.
- Train the next generation.
- Embrace transparency.
- Create incentives for learning.
- Make reasonable demands of evaluators, and fund at the right level.
- Permit or require transparency.
Following this introductory chapter, Chapter 2 of this summary of the workshop describes an overview framework for evaluation design, introducing many of the topics discussed at the workshop.
Chapters 3–8 are arranged to present the major components of design and implementation in roughly the sequence that they might be addressed during the course of an evaluation, although each evaluation is different and many of these components are typically addressed or re-addressed iteratively and continuously throughout the evaluation process.
Chapter 3 discusses how evaluations are framed when choosing the evaluator, establishing the governance structure of the evaluation, and developing and prioritizing the evaluation questions. Chapter 4 examines the development of an evaluation’s design and particularly the methodological choices and trade-offs evaluators must make in the design process. Chapter 5 considers data sources and the processes of gathering and assessing data.
Chapters 6 and 7, which are drawn from two of four concurrent sessions that were held during the workshop, examine methodological and data issues in more detail. Chapter 6 looks at the application of qualitative methods to evaluation on a large scale, while Chapter 7 does the same for quantitative methods. Chapter 8 then turns to the use of triangulation and synthesis in analyzing data from multiple data sources and across multiple methods to yield a deeper and richer perspective on an initiative and increase confidence in the evidence base for evaluation findings.
Chapters 9 and 10, which are drawn from the other two workshop concurrent sessions, explore specific extensions of some of the ideas discussed earlier in the workshop. Chapter 9 looks at evolving methods in evaluation science, including realist methods and nonexperimental, observational, and mixed methods. Chapter 10 discusses how principles that are important for large-scale program evaluations can similarly be applied to complex evaluations on a smaller scale. Chapter 11 then examines how the findings and key messages of an evaluation are used and can be disseminated to diverse audiences. In Chapter 12, three experienced evaluators reflect on the messages of the workshop and how they might apply to future evaluations through the lens of a hypothetical evaluation design exercise for a fictional multi-national initiative.