Important Points Made by the Speakers
- By prioritizing and organizing the questions to be addressed by evaluations into manageable units, realistic instruments and a framework for conducting the evaluation can be designed.
- Evaluations require that trade-offs be made along a number of dimensions, including the balance of independence and interdependence.
- Multiple goals for an evaluation may not be incompatible but often require different approaches.
- Evaluations can enhance their value by building in-country capacity and by involving more local participants in the evaluation.
Any evaluation effort starts by framing the evaluation. In the workshop’s opening session, five panelists discussed various approaches for this key initial step. Drawing on their individual experiences, the panelists addressed such issues as developing and prioritizing the evaluation questions, defining the audiences and intended uses for the evaluation, managing the relationship between the evaluators and the evaluands, and weighing the trade-offs involved in choosing the type of evaluator and in identifying and establishing the governance structure for an evaluation. Evaluations of large-scale, complex, multi-national initiatives are themselves going to be complex, which requires, as pointed out in the previous chapter, well-managed pursuit of discrete tasks.
Framing an evaluation starts with a small set of well-defined questions, said workshop planning committee member and panel moderator Jonathon Simon, who is Robert A. Knox Professor and director of the Center for Global Health and Development at Boston University. What often happens, however, is that evaluators are presented with “laundry lists from smart people [who are] passionately committed to issues within agencies or organizations that want to know everything about everything.” Simon gave several examples from the evaluation of the U.S. President’s Malaria Initiative (PMI), noting that the initial evaluation request from the PMI included a list of 82 questions that contained another 50 or so questions nested within that list. It is essential, then, to prioritize and organize the questions in manageable units that can be used to design realistic instruments and a framework for conducting the evaluation.
It is then necessary to consider the audiences for the results of the evaluation beyond the discrete audience of those in charge of the effort being evaluated. For the evaluation of PMI, explained Simon, the Washington Post was an audience, as was a group of think tanks that had been criticizing the initiative. A large political audience for the evaluation was more concerned with whether the PMI was working and less interested in the technical evaluations of the interventions. Financial considerations were also a factor, given that the PMI was up for reauthorization.
Within the U.S. Agency for International Development (USAID) and the Centers for Disease Control and Prevention (CDC)—the agencies in charge of the PMI—the leaders of both organizations had substantive interest in the evaluation, but the evaluation goals of the two agencies were different. An additional consideration was that the evaluation was not mandated by Congress but was commissioned by the initiative’s director. For the evaluators, an important audience was the group of people leading the country-level efforts and implementing the program on the ground. “Could we do an evaluation that actually added value or contributed to national malaria control programs, and could the country personnel actually benefit or learn from the evaluation?” Recognizing the ways in which the results will be reported and used and the legitimate needs of the different audiences for an evaluation points to the complexity of designing an evaluation, said Simon.
Regarding the relationship between the evaluators and the evaluand, Simon said that the reason he was asked to conduct the evaluation was that he was perceived to be independent of the “malaria mafia.” While that was
the case, Simon said that the evaluation was funded by grants from USAID, CDC, and other agencies of the U.S. government. To maintain the objectivity of the evaluation team, Simon insisted on operational independence. “We took control of the process once the original framing was done, and we had a set of agreements that the agencies would not see anything until they received a draft report,” he said. While it was important to maintain operational independence from the funders of the evaluation, Simon and his colleagues often had to rely on the PMI staff to gain the cooperation of the in-country teams. To illustrate that objectivity required a balance between independence and interdependence, Simon explained that while the evaluation team received 167 comments from the funders in response to the draft report, the evaluators chose which comments to address and which to reject.
Maintaining the right balance in terms of independence and interdependence ties into the issue of trade-offs. Simon identified seven trade-offs that were made in evaluating the PMI. There were methodological trade-offs in terms of setting the right mix of qualitative and quantitative methods and the use of data from routine monitoring programs. Given that the PMI is active in 15 countries, there were geographic trade-offs; in the end, the evaluators conducted site visits in 5 countries and relied on telephone interviews in the other 10 countries, resulting in some degree of selection bias. There were trade-offs in terms of which technical interventions were assessed from a functional perspective, such as indoor residual spraying versus bed nets. Another set of trade-offs involved the priority given to the various audiences, including political, financial, programmatic, and country-level audiences. Time and money were not infinite, which also necessitated trade-offs. Finally, there was the value trade-off between a perfect design and results that are useful and informative. “How you do the value trade-off is one of the key challenges that we deal with in these large, multicountry, complex evaluations,” he said in closing.
The United Nations (UN) Office of Inspection and Oversight (OID), which is one of three bodies with oversight functions at the UN, is responsible for evaluating 32 different UN programs and entities that engage in a wide range of activities, from peacekeeping operations to humanitarian and environmental programs, explained Deborah Rugg, director of the Inspection and Evaluation Division at the UN Secretariat. Her office has 22 professional evaluators on staff, largely methodologists, and it contracts with external experts for subject-matter expertise. OID reports through the UN Secretary General to the 193 member states. The fact that these evaluations are mandated gives her office both authority and funding, which makes what Rugg characterized as a huge difference in terms of participation by and cooperation from the staff of the evaluated programs.
Each year, her office conducts an average of eight assessments that examine the extent to which a program has been funded, its size, how many evaluations of the program have been conducted, whether there is a need for evaluation, and whether any current topical issues bear on that need. For example, many of the peacekeeping evaluations are driven by current political and contextual issues that need urgent attention, as well as by the capacity within the entity to conduct an evaluation itself. She noted that at one time OID evaluations were largely for accountability purposes and offered little information in terms of value, which meant that evaluations were largely noncollaborative activities. Today, Rugg and her colleagues use a partnership model that solicits input on what needs to be evaluated to answer important operational and functional questions. This partnership approach has led to increased use of the evaluation reports, she said.
The issue of independence is an important one at the UN and for the UN evaluation group, and Rugg pointed to three levels of independence. Institutional independence means that her group operates as an independent group outside of a program without an institutional direct line of report. Operational independence means that while the evaluation of UN programs is conducted by a UN office, her group sits outside of the programs that it evaluates. Behavioral independence refers to an absence of coercion by the program being evaluated or of a conflict of interest for those conducting the evaluation. “I have to prove in all of OID’s evaluations that we are not unduly influenced by the program, or more importantly, by any specific country,” explained Rugg. Some programs are evaluated more frequently than others, but on average, programs can expect to be evaluated about once every 8 years, which she said is a reasonable time frame if there also are internal embedded evaluations to answer more timely and program-relevant questions. “That’s one of the trade-offs with these large-scale, infrequent evaluations: they can address high-level issues with a global context, but they cannot drill down as effectively as one might expect,” she said. “We would like to see more internal evaluation capacity building so that they can answer specific questions on a more timely basis.”
To increase utilization of findings, evaluations now start with a 3-month inception phase in which her office holds conversations with potential users, reviews prior evaluations, and attempts to develop a better understanding of how an evaluation can be useful to program staff as well as to the UN as a whole. After completing an evaluation, her group works with the evaluated program and conducts follow-up sessions to check on implementation of any recommendations suggested by the evaluation or that are mandated by the member states. A typical evaluation takes about 1 year, which includes the 3-month inception period followed by 3 months of data collection, 3 months of analysis and writing, and a 3-month clearance process. Rugg characterized this as a short period of time, one that trades some depth of experimental and analytic design for actionable and timely results about a program.
In terms of the Joint United Nations Program on HIV/AIDS (UNAIDS), Rugg said that ongoing internal evaluations are focusing on performance monitoring, while an external, independent evaluation is conducted every 5 years and a variety of ad hoc special evaluation studies focus on specific programmatic issues. In addition, in-country residents in regional offices around the world work to support the national governments’ evaluations and capacity building.
After agreeing with the points that the previous speakers had made, Christopher Whitty, chief scientific advisor at the United Kingdom’s Department for International Development (DFID), noted that, in his view, program officials who work outside of the health care arena have not historically had much appreciation for the fact that “good ideas, passionately delivered by people to the highest quality, may not work.” As a result, outside of health care, not much value has been placed on evaluation, though he acknowledged that this situation is changing for the better. Other positive developments, he said, include the improvement in the methodologies available for performing complex evaluations and greater acceptance that mixed methods approaches, or using both quantitative and qualitative methods for data collection and analysis, are important for evaluations.
In his role as a commissioner at DFID, Whitty is on the receiving end of evaluations and has seen a number of outstanding evaluations over the past few years, including the independent evaluation of the Affordable Medicines Facility–malaria (AMFm). Most evaluations, however, have not been outstanding, and he observed that some of the reasons are on the side of those who request and fund evaluations. The biggest problem from the donor perspective, he said, is that those who commission evaluations have multiple goals for the evaluation that, while not necessarily incompatible, actually require distinctly different approaches. One goal is to provide assurance to those who pay for a given program—the British public in his case—that their money is not being wasted. A second goal is to check on the efficacy of a program and make course corrections if needed. The third goal is impact evaluation—what about a program has worked and what has not, what has been cost-effective and what has not, and what aspects can be improved in the next iteration of the program?
The problem, said Whitty, is that those who ask for evaluations often
are asking for a single evaluation that meets all three goals at the same time. “If someone asks you for all three, you have to tell them that they are different things and they are going to have to pay more and probably have to do it by at least two different mechanisms.” This discussion has to take place up-front between the person who would do the evaluation and the person commissioning an evaluation to avoid wasting time and money on pointless activities, he added. Another confounding factor is that most of the large, complex programs are conceived by what Whitty characterized as “very smart, very politically connected, and very charismatic true believers.” The resulting political realities have to be considered in the initial design discussions between funders and evaluators.
On the side of the evaluators, poorly performed evaluations are often the result of the difficulty of evaluating complex programs. “What we are talking about here is intrinsically difficult. Many of these things are really hard to evaluate.… I do not believe there is such a thing as perfect design for most of the things we are talking about in this meeting.” His test of whether a design is well conceived, he said, is whether an evaluation would force him to change his mind about a program if it did not provide the answer he expected or desired. If the evaluation is not “sufficiently strong methodologically that you are forced to change your mind,” he stated, then “you probably should not do the evaluation in the first place. That seems to me to be a common sense test.” Another problem that he sees on the delivery side is that evaluations of complex programs require teams comprising individuals with a wide range of skills, and assembling such multidisciplinary teams is difficult. How to facilitate the formation of multidisciplinary teams is “something that we as donors as well as providers need to think through,” he said.
The Global Fund to Fight AIDS, Tuberculosis, and Malaria was established in 2002 as an international financing mechanism that would help countries scale up programs that were shown to be effective in pilot studies. The Technical Evaluation Reference Group (TERG) is an independent evaluation advisory group accountable to the Global Fund’s board for conducting an independent evaluation of the Global Fund’s business model and impact, and in November 2006 the board commissioned an evaluation after the first 5-year grant cycle. Working together, TERG and the board defined three study areas that were mutually interdependent and several overarching questions for each study area.
Ryuichi Komatsu, senior advisor for TERG at the Global Fund Secretariat, explained that the first study area focused on organizational efficiency and effectiveness and addressed the overarching question of whether the Global Fund’s activities reflected its core principles, including country ownership and its actions as a financial instrument. The second study area examined whether the Global Fund’s partner environment was effective and efficient. This study area addressed two overarching questions: How effective and efficient is the Global Fund partnership at the country group level? and What is the wider effect of Global Fund partnership on a country’s health care delivery systems? The third study area looked at the impact of the Global Fund’s programs on disease by asking if there had been an overall reduction in the incidence of AIDS, tuberculosis, and malaria and what the Global Fund’s programs contributed to this reduction. TERG conducted three separate studies involving multiple countries at a cost of $16 million. The resulting evaluation, conducted by a consortium of organizations, took 3 years to complete from initial design to release of a synthesis report.
One lesson learned from this evaluation was that 3 years was a very short time frame for such a complex evaluation. As a result, Komatsu explained, there was little time for aligning the evaluation with in-country processes such as annual health department reviews or conducting national surveys. In addition, the short time frame resulted in some in-country task forces not being fully engaged in the evaluation process. Nonetheless, the Global Fund has used the results of the evaluation to create a new funding model that has been launched recently, and it has taken steps to address the evaluation’s shortcomings by emphasizing continuous smaller country-level evaluations on which to build comprehensive evaluations that can better inform ongoing grant management at the country level. This new model also reduces the logistical challenge of evaluating multiple countries, each with its own operational cycle, simultaneously.
To evaluate the impact of program scale in specific countries, TERG has decided to rely on country health-sector reviews and disease program reviews in the context of national strategies. These reviews are assisted by the World Health Organization (WHO) and UNAIDS, and they are conducted by a team of independent experts. “Results from such reviews can be practical and immediately fit into the management of grants, especially in the context of the new funding model of the Global Fund,” explained Komatsu. While it is challenging to achieve and maintain consistent quality across countries, TERG has been working with WHO to strengthen the guidance for these reviews, and it also has commissioned an independent consultant to conduct a midterm review of this evaluation strategy.
Finally, Komatsu explained that TERG now emphasizes five key principles in designing its evaluations: periodic, plausibility design, country platform, practicality, and partner approach, which means building on, collaborating with, and aligning evaluations with partners while maintaining and ensuring rigor and objectivity.
Robert Black, professor and director of the Institute for International Programs in the Department of International Health at the Johns Hopkins Bloomberg School of Public Health, compared the goals and approaches used to evaluate five different global initiatives with which he has been involved. He started by comparing the Global Fund evaluation discussed by Komatsu with the evaluation of the U.S. President’s Emergency Plan for AIDS Relief (PEPFAR). While the Global Fund evaluation looked at selected countries in the program and presented results for each country, the PEPFAR evaluation looked at the program as a whole and did not report country-specific findings. Another difference between the two evaluations was time frame—the PEPFAR evaluation was conducted over 4 years, which led to challenges in dealing with a program that was evolving as the evaluation was being conducted. Both evaluations focused on program performance, though the PEPFAR evaluation had a particular focus on prevention, care, and treatment targets as well as on how the initiative affected local health systems. In terms of who conducted the evaluation, the Global Fund’s effort was overseen by TERG and conducted by a consortium of five institutions working with in-country institutions. The PEPFAR study was conducted by the Institute of Medicine (IOM) as mandated by the U.S. Congress. Two IOM committees, whose memberships largely overlapped, were involved in the study: one for planning, the other for implementation (IOM, 2013; IOM and NRC, 2010).
In Black’s opinion, the Global Fund’s evaluations made trade-offs among country ownership, objectivity, and independence of the assessment and among rigor, timeliness, and capacity building. For PEPFAR, Black emphasized what was in his view an unfortunate trade-off: the inability to report findings specific to individual countries, which was due both to the framing of the original evaluation mandate and to the necessity of assuring country de-identification in order to receive secondary quantitative data and to maximize candor in qualitative data collection. As for independence and objectivity in the PEPFAR study, Black said that the IOM is very strong on avoiding conflict of interest among committee members, and the process was developed and carried out with complete independence, with the sponsoring organization not receiving the report until it was finalized after extensive review by outside experts. In terms of the data collection, the qualitative data were independently collected, but for the quantitative data the evaluation relied heavily on the data from the program implementers.
The Integrated Management of Childhood Illness (IMCI) initiative evaluation was a prospective evaluation of efforts in 60 nations that was conducted by a WHO advisory committee and in-country institutions with funding from the Bill & Melinda Gates Foundation. In this case, explained
Black, the UN agency responsible for the program also was responsible for the evaluation, and its focus was on quality of care, feasibility, and costs. For the IMCI assessment, the evaluation was limited to five countries selected by WHO that were thought to have the strongest implementation in order to assess the impact of the program on health. Though the evaluation was conducted by the implementing agency, there was a strong, independent advisory committee.
The retrospective evaluation of the Accelerated Child Survival and Development Program (ACSD), which operated in 11 West African nations, focused on quality of care, feasibility, costs, and in particular the impact of the program on child mortality; it was funded by Canada through the UN Children’s Fund (UNICEF). The ACSD evaluation was limited to countries that UNICEF claimed were benefiting from the program. “The fact that the independent evaluation did not find that made us very unpopular,” said Black. “Therefore, there was a great degree of discomfort with the independence of the evaluation.”
The ongoing evaluation of the 10-country Millennium Village initiative is looking at feasibility, cost, and achievement of program goals. However, there has been some concern, said Black, because the evaluation is being done by the program itself. There is an advisory committee, which he chairs, and its role is evolving. The evaluation is still being planned and finalized, he noted.
All of these evaluations, Black said, had the intent of measuring both program performance and health impact, which he characterized as a good thing. However, the feasibility and the timeline for these evaluations need to be questioned and thought through thoroughly, he said. He noted, too, that some aspects of country selection may compromise the generalizability and representativeness of the evaluation findings, and he reiterated his belief that there should be some obligation to provide feedback to the countries. He said that issues related to funding of the implementation of an evaluation also need to be thought through carefully. “In all of the examples I have seen, funding is linked in some strong way with the program, which to me compromises independence.” Black commented that the evaluations he described almost all have some kind of trade-off in their framing and design that limits what kind of evaluation can be done, for example, “in the selection of countries, the design of the implementation, the interpretation, or the control of funding.”
In the final presentation of the session, Carmela Green-Abate, the PEPFAR coordinator in Ethiopia, discussed how both the recent IOM PEPFAR evaluation and a prior IOM evaluation, conducted earlier in the implementation of PEPFAR, have been used to affect the program. She first noted that the independence of the IOM as the evaluator allowed the evaluations to have a significant impact on the program, both in terms of funding and in a change in the program’s emphasis from getting medication to those with HIV/AIDS to one of preventing infection in the first place. The need to make this change was highlighted in the first of the two evaluations, and the success in making this transformation was highlighted in the second. The first evaluation also pointed out the need to develop health system capacity, and this finding was reflected in increased funding for this type of activity in the second round of PEPFAR grants.
Green-Abate noted that the lack of country-specific findings in the evaluation was frustrating and limiting. “Going forward, I think that there are opportunities to document best practices in the evaluation,” she said. She added, however, that based in part on the second evaluation, Congress has authorized a third phase of the program. “Without these independent evaluations, I do not think that Congress would have continued to fund this program at the same level,” she said.
She also remarked that the second evaluation emphasized knowledge management, including monitoring and evaluation, innovation, and research. This has contributed to a new monitoring and evaluation framework from PEPFAR that is still being rolled out, but an enormous dilemma at the country level is alignment with the countries and the speed of the rollout. “If you really do want country ownership, you need to have time in which countries can change their health management and information systems in line and not have different systems.”
In closing, Green-Abate said that there is a real need to build capacity and involve more Africans in the program. “If you look at the trials, they are not led by Africans, and PEPFAR does not support their participation at scientific meetings. How can you expect country-level capacity to increase?” she asked. “I would suggest that the U.S. government is in a unique position to move forward in the third phase of PEPFAR to support the opportunities for innovative research and evaluation at the regional or country level in Africa.” In response to a question about what could be done to build more capacity in Africa, Green-Abate cited the Medical Education Partnership Initiative, a $10 million program funded by PEPFAR, as one approach that may work. This program designates African institutions and universities as the principal investigators, with U.S. universities serving as subcontractors. Its strength, said Green-Abate, is that it does not take talented African researchers out of their institutions and bring them to the United States, but rather leaves those excellent investigators in place where they can nurture younger investigators. “I think initiatives in which the African academic institutions are in charge, with links to the rest of the world, offer a real opportunity,” she said. During further discussion about building capacity,
Christopher Whitty from DFID added that in his opinion there is a real need to build up African institutions, but that “there are already African scientists who should be able to be at the forefront of doing this kind of work.” However, he said that while his organization often receives grant applications that list African investigators, he is frequently disappointed when the publication comes out and there are no African authors. Whitty described it as shameful when American co-investigators do not sufficiently involve their African colleagues once a grant is secured.
Sangeeta Mookherji of George Washington University asked how the field can ensure that there is independence and objectivity when it comes to analyzing and interpreting data, not just collecting data. Black cited the PEPFAR evaluation as an example in which the interpretation and analysis of data were independent, even though some of the data came from the program itself. The extensive review by outside experts helped ensure this independence. He also noted that there may be cases where program staff can provide insights that the evaluators can then respond to in their analysis, and he cited the evaluation of the PMI as an example of where program staff had a chance to comment on the analysis.
Whitty added that while independence is critical to being able to trust an evaluation’s results, it may be difficult to understand all of the details of a complex project without input from the people on the ground. The trade-off between independence and understanding is difficult, but achieving the right balance is critical. “If you go too far in either direction, you are going to fail,” said Whitty. Lori Heise, from the London School of Hygiene and Tropical Medicine, asked if independence at the beginning of smaller projects might not be counterproductive. “I would argue, at the early stages of developing novel interventions, to have less independence between researchers and evaluators so that you are actually refining the program, and treat evaluation as a partnership.”
Sanjeev Sridharan, University of Toronto, asked the panelists if they had any ideas about how to deal with evaluation timelines with large, complex programs. Rugg noted the importance of developing a plan of how to feed information to evaluators in a timely and frequent manner that can enable evaluations throughout the life of a program. “If you do that, I think then it is more palatable and [will meet] the needs of multiple information requesters.” Whitty suggested evaluating subcomponents or particular questions within a much larger scheme. “Often, when you do that, you can, within the timeline you have, plan ahead, because most complex interventions have multiple interventions that are brought at different times in different places, and they allow certain subquestions to be
answered quite well within the timelines, even if you do not have the luxury of being able to plan it right from the beginning and evaluate the whole thing as a package,” he said.
Several of the presenters discussed the challenges of large evaluations that take place over multiple years and suggested different ways to evaluate or assess specific components of large-scale initiatives in a shorter time frame to provide feedback more quickly. Green-Abate made a distinction between monitoring and evaluating. “Monitoring for me is something that we can report very quickly and we can get results,” she said, noting that she believes that PEPFAR did an extraordinarily good job of this. While monitoring was target oriented and not designed to measure impact, “in every year we could actually say exactly what had been done and how many people had been reached.” Evaluation, she said, is a high-level activity with a different purpose. Rugg cautioned that program monitoring data are important, but that “you have to know why you collect every single piece of information.” A lot of the information that PEPFAR has in its huge databases is not used for program management decision making, she noted, “and therefore I think the program is hard pressed to show the value added from the money that goes into that program monitoring.” Black added, “I would say almost any program that is worth doing is worth evaluating or monitoring. I do respect there are differences. Getting information to improve the program is important for any program.” Evaluation, he noted, does not always have to be about mortality impact or health outcome.
In response to a question from Sir George Alleyne, chancellor of the University of the West Indies and a member of the workshop planning committee, about whether all programs were evaluable, Rugg said that programs such as the UN Development Programme or WHO are evaluated not for the purpose of determining whether they should continue but to make them better, which is a different type of evaluation. Whitty said that evaluability is not a yes/no proposition but a spectrum that ranges from the obvious to the impossible. “Where you put the cutoff will depend on your resources and the importance of the question [you are trying to answer],” he said. “There are some things that are easily evaluable but probably not worth evaluating largely because they are not going to be done again.”