Suggested Citation: "10 Lessons from Large-Scale Program Evaluation on a Not-Quite-as-Large Scale." Institute of Medicine. 2014. Evaluation Design for Complex Global Initiatives: Workshop Summary. Washington, DC: The National Academies Press. doi: 10.17226/18739.

10

Lessons from Large-Scale Program Evaluation on a Not-Quite-as-Large Scale

Important Points Made by the Speakers

Other concurrent programs in target regions can complicate the attribution of effects to a smaller-scale intervention.
All phases of an intervention can be treated as learning opportunities for evaluators.
Evaluation of smaller-scale interventions during their rollout can provide valuable cause-and-effect data.
New technologies can boost data quality and control and enable automated data harvesting and analysis.

Many of the lessons learned from evaluations of large-scale, complex multi-national initiatives can also be applied to evaluations of smaller-scale or less complex interventions. In one of the four concurrent sessions, presenters examined several of these interventions in areas of overlap with issues discussed during the rest of the workshop.

SAVING MOTHERS, GIVING LIFE STRATEGIC IMPLEMENTATION EVALUATION

The Saving Mothers, Giving Life program is a global, public–private partnership in which a consortium of six institutions, including the U.S. government, is working to reduce maternal mortality by 50 percent in four districts each in Uganda and Zambia. These two countries were chosen for phase one of this program, explained Margaret Kruk, assistant professor in health policy and management at Columbia University’s Mailman School of Public Health, because they were already committed to maternal mortality reduction, had existing strategies to reduce maternal mortality, and were supportive of accelerating their programs. She noted that, after she and her team of five researchers and four faculty were commissioned as an independent evaluator of the implementation phase of the program, it took them “2 to 3 months of intense work and many trips to the countries just to describe this highly complex program.” The goal of this evaluation, she added, was to inform the scale-up of this program.

One unusual aspect of this program was that it relied on CDC and USAID contractors who were already in country and who had worked in the PEPFAR program, which meant that the infrastructure already existed for getting the program up and running. “Almost overnight, they were able to turn around their existing programs to deliver these new services or support delivery of these services that were being delivered through government health clinics,” said Kruk.

The evaluation examined 28 discrete activities conducted by the program in four broad areas: (1) increasing demand for services, (2) improving access to services, (3) improving the quality of services, and (4) strengthening the overall health care delivery system in the target districts. Kruk noted that one confounding factor was that other programs were ongoing in the target districts that were involved in improving access to and delivery of health services, whether it was maternal health, HIV, or child health. “It’s an incredibly crowded environment in which to work both from a logistics point of view and from an attribution point of view,” said Kruk. “If there is change, what part of it comes from our program versus the many, many other things that are going on?”

The aims of the evaluation, said Kruk, were to assess the extent and fidelity of the implementation of the Saving Mothers, Giving Life interventions, to assess how the partnership was functioning as a global coalition, and to identify best practices and barriers to success to improve the effectiveness of the scale-up. For an evaluation framework, the team took a traditional implementation evaluation framework and added elements to capture systems dynamics. “It was very clear that something this large is going to have ripple effects, and there are going to be nonlinearities and all sorts of complex effects,” said Kruk. The evaluation took 1 year to complete and cost $1.6 million.

The issue of attribution was of particular interest to the evaluation team. “This is a massive program, so how dramatically does it change the trajectory of change that is already going on in Africa?” asked Kruk. “We know mortality is declining already. How much is this program shifting that curve even more?” To answer this question, the evaluation compared what the program achieved with what was happening in other districts. However, comparison districts were not included in the original evaluation design, so the team conducted post-test exit surveys with women who delivered in health facilities along with satisfaction surveys and obstetric knowledge tests to providers in both program and noncontiguous comparison districts. The evaluation was not funded to conduct a population survey, so the evaluation team conducted 80 focus group discussions with women to help identify any remaining barriers that inhibited or prevented women from using the program’s services.

Kruk commented briefly on several aspects of the quantitative analysis phase of the evaluation, pointing out, for example, that there were challenges with measuring fidelity and quality. When all was said and done, the evaluation found that Saving Mothers, Giving Life increased provider knowledge by about 10 percent in both Uganda and Zambia after a substantial amount of training and expense. “That was a lot of investment for a 10 percent knowledge gain,” said Kruk. Provider confidence, the providers’ rating of quality, and the women’s rating of quality showed marked increases in Uganda but little or no change in Zambia, though the program did increase women’s satisfaction in Zambia but not Uganda.

Why did the program work better in Uganda despite similar monetary investments? The evaluators spent a great deal of time pondering that question and realized that the districts in Uganda were contiguous with Kampala, which enabled doctors in those districts to reach out to their better trained and better equipped colleagues in the nation’s capital. In addition, Uganda made a greater investment in what Kruk called “active ingredients”: vouchers for care, “mama kits” to offset the cost of care for women, a bigger health workforce, more extensive training and mentoring, and upgraded infrastructure.

One of the most important conclusions from the evaluation was that the program is too complicated. “There is no way this 28-point model will be replicated in the same way,” said Kruk. “It’s just too big.” What the program should focus on, she said, are its active ingredients: the few things that made a significant difference when applied as a mutually reinforcing set of actions. In fact, one of the evaluation’s recommendations was that the program should think in terms of health system packages, not isolated interventions. Core health system investments, she added, create a culture of competence. The evaluation also identified so-called last mile women, those who are isolated and have the hardest time getting to a health care facility. Another recommendation was that the program needs to commit to a duration of 5 years with a transition plan that clarifies the roles and responsibilities of partners and governments. Finally, the evaluation concluded that training is not enough. “We love capacity building and training,” said Kruk. “That’s something we know how to do. It’s the backbone of global health assistance, but we don’t think it’s working well enough for the money spent, according to our findings.”

In closing, Kruk said that one recommendation an evaluator should never make is to have more evaluation. Instead, the evaluation team noted the importance of treating the next phase of the project as a learning opportunity.

AVAHAN—REDUCING THE SPREAD OF HIV IN INDIA

The goal of the Bill & Melinda Gates Foundation’s Avahan program was to demonstrate that it was possible to scale a program within target groups, with India chosen as the demonstration country because of the alarming rise of HIV in India and the inadequacy of that country’s response to the epidemic. At the time, there were no adequate models for a large program aimed at women sex workers or men who have sex with men, said Padma Chandrasekaran, previously at the Bill & Melinda Gates Foundation and now a member of the executive committee of the Chennai Angels investment group. Nonetheless, there was no doubt at the foundation that the proposed interventions would lead to impact. The other key assumption was that responsibility for the program would eventually transition to the Indian government, because it was not feasible for any private foundation to fund the program indefinitely.

Because of the enormity of the problem and the lack of infrastructure in India, the foundation committed at least $200 million to the program, with 17 percent of the money going to capacity building. The program also committed 10 percent of funds to advocacy and policy change, because the environment in India was hostile toward HIV and high-risk groups. The implementation programs, said Chandrasekaran, were large and complex and included efforts to distribute treatments for sexually transmitted infections and community mobilization to get hidden populations into clinics.

The evaluation effort consisted of separate design and implementation teams with oversight provided by an evaluation advisory group. The design group developed a detailed question tree that aimed to measure the scale, coverage, quality, and cost of services; the impact of Avahan on the epidemic in India; and the program’s cost-effectiveness. Initially, the intent was to use only program data, but the evaluators realized the necessity of using government data, too. Chandrasekaran noted that the availability of data was graduated across districts in India. For example, 4 districts had general population studies, while 29 districts had data from behavioral studies from the core high-risk groups.

Challenges included data collection among high-risk groups and getting data out for analysis, said Chandrasekaran. “Our evaluation grantees were international grantees, but they could not work in the country without local participatory institutions, and the local participatory institutions felt extremely possessive about the data that they had collected.” The solution was to create incentives to encourage data sharing, including funding for workshops, training on how to write papers, and support for journal supplements. The result, said Chandrasekaran, was that once local institutions got their first publication out, data sharing with the international grantees became an easier proposition. In addition, all international grantees shared authorship with the local institutions that generated the data.

Chandrasekaran finished her presentation with a brief discussion of the program’s scorecard. One positive point was that the program included evaluation in its mission from the start, which allowed the design team to develop a clear theory of change, a clear theory of action, and a prospective design. The program was data rich, she said, and monitoring data were used to good effect. There was effective in-country evaluation capacity building, so much so that the Indian government conducted what Chandrasekaran characterized as a good, formal evaluation of its own HIV programs once the Avahan evaluation results were released. That evaluation, she added, was published in a premier journal and influenced the design of subsequent programs. “That was something that had never happened in the country before,” she said. Finally, all of the foundation’s data have been deposited in the public Harvard University Dataverse.

As far as what could have been done better, Chandrasekaran noted the evaluation was too costly and in retrospect could have been designed to be less expensive. In addition, the evaluation effort could have provided more up-front support for the government to collect surveillance data to better support the foundation’s data collection. There were also several missed opportunities for implementing evaluations during rollout. As an example, she said that it would have been interesting to study what kind of drop-in centers work best with different target groups. “These are questions that could have had a short duration and provided cause-and-effect data,” she said in closing.

EQUIP—EXPANDED QUALITY MANAGEMENT USING INFORMATION POWER

The EQUIP program, explained Tanya Marchant, an epidemiologist at the London School of Hygiene and Tropical Medicine, implements and evaluates a quality improvement intervention that is delivered at district, facility, and community levels and is intended to get all of the actors at the district level to work together to improve maternal and child health. The program targets demand for and supply of health care for mothers and newborns simultaneously, and what is most important about the program’s approach, said Marchant, is that it “supports quality improvement with high-quality, locally generated data that is timely and available at regular periods.”

Marchant noted that EQUIP is a small-scale project compared to the others being discussed at the workshop, but that it is a critical project nonetheless. The program operates in two districts in southern Tanzania and two districts in eastern Uganda. By relying on existing infrastructure organized within district-level health systems, EQUIP was able to engage with district health management teams, which in turn were able to support subdistrict, local quality improvement processes. The conceptual framework, Marchant explained, is built on the hypothesis that district-level health facilities are the best places to target quality improvement, but that communities are the best place to affect uptake of services.

One strength of this project is that it has continuous household and linked health facility survey data from all of the districts throughout the intervention period. The data are exported into report cards so information can be reported back to the facilities, communities, and districts simultaneously. “All three of these levels of actors have access to the same evidence, all of which is about them and their environment,” explained Marchant, who added that the evidence is also fed back to the national level to foster engagement with the program.

The evaluation had four objectives: assess the effects of the intervention on the use and quality of service provision for maternal and newborn health; estimate the cost and cost-effectiveness of the intervention; assess the feasibility and acceptability of the intervention; and model the potential impact on mortality. While the continuous stream of data is key to the evaluation, contextual data is also important, and the program has a prospective contextual tracking process in place.

The evaluation has a quasi-experimental design that compares continuous household and health facility surveys in districts that participate in the EQUIP intervention to comparison districts that are not participating in the intervention in each country. The main difference between the intervention and comparison districts is that the intervention district has the quality management system with report cards generated from continuous survey data. The comparison districts receive a straightforward 100-page report annually that is a tabulation of indicators generated by the continuous survey. “I don’t think it would be ethical to do such intensive data collection and not share anything with the comparison districts, but there is no facilitation,” said Marchant. In response to a question about whether the EQUIP evaluation was an evaluation rather than a monitoring activity, Marchant said that it tested whether continuous surveying and providing feedback drives quality improvement more than just continuous surveying alone.

Expanding on the nature of the continuous surveys, Marchant said they are run in comparison and intervention districts for 30 months. They are based on an idea from the more traditional intermittent, large-scale perceptual surveys (such as DHS) that a rolling survey could have a sufficient sample size to report on a core set of indicators that the country was interested in at annual intervals. This would be accompanied by much smaller and more focused geographical analysis at more frequent intervals. One challenge to implementing this type of survey in a program such as EQUIP is that it requires a mechanism in place to support it over a 30-month period.

The household survey samples 10 clusters, with 30 households per cluster, from the entire district each month. The survey includes interviews with household heads, a household roster, and an interview with each woman age 13 to 49 about her health care and her fertility history, with a special module for any woman who had given birth in the last year. Marchant noted that the data can be aggregated for any number of consecutive months. Data collection also includes a complete census of all health care facilities in each district every 4 months with an assessment of service provision and in-depth interviews with midwives about the last birth they attended.
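The sampling figures above imply a simple sample-size calculation for any aggregation window. The following is a minimal sketch, not the EQUIP team's own tooling: the function name is hypothetical, and it assumes every sampled household completes an interview and that no household is sampled twice, neither of which is stated in the summary.

```python
def households_sampled(months, clusters_per_month=10, households_per_cluster=30):
    """Households sampled in one district over a run of consecutive months.

    Defaults follow the design described in the text (10 clusters of
    30 households per district per month). Assumes full response and
    no repeat sampling, which are simplifying assumptions.
    """
    if months < 0:
        raise ValueError("months must be non-negative")
    return months * clusters_per_month * households_per_cluster


# Windows chosen for illustration: one month, a 4-month cycle,
# a year, and the full 30-month survey period.
for window in (1, 4, 12, 30):
    print(f"{window:>2} months: {households_sampled(window)} households")
```

Under these assumptions, a single month yields 300 households per district and the full 30-month period yields 9,000, which illustrates why aggregating consecutive months is what makes district-level indicators reportable.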

Every 4 months, the project team runs the data through an automated system that calculates indicators and creates the report cards for use in the intervention. Marchant noted that automation enables the project team to generate the report cards and go back into the field to provide feedback in 4–6 weeks. The report cards are discussed by staff and community members in scheduled meetings at the district health facilities to determine what the facility and the community can do to make improvements. These are run independently of the program. “EQUIP is there just to bring the groups together and give them high-quality, local information,” Marchant said.
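The summary does not describe how the automated system works internally. As a rough illustration only, the sketch below shows one way survey records could be turned into per-district percentages for a report card. The record layout, the indicator names (`facility_delivery`, `postnatal_check`), and the sample records are all hypothetical and are not EQUIP data.

```python
from collections import defaultdict


def build_report_card(records, indicators):
    """Return {district: {indicator: percent answering yes}}.

    Each record is a dict with a "district" key and one boolean (or None
    for a missing answer) per indicator. Missing answers are excluded
    from the denominator.
    """
    # tallies[district][indicator] = [yes_count, answered_count]
    tallies = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for rec in records:
        for name in indicators:
            value = rec.get(name)
            if value is None:
                continue  # skip unanswered questions
            tally = tallies[rec["district"]][name]
            tally[0] += 1 if value else 0
            tally[1] += 1
    return {
        district: {
            name: round(100 * yes / answered, 1)
            for name, (yes, answered) in per_indicator.items()
        }
        for district, per_indicator in tallies.items()
    }


# Hypothetical records, for illustration only.
records = [
    {"district": "A", "facility_delivery": True, "postnatal_check": False},
    {"district": "A", "facility_delivery": False, "postnatal_check": None},
    {"district": "A", "facility_delivery": True, "postnatal_check": True},
    {"district": "B", "facility_delivery": False, "postnatal_check": True},
]
card = build_report_card(records, ["facility_delivery", "postnatal_check"])
for district in sorted(card):
    print(district, card[district])
```

Keeping the calculation in a single function that takes a list of indicator names is one way to accommodate indicators that are added partway through a project, since new names can be passed in without changing the aggregation logic.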

In the end, the EQUIP team found that continuous surveys are feasible to use and, if properly designed, can be managed with one data manager on each team who is supported from a distance. However, continuous surveys require continuous field work, so the team tried to avoid scheduling surveys during the rainy season or during the agricultural season. The use of personal digital assistants was incredibly important, Marchant said, because they boost data quality and control and because they enable automated data harvesting and analysis. It was also important to keep the questionnaire content up to date and internally consistent. In this case, there were few indicators of newborn health when EQUIP started, and it was important to add those indicators as the project proceeded.

One lesson the team learned regarding continuous feedback was that it required more facilitation than expected. “The people who are very good at motivating, who are committed, who are great community or facility members, are not necessarily the same people who are good at interpreting graphs and interpreting limitations and strengths of population household surveys or facility surveys,” said Marchant.

Responding to a question about the cost of continuous surveying, she said that her team budgets $20 per household, and in Tanzania the cost was $17 per household or $7,200 per district for the entire study period. Another participant noted that this is an expensive proposition given that Tanzania spends $30 per capita annually on health services.
