Design Considerations for Evaluating the Impact of PEPFAR: Workshop Summary

Suggested Citation: "4 Designing an Impact Evaluation with Robust Methodologies." Institute of Medicine. 2008. Design Considerations for Evaluating the Impact of PEPFAR: Workshop Summary. Washington, DC: The National Academies Press. doi: 10.17226/12147.

4 Designing an Impact Evaluation with Robust Methodologies

This chapter summarizes workshop discussions on methodological issues related to impact evaluation design for the President's Emergency Plan for AIDS Relief (PEPFAR) and is divided into three sections. In the first section, a diverse set of case studies of conceptual models and methodological approaches is presented from previous large-scale evaluations: those of the World Bank, the Abdul Latif Jameel Poverty Action Lab at the Massachusetts Institute of Technology (Poverty Action Lab), the UK Department for International Development (DFID), the Cooperative for Assistance and Relief Everywhere, Inc. (CARE), and The Global Fund to Fight AIDS, Tuberculosis, and Malaria (The Global Fund). In the second section, methodological challenges and opportunities of impact evaluation are described for the measurement of outcomes and impacts specific to human immunodeficiency virus/acquired immunodeficiency syndrome (HIV/AIDS), for the measurement of more general outcomes and impacts, for attribution and accounting, and for the aggregation of impact results. The third section summarizes themes common to the approaches.

Conceptual Models and Methodological Approaches: Case Studies

Impact evaluations require the development of a conceptual model. The model must be defined, the inputs and outcomes measured, and assumptions and conversion factors determined. For prevention of mother-to-child transmission of HIV (PMTCT), noted speaker Sara Pacqué-Margolis of the Elizabeth Glaser Pediatric AIDS Foundation, there is a clear, logical pathway between access to services, counseling and testing, test results, prophylaxis by women and infants, and aversion of infections. Assumptions and conversion factors to be determined for PMTCT can include questions like the following: What regimens are taken and how effective are they? Are they actually consumed and when? What is the rate of transmission during labor and delivery? What is the rate of prevention of infections in HIV-negative women who come in for counseling? What is the level of infection transmitted through breast milk? Speaker Carl Latkin of the Johns Hopkins School of Public Health cautioned that although models of change are needed to guide interventions, sometimes they don't explain findings. Models are practical heuristics but should not be blinders, he noted; we should not let models narrow the way we look at change.

Impact evaluations also require the use of methodological approaches. These can include quantitative, qualitative, and participatory methods and theory-based program logic. Examples of impact evaluation methods, provided by speaker Mary Lyn Field-Nguer of John Snow, Inc., include client satisfaction interviews and surveys, exit interviews, mystery clients, targeted intervention research, focus groups, and key informant interviews.

The following case studies describe the experiences from evaluations of five HIV/AIDS assistance programs run by the World Bank, Poverty Action Lab, DFID, CARE, and The Global Fund. Conceptual models and different evaluation methodologies are described in the context of each study.
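The PMTCT pathway described earlier in this section (access to services, counseling and testing, test results, prophylaxis, infections averted) can be read as a chain of conversion factors applied one after another. The following is a minimal sketch of that reading, not a model presented at the workshop. Every name and every numeric value in it is a hypothetical assumption chosen only to make the arithmetic visible, and it ignores factors the speakers raised, such as breast-milk transmission and timing of consumption.

```python
from dataclasses import dataclass


@dataclass
class PmtctAssumptions:
    """Hypothetical conversion factors; none of these values come from the workshop."""
    tested: float                  # share of women reached who are counseled and tested
    hiv_prevalence: float          # share of tested women who are HIV-positive
    received_result: float         # share of HIV-positive women who receive their result
    took_prophylaxis: float        # share of those women who actually take the regimen
    transmission_untreated: float  # mother-to-child transmission rate without prophylaxis
    regimen_efficacy: float        # relative reduction in transmission from the regimen


def pmtct_infections_averted(women_reached: int, a: PmtctAssumptions) -> float:
    """Multiply through the pathway from access to services to infections averted."""
    positive_women_on_regimen = (
        women_reached * a.tested * a.hiv_prevalence * a.received_result * a.took_prophylaxis
    )
    # Infections expected without prophylaxis, scaled by how much the regimen reduces them.
    return positive_women_on_regimen * a.transmission_untreated * a.regimen_efficacy


example = PmtctAssumptions(
    tested=0.80,
    hiv_prevalence=0.10,
    received_result=0.90,
    took_prophylaxis=0.75,
    transmission_untreated=0.30,
    regimen_efficacy=0.50,
)
averted = pmtct_infections_averted(10_000, example)
print(f"Infant infections averted under these assumptions: {averted:.1f}")
```

Because the result is a product, an error in any single conversion factor carries through proportionally to the final estimate, which is one reason the assumptions listed above need to be determined explicitly.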
World Bank Evaluation of HIV/AIDS Assistance Programs

Workshop speaker Martha Ainsworth, lead economist and coordinator of the Health and Education Evaluation Independent Evaluation Group at the World Bank, described the approach and methodologies used in an independent evaluation of the World Bank's HIV/AIDS assistance programs. The evaluation assessed $2.5 billion of World Bank investments in HIV/AIDS prevention, care, and mitigation programs between 1988 and 2004 in 62 developing countries. Two objectives of the evaluation were defined: (1) to evaluate the development effectiveness (or relevance, efficiency, and efficacy) of HIV/AIDS assistance in terms of lending, policy dialogue, and analytic work at the country level relative to the counterfactual, or absence of a Bank program, and (2) to identify lessons to guide future activities.

Ainsworth shared the World Bank's experience in prioritizing what to measure in evaluation. Although the World Bank has a large portfolio of complementary programs in education and agriculture, indicators were narrowed down to only those with direct HIV/AIDS outcomes and impacts. In addition, identifying how lessons from completed assistance were still relevant to new approaches posed a challenge, given that three-quarters of the HIV/AIDS assistance programs being evaluated were still in progress. In assessing a long-term, ever-changing implementation approach over time, therefore, the World Bank evaluation was designed to select those issues that were common to all projects, such as political commitment, setting strategic priorities, multisectoral responses, ministry of health role, use of nongovernmental organizations (NGOs) in implementation, and monitoring and evaluation (M&E). The World Bank evaluated the projects completed in the past and examined those issues relevant to ongoing projects. Through this approach, the assumptions and design of the ongoing portfolio were analyzed and prospectively evaluated. The World Bank was able to consider design issues and point out where risks had been mitigated and where problems could be addressed through midstream adjustments.

The World Bank evaluation drew on a number of methodological approaches. As Ainsworth noted, the World Bank does not rely exclusively on a single source of information, but rather uses different types of evaluations already occurring in the context of the work, such as midterm reviews, completion reports, and annual reviews. Evaluation methods used include the following:

• Results chain documentation: Inputs, outputs, outcomes, and impact of government, the World Bank, and other donor efforts were gathered.
• Time lines: The documentation of timing of efforts was collected, although in many activities this type of M&E information is lacking.
• Interviews: Some information was elicited from interviews of stakeholders, other donors, people and staff involved on the ground, and government implementers.
• Desk work: The following were collected and analyzed: literature reviews; archival research; interviews on the time line of World Bank response; an inventory of analytic work; a portfolio review of health, education, transport, and social protection sectors; and background papers on national AIDS strategies.
• Surveys: Surveys were conducted of staff members, audiences for analytic work, project task team leaders, and country directors.
• Field work: Project assessments and case studies (chosen to reflect different levels of experience and where interventions worked or did not work) were collected and reviewed. For example, a project in Indonesia, canceled because the World Bank intervention occurred before anyone was visibly ill, was chosen for the evaluation, as was a project in Russia, where only policy dialogue and analytic work were conducted.

Use of Randomized Controlled Trial Methodologies to Evaluate HIV/AIDS Programs

Rachel Glennerster, executive director of the Abdul Latif Jameel Poverty Action Lab at the Massachusetts Institute of Technology, described the application of randomized controlled trial methodology to HIV/AIDS program evaluation. She described the advantages and disadvantages of randomized trial methodologies and then discussed the results from two case studies in which randomized methods were used: an evaluation of an HIV education program in Kenya and an HIV status knowledge program in Malawi.

Advantages and Disadvantages of Randomized Evaluations

To know the true impact of a program, one must be able to assess how the same individual or group would have fared with and without an intervention. Because it is impossible to observe the same individual in the presence or absence of an intervention simultaneously, comparison groups that resemble the test group are commonly used. Common approaches for selecting comparison groups include a "before and after" approach, in which the same group of individuals is compared before and after exposure to an intervention, and a "cross-sectional" approach, in which, at a single point in time, a group of countries or communities in which an intervention has occurred is compared to a "non-intervention" group. However, programs are usually started in particular places at certain times for a reason, and they are usually established with the countries, communities, schools, and individuals most committed to action. Therefore, estimates of program impact may be biased because it is difficult to find a comparison group that is as committed as the places where the program was established. This may in part explain why projects typically work well in a few places, but fail when scaled up.
In randomized controlled trials, like medical clinical trials, those who receive the treatment and the control group are selected randomly. By construction, those who receive the proposed new intervention are no more committed, no more motivated, no richer, and no more educated than those in the control group. Randomized trials produce results that are freer from bias than other epidemiological studies. Randomized evaluations can be used to test the efficacy of interventions before they are eventually scaled up to the national level.

Randomized trials conventionally have been used to look at drug effectiveness, but are also being applied to other areas where they are not commonly used. For example, randomized trials can be used to investigate social patterns, such as what messages are most effective in changing the sexual behavior of young girls.
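The contrast drawn above between a cross-sectional comparison with self-selected program sites and a randomized comparison can be illustrated with a small simulation. This is a sketch under stated assumptions rather than an analysis from the workshop: the "commitment" variable, the outcome equation, and the true effect of 1.0 are all hypothetical.

```python
import random
from statistics import mean

TRUE_EFFECT = 1.0  # assumed program effect, in arbitrary outcome units (hypothetical)


def simulate(assign, n=20_000, seed=0):
    """Return the difference in mean outcomes, program group minus comparison group."""
    rng = random.Random(seed)
    program, comparison = [], []
    for _ in range(n):
        commitment = rng.random()  # unobserved commitment to action, between 0 and 1
        gets_program = assign(commitment, rng)
        # Commitment improves outcomes on its own, with or without the program.
        outcome = 2.0 * commitment + TRUE_EFFECT * gets_program + rng.gauss(0, 1)
        (program if gets_program else comparison).append(outcome)
    return mean(program) - mean(comparison)


def self_selected(commitment, rng):
    """The program is established where commitment is highest."""
    return commitment > 0.5


def randomized(commitment, rng):
    """A coin flip decides who gets the program, regardless of commitment."""
    return rng.random() < 0.5


naive_estimate = simulate(self_selected)
randomized_estimate = simulate(randomized)
print(f"true effect:            {TRUE_EFFECT:.2f}")
print(f"self-selected estimate: {naive_estimate:.2f}")
print(f"randomized estimate:    {randomized_estimate:.2f}")
```

In this setup the self-selected comparison attributes the advantage of more committed sites to the program and overstates the effect, while the randomized comparison recovers a value close to the assumed effect because commitment is balanced across the two groups by construction.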

There is a perception that randomized evaluations are difficult both to implement and to integrate with what is going on at the ground level, but with innovations in randomization over the past 10 years, randomized studies have become less intrusive and less like formal clinical trials. Several mechanisms exist to introduce randomization more naturally into the way a government or an NGO works on the ground, including the following:

• Lottery: Randomization can be introduced through a lottery if a program is oversubscribed.
• Beta testing: Randomization can be introduced through small-scale experimentation of methods before scaling up to the national level.
• Randomized phase-in over time and space: Capacity or financial constraints may limit the ability to introduce interventions in all communities immediately. The order in which a program is phased in can be randomized, allowing for an assessment of effectiveness to be made during the phase-in period.
• Encouragement design: Often, national programs that are up and running do not have 100 percent adoption; the impact of such programs can be evaluated by randomly encouraging some people to participate in the program.

Several of these mechanisms simultaneously help to address some of the ethical questions surrounding randomized design, namely the exclusion of people from having access to care or programs that might save their lives. In the randomized phase-in approach, all individuals will ultimately benefit from the intervention; under the encouragement design, no one is denied care.

A disadvantage of randomized evaluation is that it cannot be done after the fact; it must be implemented with the program. Institutional constraints are another disadvantage to randomized evaluation that sometimes make it more difficult to engage with partners in an intensive way.
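The randomized phase-in mechanism listed above amounts to shuffling the units and rolling the program out in the shuffled order. A minimal sketch follows; the community names, the number of waves, and the seed are hypothetical choices for illustration, and a real design would usually also stratify by region or size.

```python
import random


def randomized_phase_in(communities, n_waves, seed=2008):
    """Shuffle the communities and deal them into phase-in waves of near-equal size."""
    rng = random.Random(seed)  # a fixed seed keeps the assignment reproducible and auditable
    order = list(communities)
    rng.shuffle(order)
    # Wave i takes every n_waves-th community starting at position i.
    return [order[i::n_waves] for i in range(n_waves)]


communities = [f"community_{i:02d}" for i in range(1, 13)]  # hypothetical names
waves = randomized_phase_in(communities, n_waves=3)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {sorted(wave)}")
```

During the phase-in period, communities assigned to later waves can serve as the comparison group for those in earlier waves, and every community still receives the program by the final wave.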
One workshop participant noted that randomized controlled trials can be difficult to translate from the individual level to the community level, where interventions are more complex. Glennerster acknowledged that randomized controlled trials can be improperly designed and can thereby generate incorrect results.

Using Randomized Trials to Evaluate HIV/AIDS Education Programs in Kenya

Randomized trial methodology was used to evaluate a Kenyan HIV/AIDS education project, a collaborative effort among the government of Kenya, a local NGO, U.S. universities, and Jomo Kenyatta University in Kenya. The method was used in randomly chosen schools to test a range of education strategies for their effectiveness in getting children to understand messages about the risks of HIV. These strategies included training teachers in a new HIV/AIDS education curriculum, reducing education costs to encourage young girls to stay in school, holding debates about whether or not to teach about condoms in primary schools, holding essay competitions about protection from HIV, and telling children about relative infection rates by age, including the dangers of sexual, gift-exchanging relationships between young girls and older men (sugar daddies), the greater likelihood of older men to be infected than younger men, and the greater likelihood of girls to be infected than boys.

Upon implementation of each program, the evaluation tracked observed changes in behavior, including school dropout rates, marriage, pregnancy, and childbirth, as determined by community interviews. Follow-up studies are also tracking HIV infection rates under each type of intervention. Results from the trial are shown in Figure 4-1.

FIGURE 4-1 Impacts of alternative HIV/AIDS education strategies on girls' behavioral outcomes.
NOTE: ´ indicates that the difference with the comparison group is significant at 10 percent.
SOURCES: Duflo et al., 2006, and J-PAL, 2007.

The teacher training in the national curriculum had little effect on school dropout rates, marriage, and childbirth, although girls from schools where the training was conducted were more likely to be married if they had a child, and there was a slight effect on increasing tolerance of those with HIV in schools that underwent the training. Reducing the cost of education was found to be an effective strategy for reducing dropout, marriage, and childbirth rates. Education programs about the dangers of sexual relations with older men, or sugar daddies, led to a 65 percent drop in pregnancies or childbirths with older men and no increase in childbearing with younger men. Self-reported data indicated a shift from having relationships with older men to having relationships with younger men. Self-reported data from the boys in the group indicated increased condom use, potentially because boys had learned that girls were much more likely to be infected than boys. Results of the debate and essay interventions remain to be tested with outcome data; currently, only self-reported data exist, which can be very biased.

On the basis of the costs of the interventions, the evaluators were able to calculate a cost-per-childbirth-averted rate for each intervention, with the education program about older men being the most cost-effective intervention, at $91 per childbirth averted, compared to $750 per childbirth averted for interventions to reduce the cost of schooling.

Using Randomized Trials to Evaluate HIV Status Knowledge Programs in Malawi

Although half of HIV/AIDS prevention spending in Africa focuses on HIV testing, many of those tested do not come back to pick up their results. A study conducted in Malawi used randomized evaluation to test the impact of campaigns promoting knowledge of HIV status (Thornton, 2007).
Only 40 percent of those tested for HIV returned to collect their results, but the study showed that a small incentive (only 10 to 20 cents, or a small fraction of the daily wage) was enough to increase results collection by 50 percent. The study went on to test whether or not knowledge of status changed behavior. In follow-up interviews with those who had and had not received encouragement to pick up their test results, people were given the opportunity to buy subsidized condoms and the money to buy them. In comparing the treatment group (those encouraged to and therefore more likely to know their status) with the control group (those who were not encouraged and thus less likely to know their status), the study found that the knowledge of HIV status had virtually no impact on whether people purchased subsidized condoms, even when they were given the money to buy them. Only HIV-positive individuals in long-term partnerships were more likely to buy condoms if they knew their status, and few bought subsidized condoms.
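In an encouragement design such as the one described above, the randomized comparison is between encouraged and non-encouraged groups, not directly between people who do and do not know their status. One common way to translate the first comparison into an estimate of the second is the Wald (instrumental variable) ratio: the difference in outcomes divided by the difference in uptake. The sketch below illustrates that general technique only; it is not presented as the analysis used in the Malawi study, and all four input values are hypothetical.

```python
def wald_estimate(outcome_encouraged, outcome_control, uptake_encouraged, uptake_control):
    """Estimate the effect of uptake on the outcome for people moved by the encouragement.

    Numerator: intention-to-treat effect (difference in mean outcome between arms).
    Denominator: difference between arms in the share who took up the service.
    """
    intention_to_treat = outcome_encouraged - outcome_control
    uptake_difference = uptake_encouraged - uptake_control
    if uptake_difference == 0:
        raise ValueError("encouragement did not change uptake; the effect is not identified")
    return intention_to_treat / uptake_difference


# Hypothetical inputs: share buying condoms and share collecting results in each arm.
estimate = wald_estimate(
    outcome_encouraged=0.23,
    outcome_control=0.20,
    uptake_encouraged=0.70,
    uptake_control=0.40,
)
print(f"Estimated effect of learning one's status on condom purchase: {estimate:.2f}")
```

The ratio relies on the assumption that the encouragement affects the outcome only through uptake. A cash incentive could plausibly affect purchases directly, so the assumption would need to be argued for in any real application.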

Glennerster cautioned that if randomized methodologies are not used and if studies survey only the sample that returns for test results, it may appear as if knowledge of status is effective in reducing HIV incidence. A randomized methodology allows researchers to tease out proper attribution for the perceived success of a program. Glennerster also noted that the use of plausible correlation approaches without doing a full trial (suggested by workshop speaker Paul De Lay of the Joint United Nations Programme on HIV/AIDS [UNAIDS] as a more practical methodology applicable to work at the country level) can also lead to the wrong policy conclusion. With millions of dollars being invested in knowledge-of-HIV-status programs, it is worth testing whether they are effective in reducing incidence, she concluded.

DFID Evaluation of the National HIV/AIDS Strategy

Speaker Julia Compton, senior evaluation manager of the Evaluation Department, DFID, described a recent evaluation of the UK national HIV/AIDS strategy, "Taking Action," a comprehensive and far-reaching $3 billion, 5-year effort launched in 2004, which included a substantial overseas investment component. This national strategy cuts across the UK government and involves six priority areas. The following four objectives were defined for the evaluation:

• Developing recommendations for improving implementation
• Developing recommendations for how to measure success: indicators
• Developing recommendations for a future UK strategy on HIV and AIDS
• Developing recommendations for other UK government strategies

Through an extensive consultative process, DFID identified 13 evaluation questions focusing on inputs and processes specific to decisions, for example, the usefulness of spending targets and the effectiveness of country-led approaches.

The evaluation used several methodologies.
Seven case studies of countries were conducted and three working papers were developed to gain an understanding of spending, M&E frameworks, and challenges in reaching women, young people, and vulnerable groups.

The evaluation was a heavily consultative process; in fact, the process of communications and consultations during the evaluation may have had greater impact on changes in the strategy than the actual evaluation data, remarked Compton. The process of evaluation motivated DFID to make changes needed to achieve positive results. Compton cautioned that concentrating too narrowly on the data, at the expense of communication and understanding what policy makers want, may result in missed lessons from evaluation.

A major challenge to the DFID evaluation was the declining quantity and quality of data collected at projects in-country. Because DFID relies heavily on country-led approaches and country systems to collect data, this was a major constraint to the evaluation.

CARE Evaluation of Women's Empowerment Programs

Kent Glenzer, director of the Impact Measurement and Learning Team at CARE, described the approach and methodology of a multiyear evaluation of the impact of women's empowerment interventions. The evaluation is a $500,000 effort assessing interventions at field sites in more than 40 countries, plus 900 other projects through secondary data. This evaluation is being conducted to inform organizational change at CARE, a private, international humanitarian organization with a focus on fighting global poverty.

CARE uses a literature-based theory of social change and defines the concept of empowerment as a process of change in women's agencies, social structures, and relations of power through which women negotiate claims and rights. CARE's approach for evaluating complex systems, such as women's empowerment, involves bringing together experts (internal, external, and local) and coupling M&E with project implementation. In CARE's experience, local actors know and understand systemic changes better than external experts; therefore, CARE's role is to bring actors, most importantly women and girls, together over the long term to discuss systems changes, develop hypotheses, and build collective knowledge about change.

CARE is tracking change across 23 categories of women's empowerment.
Indicators, including those developed by local men and women, are developed at multiple levels for each category and include measures of individual skills or capabilities; measures of structures such as laws, family and kin practices, institutions, and ideologies; and measures of relational dynamics, such as those between men and women and between the powerful and less powerful. Although the indicators differ across the sites, broad patterns can be compared relating to where and how change is happening.

The following attributes of a successful evaluation approach, from the perspective of CARE, were outlined:

• Evaluation is a long-term learning experience that should unite relevant actors.
• Evaluation should be flexible enough so that different dependent variables can be specified in different contexts, but should be designed to permit comparison of variables across contexts.
• Centrally planned, mixed-method evaluation designs work best.

The Global Fund Evaluation

Stefano Bertozzi, member of the Technical Evaluation Reference Group of The Global Fund, described a 5-year evaluation plan for The Global Fund, which will focus on 8 countries in depth, plus 12 others using secondary information. The evaluation is a "dose-response design," meaning it will look for correlations between intensity of project implementation and changes in trends of the HIV/AIDS epidemic in terms of survival of infected individuals and prevention of new infections. The plan includes evaluation of the following three major topics:

• Organizational efficiency: Operations, business model, and governance structure in The Global Fund, which are based on technical reviews of country-generated proposals with little country presence other than auditing firms, will be evaluated.
• Partnership environment effectiveness: Country and grant performance will be evaluated, including the effectiveness of mobilization of technical assistance and effectiveness of country-coordinating mechanisms.
• Health impact: The health impact of The Global Fund on the three diseases it covers (HIV/AIDS, TB, and malaria) will be evaluated.

Macro International Inc., Harvard University, the World Health Organization (WHO), and Johns Hopkins University are implementing the evaluation, and data collected by Macro International through Demographic and Health Surveys-Plus (DHS+) will serve as the baseline assessment. The limited budget of the evaluation will not permit the conduct of large-scale surveys.
Note: Demographic and Health Surveys including HIV prevalence measurement are known as "DHS+."

Methodological Challenges and Opportunities in Evaluating Impact

Workshop participants described methodological challenges and opportunities in evaluating the impact of PEPFAR, including those in measuring outcomes and impacts specific to HIV/AIDS, measuring broader impacts and outcomes, attributing results, and aggregating the results of impact evaluation. The discussions were wide-ranging and touched on many challenges and opportunities, but were by no means an exhaustive or prioritized list of considerations or an in-depth analysis of any one of them.

Measuring HIV/AIDS-Specific Outcomes and Impacts

HIV/AIDS-specific outcomes and impacts include the measurement of HIV prevalence, incidence, infections averted, mortality rates, development of drug resistance, orphanhood prevention, behavioral change, and stigma and discrimination. Workshop participants described methodological challenges and opportunities in each of these areas.

Measuring Change in HIV Prevalence

HIV prevalence is the proportion of individuals within a population infected by HIV during a particular time. It is a function of both the death rate of those already infected and the rate at which new infections occur.

Repeated surveillance of pregnant women at antenatal clinic (ANC) sentinel sites is currently the most common method for measuring changes in HIV prevalence. Workshop speaker Theresa Diaz of the U.S. Centers for Disease Control and Prevention (CDC) pointed out some of the challenges and limitations of using this approach. Comparison with nationally representative household-based surveys shows that the ANC surveillance method tends to overestimate prevalence, she said, because ANCs are predominantly urban. In addition, the ANC methodology does not take into account other factors, such as the change in use of clinics over time, increased survival, or immigration, which can lead to a change in HIV prevalence. The method is also unreliable for measuring prevalence in areas where epidemics are concentrated in high-risk groups, such as Vietnam.

Diaz noted that a number of new tools are now becoming available to analyze prevalence trends more effectively.
CDC uses a suite of methods (chi-square, linear, trend, linear regression, and nonparametric methods) for analyzing prevalence trends using only the most consistent ANC sites and the most recent data. In addition, a second population-based survey of HIV testing will soon be available in some countries to allow analysis of HIV prevalence over time. The collection of data on antiretroviral (ARV) use, both from ANC sentinel surveillance surveys and from the population-based surveys, would allow better prevalence data to be collected, in addition to data on coverage. Finally, methods such as respondent-driven sampling are being standardized for collecting HIV sero-prevalence data among high-risk groups. When such methods use the same sampling methodology in the same place over time, trends can be observed.
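Of the trend methods mentioned above, linear regression is the simplest to show concretely: it fits a straight line to prevalence measured at the same sites over successive years and reads the slope as the average yearly change. The following is a minimal ordinary least-squares sketch, not CDC's procedure. The function name is illustrative and the five yearly prevalence values are hypothetical.

```python
def linear_trend(years, prevalence):
    """Ordinary least-squares fit of prevalence against year; returns (slope, intercept)."""
    n = len(years)
    if n < 2 or n != len(prevalence):
        raise ValueError("need at least two matching (year, prevalence) observations")
    mean_x = sum(years) / n
    mean_y = sum(prevalence) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    if sxx == 0:
        raise ValueError("all observations are from the same year")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, prevalence))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical prevalence (percent) pooled across a fixed set of consistent ANC sites.
years = [2001, 2002, 2003, 2004, 2005]
prevalence_pct = [12.0, 11.5, 11.0, 10.5, 10.0]
slope, intercept = linear_trend(years, prevalence_pct)
print(f"Average change in prevalence: {slope:+.2f} percentage points per year")
```

A slope fitted this way describes only the sites included. As noted above, restricting the analysis to consistent sites and a consistent sampling method over time is what makes the trend interpretable.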

Measuring HIV Incidence

Workshop participants described some of the challenges of various methods for measuring HIV incidence, which is the number of new HIV infections within a population at risk over a given period of time. Measuring incidence is difficult, noted Diaz, because symptoms of HIV do not appear until years after infection.

Cohort studies (longitudinal studies of HIV acquisition in a particular group of people) are considered the "gold standard" for measurement of HIV incidence, noted Diaz; however, they may not always reflect the true population incidence, particularly if interventions are taking place within the cohort. Speaker Geoff Garnett of Imperial College, UK, noted a further disadvantage: there may be substantial loss of cohort participants to follow-up. Discussant Timothy Fowler of the U.S. Bureau of the Census (BUCEN) stressed that another major gap in HIV incidence measurement is the lack of empirical data on incidence by age and sex. Laboratory assays (specifically, the branched gp41 peptide, or BED, test) can also be used to distinguish recent infections from long-term infections on the basis of relative levels of anti-HIV antibodies, noted Diaz, but they tend to overestimate the proportion of most recent infections under certain circumstances, for example, in people who have taken ARV drugs immediately before the test.

Finally, the potential for spread of infection and future infection can be measured through modeling techniques, noted speaker Garnett. Factors such as contacts within the population, duration of infectiousness, transmission probability, heterogeneity in risk, mixing patterns, and different types of contact may be included in the model. However, modeling tools are limited by their inability to measure accurately the risks within populations in order to determine timing within an epidemic.
Models may not predict reliably near the threshold at which the spread of infection becomes epidemic, where the system is more sensitive to small changes.

Speakers described a number of improvements in methodologies for measuring incidence that may provide future opportunities. Diaz mentioned that CDC has developed an adjustment formula for the HIV incidence laboratory assay on the basis of several populations and is now validating the approach in other populations. CDC is also developing a second laboratory assay to increase the specificity of HIV testing so that those individuals who contribute to "false-recent" test results can be more easily excluded. New developments in modeling tools can help to measure incidence. With the availability of the population-based HIV sero-survey data (DHS+), stated Diaz, new models based on these surveys will be able to provide important age-, sex-, and geography-specific incidence information that the ANC surveillance data cannot provide. In fact, noted Garnett, a promising method is to calculate incidence by accounting for mortality in successive prevalence (DHS+) surveys. He also said that national incidence can be calculated indirectly by fitting models to prevalence data and back-calculating incidence on the basis of data on the survival of HIV-infected people.

Measuring Change in Infections Averted

Speaker Rand Stoneburner, an independent consultant, defined "infections averted" as the difference between expected and actual annual incidence, as shown using modeling techniques that can be empirically validated. BUCEN has been charged with estimating the number of new HIV infections that are prevented during the first 5-year phase of PEPFAR using a population projection model that takes both population change and HIV/AIDS impact into account, said discussant Fowler. The data and assumptions behind the model are taken from the UNAIDS reference group on estimates, modeling, and projections, noted Fowler, and consider factors such as the survival of HIV-positive individuals and whether or not they receive antiretroviral therapy (ART). The BUCEN model has been used to generate baseline estimates of numbers of new cases for the years 2005–2010 and will compare those to what actually occurs, noted speaker Diaz.

But measuring a "nonevent" such as change in infections averted can present significant methodological challenges, observed Diaz. When measuring infections averted in the general population, she noted, one has to assume that people would not have died of HIV infection before dying of other causes and would not have contracted HIV at a later time, but these may not be valid assumptions. Although modeling has been used to some extent to measure the effect of interventions on averting infections, it has limitations because of the gaps in data available in developing countries, she noted.
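Stoneburner's definition of infections averted, the difference between expected and actual annual incidence, reduces to a year-by-year subtraction once a baseline projection and observed estimates are in hand. The sketch below shows only that bookkeeping step. It does not reproduce the BUCEN projection model, and the yearly figures are hypothetical placeholders.

```python
def infections_averted(expected, actual):
    """Return per-year and cumulative differences between projected and observed new infections."""
    if expected.keys() != actual.keys():
        raise ValueError("expected and actual must cover the same years")
    yearly = {year: expected[year] - actual[year] for year in sorted(expected)}
    return yearly, sum(yearly.values())


# Hypothetical new infections per year: baseline projection versus what was later observed.
expected_new_infections = {2005: 1000, 2006: 1050, 2007: 1100}
actual_new_infections = {2005: 950, 2006: 900, 2007: 880}

yearly, total = infections_averted(expected_new_infections, actual_new_infections)
for year, averted in yearly.items():
    print(f"{year}: {averted} infections averted")
print(f"cumulative: {total}")
```

The subtraction is only as credible as the baseline: every assumption built into the expected series, including the ones Diaz questioned above, carries directly into the result.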
Speaker Garnett also pointed out that large data gaps exist specifically in the areas of efficacy measurement in different epidemiological contexts and in the translation of efficacy to large-scale interventions. For example, noted Diaz, CDC's methodology for predicting infections averted in newborns by comparing mother-to-child transmission of HIV with or without preventive ARVs does not consider epidemiological context factors such as breast-feeding practices, efficacy differences among programs and countries, adherence and proper use, and impacts of counseling.

Alternative models, population surveys, and cross-country comparisons were put forward by workshop participants as possible opportunities for more effectively measuring changes in infections averted. Fowler mentioned that an update of the Spectrum model, which uses prevalence data to calculate past incidence, considers epidemiological contextual factors such as breast-feeding regimes and different ART regimes (UNAIDS, 2007b). Speaker David Gootnick of the U.S. Government Accountability Office cited another model developed by the Futures Group that can be used to attribute infections averted to specific interventions, such as partner reduction and male circumcision (Stover and Bollinger, 2006). Stoneburner referred to the use of serial HIV population surveys to complement modeling approaches and provide further evidence to deduce changing incidence and infections over time. For example, a population survey done in Uganda in 1987-1988 and again in 2004-2005 was used to show reduced HIV prevalence and incidence in the younger generations over time (Stoneburner et al., 1996). Cross-country comparative analyses of HIV dynamics, risk behaviors, and intervention uptake, noted Stoneburner, can also provide insight regarding the relative effectiveness of interventions.

Measuring Changes in Survival and Mortality Rates

Workshop participants identified a number of challenges in measuring changes in survival (the percentage of a group who are alive for a given period of time after diagnosis or treatment) and in mortality rates (the proportion of deaths in an area compared to the population of that area per unit of time). Speaker Diaz pointed out that measuring changes in mortality rate in the general population has raised questions about whether ART decreases mortality, through increasing chances of survival of HIV-infected individuals, or increases mortality, through increased opportunities for viral transmission to others. She added that "lives saved" and mortality rates may not actually be appropriate outcome measures given that HIV/AIDS treatments may not actually save life, but only delay death.

Other challenges of measuring change of mortality relate to gaps in available data. Many patients are lost to follow-up, and frequently population- and hospital-based mortality data exist but are overlooked, noted speakers Diaz and Stoneburner. Mortality and health surveillance systems in general lack support, infrastructure, and validation, stressed Stoneburner.
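The two measures defined above can be written out as a short sketch. The function names, the per-1,000 scaling, and the example numbers are assumptions made for illustration; the workshop summary does not prescribe a particular formula or unit.

```python
def survival_percentage(alive_at_end, cohort_size):
    """Percentage of a cohort still alive after a fixed follow-up period."""
    if cohort_size <= 0:
        raise ValueError("cohort_size must be positive")
    return 100.0 * alive_at_end / cohort_size


def mortality_rate(deaths, population, per=1000):
    """Deaths in an area per `per` people per unit of time (for example, per year)."""
    if population <= 0:
        raise ValueError("population must be positive")
    return per * deaths / population


# Hypothetical treatment cohort: 200 patients started ART, 170 alive at 12 months.
twelve_month_survival = survival_percentage(170, 200)

# Hypothetical district: 45 registered deaths in one year among 30,000 residents.
annual_rate = mortality_rate(45, 30000)

print(twelve_month_survival)  # 85.0
print(annual_rate)            # 1.5 deaths per 1,000 population per year
```

Both calculations depend on complete counts: patients lost to follow-up distort the survival numerator, and unregistered deaths lower the apparent mortality rate.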
In many countries, Stoneburner noted, capturing all deaths and ensuring the accuracy of mortality data are a problem. For example, many countries in sub-Saharan Africa are registering only a fraction of all deaths; nevertheless, appropriate analysis of such data may still be useful in measuring mortality changes over time. Accurately ascertaining the cause of death in the general population is a further problem, noted Diaz. In some countries, however, such as Botswana, both the accuracy and capture of mortality data are thought to be high (90 percent to 95 percent). The variability in data capture and accuracy from country to country may contribute to a disconnect between observed mortality and modeled mortality and may indicate a need to adjust models to better replicate empirical data, observed Stoneburner. He shared an example from Botswana in which a model predicted more deaths than the registration data reported, likely a result of overestimating HIV incidence due to too short an assumed survival in the model.

Speaker Jonathan Mwiindi of the Kijabe HIV/AIDS Relief Program in Kenya asked a question about a case in which increased mortality rates of patients coinfected with HIV and TB during early ARV treatment were perceived to relate to ARV treatment failure rather than the inadequate recognition and treatment of TB. Stoneburner responded that better use of existing TB and ARV surveillance data, including ARV cohort survival analyses, rather than reliance on population-level vital registration data would better identify risk factors for such adverse outcomes and better guide clinical management.

Workshop participants identified a number of opportunities for more effective measurement of mortality and survival rates. Diaz commented that "years of life added" might be a more appropriate measure than "lives saved" among the HIV-infected population because life may only be prolonged for a limited number of years. Several speakers emphasized the importance of improving the quality of mortality data among both the infected and general populations. Diaz urged more aggressive pursuit of information on patients lost to follow-up and standardized methods for collecting information from hospital records.

Several innovative methods for improving mortality data were suggested. Discussant Fowler mentioned that the Health Metrics Network at WHO has done research on verbal autopsies for following up on deaths in households and determining cause of death. One verbal autopsy scheme under consideration, being developed by BUCEN and called sample vital registration with verbal autopsy (SAVVY), involves a census of a sample of households from different areas of the country, with follow-up over a period of time when there have been deaths. Census enumerators return to the households to collect information, which is reviewed independently by medical doctors to obtain a cause of death.
Corporate-sector surveillance systems may also provide important data on early impact on mortality of treatment programs, suggested Stoneburner. He shared results from a private-sector mortality attrition study conducted in Malawi showing early impact of ART programs, which was later expanded to a larger corporate-sector study gathering data from seven businesses (see Figure 4-2). Stoneburner acknowledged that the private sector is very selective: it is more likely to provide access to treatment at the workplace and to target employees who are well educated and highly motivated to stay on treatment. Nevertheless, reduced mortality in the private sector may be an important early indicator of expected response in the general population, once enough people have been treated to see a change in mortality.

FIGURE 4-2 Private-sector attrition data show evidence of early ART impact on mortality. [Chart not reproduced. Axes: total employees (0 to 2,500) and % died (1.0% to 3.0%) by year, 2001-2005; series: employees and deaths; annotation: ART rollout in mid-2005.]
SOURCE: Adapted from Partners In Impact: Results Report, The Global Fund to Fight AIDS, Tuberculosis and Malaria, March 2007. Based on data provided by the National AIDS Commission Trust of the Republic of Malawi, Principal Recipient for the Global Fund programs.

Speakers also emphasized the importance of gathering age-specific and population-based mortality data for measurement of mortality impact. Diaz noted that impact on mortality may be most usefully measured by concentrating on the young adult population and excluding the most obvious, non-HIV-related causes of death, such as accidents and maternal mortality. Stoneburner reinforced the importance of gathering age-specific, population-based mortality data, describing a Botswana study that assessed impacts of ART programs on adult mortality and PMTCT programs on infant and child mortality (WHO, 2006). Although effects of ART programs clearly correlated with declining mortality in the 25-54 age category, in the 0-4 age range, no mortality decline was observed despite high use of azidothymidine (AZT) by mothers and infants. The unexpected lack of mortality decline in children could relate to inaccuracies in mortality capture, but more importantly may relate to factors impeding overall intervention effectiveness, such as the added risk of increased infant mortality related to infant feeding practices. In summary, use of available data can identify important changes in the dynamics of impacts of interventions that would not otherwise be detected through routine M&E tools.

Measuring Behavioral Change and the Impact of Behavior-Change Interventions

Workshop participants discussed challenges and opportunities in measuring behavioral change and the impact of interventions that modify risk behaviors. These interventions, noted speaker Latkin, include those to modify sexual behaviors, injection behaviors, and drug-adherence behaviors.

Gaps in data and surveillance are a major challenge in measuring behavioral change. Discussant Caroline Ryan, Office of the U.S. Global AIDS Coordinator (OGAC), pointed out the need for more qualitative data and more behavioral surveillance, such as behavioral sentinel surveillance with biomarkers, to be conducted on a more consistent basis in order to understand the drivers of infection. Tools currently available include the DHS+ studies and assorted simulation models, she said. Although these are providing information on who is affected and what the behaviors are, information on why specific populations are affected is also needed. Prevention efforts need to be more heterogeneous to be effective because of the substantial variation within populations, commented Ryan.

Evaluation methods are also currently limited in their ability to determine coverage and extent of behavioral change occurring, observed speaker Latkin. While some interventions, such as media communication, result in small changes but large coverage, others, such as behavioral-change counseling, result in large behavioral changes with narrower coverage. More information is needed to determine how much behavioral change and how much coverage are needed to change the course of the epidemic.
For some interventions, such as adherence to ARV treatment, incomplete behavioral change has consequences that are even worse than no behavioral change, noted Latkin, because moderate adherence could provide a greater selective pressure for the evolution of viral drug resistance compared to poor adherence. Methods for determining such unintended impacts need to be developed.

A further challenge of measuring behavioral change and the effects of behavior-change interventions is the potential influence of factors independent of the program intervention. As Latkin pointed out, cultural and structural factors may affect a program, leading to a success of an intervention in one context and a failure in another. For example, he noted, an identical intravenous drug user intervention used successfully in the United States failed in Thailand because of a specific law on drug use in Thailand. Speaker Stoneburner further reinforced the idea that although changes in HIV prevalence may result from behavior-modification interventions, they may also result from other factors that have nothing to do with the intervention. Frequently, such change comes about not because of a specific targeted intervention from an outside agency, but because of a complementary, indigenous community response, or even from natural epidemic dynamics. For example, in Uganda, declining HIV prevalence correlates with declines in multiple sexual partners, a population-level indicator of behavioral change (Stoneburner and Low-Beer, 2004; MOH and Macro, 2006) that generally occurred before major externally funded HIV interventions. In contrast, in Botswana, extraordinarily high and stable HIV prevalence is associated with high levels of condom use (80 percent) for the past decade, no evidence of declines in multiple sexual partners, and a plethora of resources and externally funded interventions. However, posited Stoneburner, incipient declines in HIV noted since 2004 may have more to do with mortality breaking up sexual networks than interventions.

In a further example from the Malawi context, in which a similar pattern of declines in HIV prevalence and declines in multiple sexual partners among males was tracked between 1996 and 2004, the association between declines in HIV prevalence and evidence of behavior change is less clear. Despite data suggesting substantial declines in HIV prevalence among youth in urban and semi-urban areas, there is no similar trend in the few rural sites where data are available. Trends in behavioral indicators since 1996 suggest substantial declines in multiple partners overall, but when stratified by residence, the major decline from 1996 occurred among rural rather than urban males. Declines in prevalence among urban youth may be related solely to natural epidemic dynamics, noted Stoneburner, but a more plausible hypothesis is that prevalence declines relate to behavior change preceding the behavior changes observed in rural areas but not captured through the survey.

Workshop participants noted several opportunities for evaluating behavioral change.
Latkin urged a shift in measuring change from what is happening at the individual level to what is happening at the social and institutional levels (such as community, network, or national levels). Change at these levels is necessary to build infrastructure for and sustain risk reduction at the individual level, he noted. Instead of using the individual as the unit of analysis, evaluators should be assessing a random sample of programs. Latkin also provided guidance on what should and should not be measured to track behavioral change. Knowledge of and contact with programs are indicators that tend not to be associated with behavioral change. Opinions of leaders, impediments to behavioral change, and unintended negative consequences of behavioral change are among the indicators that should be measured. The evaluation of behavioral change should be integrated at multiple levels within PEPFAR as opposed to focused within a program, and it should systematically engage both the scientific community and stakeholders, recommended Latkin.

Speaker Garnett highlighted the availability of other tools for evaluating behavioral change. Behavior surveys are a useful tool for attributing changes in HIV incidence to specific changes in risk. Such surveys should specifically target younger people 1 or 2 years after sexual debut, he suggested. Sampling this demographic can be used to calculate cumulative incidence of HIV and can serve as an early indicator of the success of HIV infection prevention interventions. HIV prevalence changes in response to behavioral change interventions are more marked among young people (ages 15-19) as compared to older people (ages 40-44).

Modeling is another instrument for linking behavior-change interventions to prevalence and incidence outcomes, noted Garnett. Modeling can simulate the degree of deviation between what can be expected from the natural course of the epidemic and what can be achieved through various interventions. Models can be used to predict what is known as the counterfactual, or the outcome that would have occurred had the donor or intervention been absent. Models that combine trends in prevalence and incidence with studies of risk behavior can be a useful tool for retrospectively understanding how interventions might have worked to maximize declines in HIV prevalence. Simulation models show maximal declines in prevalence as high-risk behaviors decrease, leading to reduced incidence and fewer replacements of those HIV-infected people who die (such models must simultaneously take into account the opposite effect of ART treatment in decreasing the rate of decline in prevalence as the death rate of infected persons decreases). If interventions are successful in changing behaviors, incidence is lower, and declines in prevalence are maximized because people dying are not instantly replaced by people with similarly high-risk behavior.
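The counterfactual comparison described above can be illustrated with a deliberately simplified sketch. This is not any of the models discussed at the workshop: it assumes a closed population of constant size, a single transmission rate that stands in for risk behavior, and a single death rate among infected people. All parameter values are hypothetical.

```python
def simulate_prevalence(years, transmission_rate, death_rate, start_prevalence,
                        change_year=None, changed_rate=None):
    """Yearly HIV prevalence (as a fraction) under a toy two-state model.

    Assumptions (illustrative only): constant population size, with each
    death replaced by an uninfected entrant; new infections proportional to
    prevalence * (1 - prevalence); from `change_year` onward the transmission
    rate switches to `changed_rate` to mimic behavior change.
    """
    prevalence = start_prevalence
    path = [prevalence]
    for year in range(1, years + 1):
        rate = transmission_rate
        if change_year is not None and year >= change_year:
            rate = changed_rate
        new_infections = rate * prevalence * (1.0 - prevalence)
        deaths = death_rate * prevalence  # a lower death_rate would mimic ART
        prevalence = prevalence + new_infections - deaths
        path.append(prevalence)
    return path


# Counterfactual: no behavior change. Scenario: transmission halves from year 5.
counterfactual = simulate_prevalence(20, 0.12, 0.10, 0.15)
intervention = simulate_prevalence(20, 0.12, 0.10, 0.15,
                                   change_year=5, changed_rate=0.06)

for year in (0, 5, 10, 20):
    print(year, round(counterfactual[year], 3), round(intervention[year], 3))
```

In this sketch prevalence falls once new infections no longer replace the infected people who die, which is the mechanism described in the text. Lowering `death_rate` in the intervention scenario slows that decline, which is the opposing ART effect noted above.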
Modeling of the Ugandan HIV/AIDS epidemic has allowed researchers to simulate the effects of various behavioral changes (increased condom use, delayed sexual debut, and decrease in partner change) on prevalence. The observed prevalence data best fit a scenario in which all three behaviors changed at once (Hallett et al., 2007). Similarly, in Zimbabwe, modeling the declines in prevalence also indicates that risk behaviors are changing and leading to decreased incidence. These prevalence results in Zimbabwe have been corroborated through randomized controlled trials conducted at two time points. Surveys conducted in conjunction with the trials demonstrate that declines may have resulted from behavioral changes such as forgoing casual sexual partners and reducing simultaneous partners (Gregson et al., 2006). The success of behavior-change interventions is highly contextual, however. The failure of prevalence levels to decline in Côte d'Ivoire suggests no impact of interventions on risk behaviors.

Measuring Stigma and Discrimination

Stigma and discrimination (negative attitudes, beliefs, and actions toward people who are perceived to have HIV/AIDS and those associated with them) are an important part of the impact evaluation picture, workshop participants noted, but methodologies for studying them are limited. Speaker William Holzemer from the University of California, San Francisco urged the development of more rigorous research and data collection approaches. Most of the literature on stigma and discrimination, he noted, is based on anecdotal evidence, testimonials, and a few qualitative studies. Perceptions of a reduction in stigma and discrimination are based, for example, on observations of increased numbers of patients seeking testing and long lines of people waiting to access ART (Holzemer and Uys, 2004). In a recent review of the literature, not one stigma-reduction intervention trial had any rigorous measure of stigma that could be used to draw a conclusion about a particular intervention. Holzemer emphasized the importance of developing scales to measure stigma; the effects of stigma on infected individuals, families, and health care providers; and the effectiveness of strategies for mitigating stigma. Citing reports he had seen suggesting that women from Mozambique who use antenatal services are automatically assumed to be HIV-positive (IRIN, 2007), discussant Fowler also noted the importance of developing measures that would track the extent to which stigma is a factor in patients who seek other services, such as antenatal care.

New methods for measuring stigma and new sources of data may provide opportunities that will be useful to the future impact evaluation of PEPFAR, workshop participants said. Speaker De Lay mentioned that a new tool for measuring stigma is now available from the International Planned Parenthood Federation (IPPF et al., 2008).
This index includes a measure of self-imposed stigma, which can capture the failure of persons living with HIV/AIDS (PLHAs) to access services because of fear of rejection or perception that their future is too limited to justify attempting to access services (such as education) in the long term. Holzemer referred to other new measures now available that can be valid and reliable instruments for measuring stigma (Holzemer et al., 2007). These instruments, which reflect measures of people's perceptions, include 33 factors and are based on the reported frequency of occurrence of verbal abuse, negative self-perception, health care neglect, social isolation, fear of contagion, and workplace stigma based on HIV status.

Although few empirical studies exist, a few correlational studies and new sources of data on stigma are emerging, noted Holzemer. Focus group data collected from five African countries have assessed stigmatization of patients by health care workers among more than 1,500 nurses and 1,500 PLHAs. Studies suggest that among the HIV-infected, stigma has impact on participation in testing, the use of services (such as giving birth at home instead of returning to a clinical facility), adherence to medication, health status, and quality of life (such as loss of social support, isolation, violation, verbal violence, and limiting social interactions). The data indicate clearly that HIV patients are treated poorly by health care providers, including nurses, physicians, and others. The new studies have shown that stigma also has impact on the quality of work life and quality of life for health care workers and their families. Health workers and their families may be stigmatized by their neighbors because fear of contagion is an underlying cause of stigma. New data are showing that testing, diagnosis, having the disease, physical manifestations of AIDS, status disclosure, suspicion, and rumors are all triggers to the cascade of stigma events.

Measuring Changes in Orphanhood Prevention

Workshop participants discussed some of the challenges and opportunities in measuring changes in orphanhood prevention, that is, the prevention of the death of one or usually both parents of a child. Speaker Diaz noted that measurements need to distinguish between children who have lost one parent (single orphans) and children who have lost both parents (double orphans) to HIV. In addition, HIV status should be taken into account in the calculation of years of orphanhood averted. Treatment, both of HIV-positive orphans and of HIV-positive parents, can have an impact on orphanhood. Diaz pointed out that ART treatment of orphans actually extends years of orphanhood.
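As a rough illustration of what a "years of orphanhood averted" calculation might look like for one child and one parent, consider the sketch below. It is not a method presented at the workshop. It assumes orphanhood is counted only until age 18, and it ignores the child's own HIV status, the single versus double orphan distinction, and any children born later. All inputs are hypothetical.

```python
ADULT_AGE = 18  # assumption: orphanhood years are counted only up to this age


def orphanhood_years(child_age_at_parent_death, adult_age=ADULT_AGE):
    """Years a child spends as an orphan before reaching adulthood."""
    return max(0.0, adult_age - child_age_at_parent_death)


def orphanhood_years_averted(child_age_now, parent_years_without_art,
                             parent_years_with_art, adult_age=ADULT_AGE):
    """Difference in orphanhood years with and without parental ART."""
    without_art = orphanhood_years(child_age_now + parent_years_without_art, adult_age)
    with_art = orphanhood_years(child_age_now + parent_years_with_art, adult_age)
    return without_art - with_art


# Hypothetical case: a 6-year-old whose parent would survive 2 years
# without ART and 9 years with ART.
averted = orphanhood_years_averted(6, 2, 9)
print(averted)  # 7.0

# If ART extends the parent's life past the child's 18th birthday,
# all remaining orphanhood years are averted.
print(orphanhood_years_averted(6, 2, 15))  # 10.0
```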
Discussant Mead Over of the Center for Global Development observed that although treating HIV-positive parents can reduce orphanhood years of existing children by prolonging parents' lives, it can also generate years of orphanhood among children who are born to HIV-positive parents during treatment. He also mentioned a limitation in the ability to conduct cost-effectiveness analysis of orphanhood-prevention interventions. No method yet exists for expressing "orphanhood years averted" in terms of healthy life years, the usual common denominator for a benefit in cost-effectiveness analysis. Over called for a crosswalk between orphanhood years averted and the dollar value of a healthy life year in order to better integrate evaluation of averted orphanhood into cost-effectiveness analysis.

New models are in development to better quantify the impact of treatment and prevention in preventing orphaning of children, noted speaker Pacqué-Margolis. Diaz also offered guidance on potentially useful indicators, including "years of orphanhood averted" and "number of children who reach age 18 before the death of a parent whose life is extended by ART." Differences in overall numbers of orphans within a given time period with and without ART can also be examined, said Diaz.

Measuring Change in the Development of Drug Resistance

Workshop participants noted the importance of evaluating the development of viral resistance to drugs, that is, the evolved capability of HIV to withstand a drug to which it was previously sensitive. Speaker Diaz stated that two strategies, both used by PEPFAR and WHO, are available to measure drug resistance: threshold surveys and therapy monitoring. The threshold survey can be used to assess transmitted HIV infection using blood tested at ANC sentinel surveillance sites. Blood sampled from young women (younger than age 25) in their first pregnancies who are not likely to be in ARV treatment can be used to track the transmission of drug-resistant HIV strains. A second strategy for measuring drug resistance is to sample and monitor patients in ARV treatment from the initiation of therapy over a 1-year period. Indicators of drug resistance such as outcome, viral load, and drug adherence can be monitored using this method.

Measuring Broader Impacts

Most participants agreed that in addition to measuring HIV/AIDS impacts of PEPFAR interventions directly, a broader interpretation of impact is also meaningful. Large-scale vertical programs such as PEPFAR can have far-reaching effects, either intended or unintended, beyond HIV/AIDS. In addition, as speaker Compton observed, PEPFAR and other donors are increasingly investing in less narrowly defined interventions that are not so amenable to a conventional evaluation framework of inputs, outputs, outcomes, and impacts. The branching out by donors to broader areas such as gender and nutrition has made impact evaluation increasingly complex.
Workshop participants discussed some of the challenges and opportunities in adapting a traditional evaluation framework to measure broad impacts or unintended impacts of PEPFAR interventions. This section describes the measurement of the impact of the following: health systems strengthening, complementary interventions, gender-focused activities, coordination and harmonization, and population-level service delivery.

Measuring Impacts of Health Systems Strengthening

Workshop discussants brainstormed on how changes in health systems can be measured. These changes span a broad range of factors related to health care service delivery, such as accessibility, quality, efficiency, and equity of services; management; procurement and distribution systems; human resource use; policy environment; and infrastructure. Speaker Compton suggested possible indicators to help track whether reliable and sustainable partner institutions are in place. The approach is similar to the system audit that many auditing organizations use to determine whether systems have been established that would enable donors to give money directly to institutions with confidence. The indicators could include the following capabilities: to collect information and use it to make good decisions, to plan and budget efficiently, to implement projects effectively and efficiently, and to monitor and evaluate and to collect reliable numbers needed by members of Congress, Parliament, and others. Two case studies were also presented describing indicators and methodologies that can be used in evaluating health systems strengthening.

Evaluating health system-wide impacts of Global Fund interventions. Workshop speaker John Novak, senior monitoring and evaluation adviser of the Office of HIV/AIDS at the U.S. Agency for International Development, presented the experience of evaluating health system-wide impacts of Global Fund interventions carried out in Ethiopia, Malawi, and Benin. The evaluation effort was carried out by the System-Wide Effects of Global Fund (SWEF) network, a collaboration of research institutions seeking to understand how global health initiatives affect the broader health system. A core assumption of the evaluation framework is that programs that massively infuse resources into country health systems can improve or detract from health system accessibility, quality, efficiency, and equity. In Benin, The Global Fund provided 15 percent to 20 percent of the government spending per capita; in Ethiopia and Malawi, it provided 50 percent. The resulting effects on the health system can be intended or unintended.
Therefore, any evaluation should go beyond vertical programs and focal diseases to assess effects on the entire health system. The SWEF evaluation assessed impact of Global Fund interventions on the following four parameters:

• Policy environment (harmonization, alignment, and ownership)
• Human resource use (number, allocation, skills, retention, and motivation of health workers)
• Public-private services and collaborations (number, distribution, and organization of actors; trust and cooperation between sectors)
• Pharmaceutical and commodity procurement and distribution systems

Both quantitative and qualitative methodologies were used in the evaluation. Quantitative facility surveys were conducted in a sampling of health facilities to assess staffing, management, patient referrals, drugs and supplies, lab services, and curative care services. Quantitative provider surveys were used to measure impact on individual providers and facilities receiving funds and to assess training, supervision, motivation, and job satisfaction. In-depth qualitative interviews with important stakeholders were also conducted throughout the entire health system.

Novak stressed the importance of monitoring both positive and negative impacts of interventions, which can help countries address critical issues in the health system. For example, the SWEF evaluation results showed positive impacts on the health system, such as greater participatory engagement, decentralization, the emergence of new public-private collaborative arrangements, creation of improved incentives and work environment for those working in HIV/AIDS, and harmonization of pricing and cost-recovery approaches. There were also some negative impacts, such as delivery-level constraints as HIV/AIDS drew both human resources and services away from other health areas, and poorly functioning procurement and distribution systems in some countries. Challenges of using this more descriptive methodological approach include the lack of empirical estimates of impacts, small sample size, short time interval over which change was evaluated, and lack of ability to easily attribute impact.

Evaluating impact of HIV/AIDS interventions on non-HIV primary health care services. Jessica Price, Rwanda country director of Family Health International (FHI), presented results from a study conducted in Rwanda testing the hypothesis that HIV/AIDS interventions increased the quantity of non-HIV primary health care services delivered. Study data were derived from the review of monthly activity reports submitted by health centers to the government of Rwanda.
The study compared the quantity of non-HIV health services delivered before and after the introduction of basic HIV care, defined as services including counseling and testing, PMTCT, preventive therapy, and basic upgrades to health center infrastructure. The study assessed 30 FHI partner health centers from 4 provinces and 14 districts in Rwanda, representing 21 faith-based centers and 9 public centers. Hospitals that do not deliver some non-HIV services and health facilities with fewer than 6 months' experience delivering basic HIV care were excluded from the study.

A set of 88 indicators of non-HIV services delivery was tracked, with 22 indicators considered to represent the best range of public health services. These included general services (such as inpatient and outpatient services and lab tests), reproductive health services, and services for children. In addition to monitoring impacts of HIV/AIDS interventions, the study also tracked impacts of two other health programs (primary health care insurance and performance-based financing) and used regression analysis to isolate the independent effects of HIV/AIDS interventions. The analysis consisted of calculating mean quantities of non-HIV services delivered per primary health center per month between the two time periods, testing for significant differences, and conducting regression analysis to control for experience with other health programs (insurance and performance-based financing) to determine which program, if any, had an independent effect on the observed change.

The HIV programs were shown to have had an independent effect in a number of indicators across a range of areas. These areas included improved coverage for antenatal visits and services, use of health care facilities for maternity services by HIV-positive women, syphilis screening, family planning services, child vaccination and growth-monitoring services, outpatient consultations, and hospitalization services.

Limitations and challenges of the methodology were discussed. In future analyses, evaluation of the impacts of HIV programs should also include hospital settings. Indicators could also be tracked for impacts on other diseases (such as malaria, TB, and sexually transmitted infections), quality of patient care, costs of HIV-specific services (such as HIV tests) versus non-HIV-specific services (such as infrastructural upgrades like incinerator construction and maintenance of electricity), and client and provider satisfaction. Future studies should also look at larger sample sizes over longer time periods. A random selection of sites should also be considered in future studies, noted speaker Field-Nguer. The fact that all chosen sites were FHI partners may have given them a competitive edge, she noted. If being FHI sites did not confer an edge, then perhaps access to services can be replicated at any site in Rwanda.
But if FHI status did confer an edge, then perhaps unique attributes of the partnership can tell us something about how to replicate the impact, she noted. Workshop participant Laura Porter of CDC added that future studies will need to ensure that service delivery improvement is a real effect and not just an artifact of data system improvement.

Measuring Impact of Complementary Interventions

As described in Chapter 2, PEPFAR investments include numerous interventions in programs complementary to more narrowly focused HIV services. These so-called wraparound programs include interventions in areas such as malaria, TB, nutrition education, food security, social security, education, child survival, family planning, reproductive health, medical training, health systems, and potable water.

Workshop speaker Bertozzi described methodologies from two case studies from Mexico in which such complementary interventions were evaluated: a human-capacity development program for children and a food assistance program.

The Oportunidades program is a Mexican government-sponsored human-capacity development program for Mexico's poorest children. Financial incentives to parents are offered through the program for ensuring children's participation in health, nutrition, and educational services. The Programa del Apoyo Alimentario (PAL) program provided food assistance (either food or cash payments) to small rural communities in Mexico. Impact evaluations of both Oportunidades and PAL were conducted using prospective randomized evaluation, in which later program enrollees were compared to earlier program enrollees. Both health impacts and education impacts were monitored through the evaluations. For Oportunidades, health indicators tracked included use of preventive services (such as well visits and vaccinations), use of curative services, out-of-pocket expenditures, and anemia prevalence. PAL health impacts monitored included height-for-age, weight-for-height, and weight-for-age. Education indicators monitored in the Oportunidades program included grade-level achievement, attendance, early enrollment, and repetition of grades.

The evaluative approach from these studies could potentially be applied to the evaluation of complementary interventions in the PEPFAR program, particularly to health and educational interventions targeting orphans and vulnerable children, noted Bertozzi. Other indicators of "basic capability" child care interventions could include zinc status, sick days, days incapacitated, prevalence of risky and healthy behaviors (such as alcohol use, sexual activity, and exercise), and educational performance.

Bertozzi emphasized the importance of controlling for secular (long-term, noncyclical) trends in impact evaluation. Such trends can sometimes have a large effect independent of the intervention.
For example, malnutrition indicators were tracked in the poorest rural communities in Mexico in the 5 years leading up to the start of the PAL program (ENN-1999 versus PAL-2004, the baseline for the PAL intervention). In the absence of any intervention, noted Bertozzi, extraordinary secular trends led to a halving of malnutrition indicators in these communities. Any intervention conducted during this 5-year period would have given the appearance of stimulating a large positive effect when there might have been none at all, or perhaps even a negative effect.

Measuring Impacts of Gender-Focused Activities

Workshop participants discussed some of the challenges and opportunities for evaluating the impacts of gender-focused activities, including those interventions to promote gender equality and women's empowerment. Noting that gender equality and women's empowerment are multidimensional, open, complex, nonlinear, and adaptive systems, speaker Glenzer observed that it is seldom clear what variables are or are not involved. It is a challenge to define what constitutes success and what it looks like on the ground. Glenzer said some of the difficulty in tracking change in gender systems relates to the following characteristics: the large-scale effects of small changes over time, the separation of causes and effects over large spatial and temporal scales, the multiple levels over which change may occur, and the heterogeneity of systems.

Speaker Julie Pulerwitz of the Population Council acknowledged the difficulty in implementing rigorously designed evaluations and called for more consensus building about how to operationalize the concept of gender and how to evaluate gender-related activities. Although gender is generally recognized as important, she added, there have been few outcome evaluations and few tools developed on how gender-focused activities affect HIV risk. Few good indicators exist that are useful in understanding social dynamics, and evaluation schemes often underrepresent the perspectives of local people, who are a source of such knowledge, noted Glenzer.

Speaker Pulerwitz described a new method now available for studying the impacts of gender-focused activities and how those impacts can contribute to PEPFAR goals. Pulerwitz directs an operations research program called Horizons at the Population Council that has conducted studies using this method. Pulerwitz shared the study design and tools used for an evaluation of gender-focused programs (group education, community-based behavioral change communication campaigns, and clinical activities) focused on young men in Brazil.
A combination of data collection approaches was used, including the following:

• Pre- and postintervention surveys and a 6-month follow-up survey for three groups of young men (two intervention groups and a comparison group, which eventually also received the interventions after a time delay) followed over a year
• In-depth interviews with a subsample of young men and their sexual partners
• Costing analysis and monitoring forms for different activities

An evaluation tool called the Gender Equitable Men's (GEM) scale was used to look at gender norm attitudes and how they changed over time (Barker, 2000; Pulerwitz and Barker, 2008). The scale includes 24 items covering areas such as home and child care, sexual relationships, health and disease prevention, violence, homophobia, and relations with other men. Certain GEM scale domains are associated with partner violence, level of education, and contraception use. The GEM tool was used to detect significant changes in attitude toward equitable gender norms and in support of inequitable gender norms in the two intervention groups as compared to the control group. HIV outcomes (condom use with primary partners) were also tested, and one of the intervention groups showed an increase as compared to the comparison group. The study also looked at covariance between changes in attitudes toward norms and changes in condom use; men who were more gender equitable were more likely to report condom use. The in-depth interview component of the analysis unearthed other changes among those in the test groups, including a delay in sexual activity in new relationships.

The evidence generated by the evaluation is supportive of interventions that target gender dynamics and their influence on HIV risk behavior in Brazil, concluded Pulerwitz. She noted that there are ongoing or planned efforts to adapt the GEM tool to other country contexts (India, Ethiopia, Namibia, Uganda, and Tanzania) and to other demographic groups, such as married men. Preliminary findings show that results can be highly country specific. Although a similar trend toward more equitable attitudes has been observed in the work conducted in India, baseline attitudes in that country are much less supportive of equitable gender norms than those in Brazil.

Measuring Coordination and Harmonization

Workshop speaker De Lay spoke of a new opportunity for measuring coordination and harmonization, that is, the alignment of interventions with country-level plans and coordination of efforts among other implementing partners. A new tool, known as the Country Harmonization and Alignment Tool (CHAT), developed by UNAIDS and the World Bank, is now available and could be applied to the standardization of alignment of interventions with country-level plans and coordination of efforts among partners (UNAIDS, 2007a).
The tool has been used to assess harmonization and alignment of the national plan, coordinating mechanism, and M&E plan in six pilot countries, and a launch of the tool is planned in two more countries. The tool has revealed that many national plans are still not credible, not costed appropriately, not prioritized, and not actionable. In addition, the tool has shown that few countries have a central funding channel or single procurement system for the HIV/AIDS response. The tool has also shown that "basket funding," or joint funding by multiple donors, is not normally used. Although donors support the notion of the development of indigenous national M&E capacity, the tool has revealed that in practice donors usually rely on their own M&E systems to collect urgent data when needed.

Measuring Community-Level or Population-Level Service Delivery

Workshop speakers spoke of the challenges of scaling up successful service-delivery interventions for specific populations, such as children, families, communities, and high-risk groups. As workshop speaker Bertozzi observed, sometimes it is difficult to distinguish between a community-level or population-level effect and the effect of an intervention. Tools are needed, noted speakers Kathy Marconi of OGAC and Stoneburner, to measure the effectiveness of interventions in specific populations, including communities, diverse populations, and at-risk or infected populations.

Speaker Field-Nguer announced that a new and important addition to the evaluation toolbox is now available: community-level program information reporting systems (CLPIR) (personal communication, R. Yokoyama, John Snow, Inc., January 18, 2008). CLPIR indicators look strictly at community-level service delivery and help answer questions such as when, how, and where people want testing and treatment.

Attributing Impact

Given the diversity of programs and funders, attributing impact, or relating a particular effect to the work of a specified agent, is a substantial methodological challenge in evaluation, workshop participants said. The World Bank experience shows that because loans or grants are made to governments, speaker Ainsworth said, performance of activities depends heavily on governments, and it is therefore difficult to disentangle the efforts of government and any particular donor from the efforts of all other donors. Even within the programs of a single donor, noted speaker Gootnick, accounting can be complex. Some interventions can be double counted; for example, voluntary counseling and testing is included under both the prevention and care modalities.
As PEPFAR moves increasingly toward more harmonized approaches, noted speaker Compton, it will be even more difficult to disentangle effects in an exclusive way.

Many workshop participants agreed that the demand for exclusive attribution by donors may not be constructive. General evaluation of what is and is not working, in contrast, may be desirable, noted workshop moderator Ruth Levine of the Center for Global Development. Speaker Glennerster emphasized that it is preferable to test what works in very specific areas and then judge a program by whether it spends money on interventions whose effectiveness is supported by evidence. All programs are doing many things in-country; they are implementing many different policies. If we want to be effective in focusing resources on what works, we need to identify which interventions have the most impact and which are most cost-effective, she said. Speaker Diaz reinforced this idea, stating that a worthwhile attribution goal should be to know the effectiveness of certain programs and their coverage in terms of impact measures. A useful attribution exercise, she suggested, might be to determine what level of ART coverage decreases general mortality and what types of prevention activities, in which populations, decrease HIV incidence. Ainsworth added that it is nevertheless useful to analyze the value added of the unique approaches of particular donors.

An important dimension of attribution is the concept of the counterfactual, or the assessment of what would have happened differently had the donor not intervened. Some speakers noted that absence of the donor does not necessarily imply that nothing would have happened. Discussant Jim Sherry of George Washington University observed that one consequence of donor interventions is that the donor occupies a particular space and prevents other organizations from filling it. As speaker Bertozzi pointed out, in the case of South Africa, even if outside institutions did not intervene, given the massive social mobilization potential in the country, dramatic change could have been effected without outside help.

Aggregating Evaluation Results

Several speakers noted that the synthesis or aggregation of evaluation results is a methodological frontier. Workshop participant David Dornisch of the U.S. Government Accountability Office proposed that meta-analysis or synthesis could be used to bring together the results of multiple studies. From the congressional perspective, workshop participant Naomi Seiler from the U.S. House of Representatives Oversight Committee also stated that while prospective evaluation is useful, any type of meta-analysis or synthesis of what is already known about types of interventions, contexts, and populations would be helpful. Discussant Jimmy Kolker of OGAC echoed the need for data synthesis to be relevant to designing or implementing a program.
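
As a rough illustration of what a statistical synthesis of several evaluations could involve, the sketch below pools effect estimates using fixed-effect inverse-variance weighting and reports Cochran's Q as a simple check on heterogeneity. This is a generic textbook technique, not a method presented at the workshop, and it does not address the context-specific synthesis challenges participants raised. The function name `pool_fixed_effect` and all numbers are hypothetical and chosen only for illustration.

```python
import math


def pool_fixed_effect(estimates, std_errors):
    """Pool study-level effect estimates with inverse-variance weights.

    Returns (pooled estimate, pooled standard error, Cochran's Q).
    Hypothetical helper for illustration; not from the workshop.
    """
    if not estimates or len(estimates) != len(std_errors):
        raise ValueError("need one standard error per estimate")
    # Each study is weighted by the inverse of its variance.
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))
    return pooled, pooled_se, q


# Hypothetical country-level effect estimates and standard errors.
estimates = [0.10, 0.20, 0.30]
std_errors = [0.10, 0.10, 0.10]
pooled, pooled_se, q = pool_fixed_effect(estimates, std_errors)
print(f"pooled effect = {pooled:.3f}, SE = {pooled_se:.3f}, Q = {q:.2f}")
```

A Q value that is large relative to the number of studies would suggest that results differ by context, in which case a single pooled number may hide more than it reveals.
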
Workshop discussant Sherry observed, however, that methods for such synthesis have yet to be developed. Sherry predicted that the clustering of country-level assessments and evaluations will likely provide much more information through meta-analysis than one definitive, globally executed impact study. Although there is room for both kinds of evaluations, he noted, there is substantial room for improvement in the use of meta-analysis to look statistically at the results of these studies. Sherry observed that there may be inadequate separation of macro-, micro-, and meta-level evaluation processes, leading to an evaluation either not making sense to policy makers or not being rigorous enough for scientists. Micro-level evaluation tends to be too technical and too situation-specific to be digestible to institutions or useful for interventions. Macro-level evaluation tends to be too soft and too subject to evaluation spin to be digestible or credible. Durable findings are needed about programs that allow for more sustainable dialogue and learning at the meta-level in terms of evaluation.

Another workshop participant raised a question about the value of performing multiple evaluations. Speaker De Lay commented that although it is sometimes desirable to avoid duplication where it is not needed, sometimes duplication is necessary and multiple perspectives are desirable. For example, validation of existing data by an independent group is often a useful alternative to redoing an entire study.

Themes Common to Evaluation Methodologies and Approaches

This section distills some of the main messages and themes common to the discussions about evaluation methodologies and approaches.

Prioritization

Most evaluations require some type of prioritization to narrow down what is to be measured. Speaker Ainsworth noted that for long-term evaluations, for example, one might select only those issues common to all projects. For a large portfolio of activities, she added, one might select a more narrowly defined set of indicators.

Value of Consultation and Communication

Several speakers emphasized the value of consultation and communication in any evaluation approach. Speakers Compton and Glenzer observed that consultation and communication through the evaluation process are as important in effecting change and course corrections as the data from the evaluation results. It also matters who is consulted, observed speaker Field-Nguer.

Value of a "Learning" Evaluation

Many of the evaluation methodologies described were formative, or "learning," evaluations, designed to help improve institutional performance. As Glenzer noted, evaluation is a long-term learning experience that should unite relevant actors. Speaker Ainsworth added that bringing to bear the findings of past support can inform ongoing programs.
Using evaluation to understand the variation in outcomes, or the distribution of outcomes within a population, can help us learn, she said. For example, knowing the change in average life expectancy or the average change in behavior is not as interesting as knowing why behavior changed in one group of people but not another.

Others emphasized the heuristic value of negative evaluation results. Analysis of failures, observed speaker Field-Nguer, is sometimes more fruitful than success stories. Negative evaluation results should be divulged and shared, one workshop participant urged; if they are not shared, programs lose credibility and waste money. Speaker Glenzer noted that all of CARE's research reports are published on Emory University's website and include some research indicating that CARE is not having long-term impacts on women's empowerment or underlying causes of gender inequality.

The emphasis on learning evaluations contrasts with a more typical systemic bias in the international health community in which actors want to see programs continue, noted workshop discussant Sherry. Therefore, instead of being used for learning, evaluation is used to protect our interests and programs. Sherry underscored the importance of sustaining the institutional learning process. The isolation of evaluation departments in international health systems (analogous to the isolation of smart and reflective people in universities, organized into separate compartments so they have minimal effect on the society around them) is one obstacle to institutional learning, he noted. Decision-making cycles, such as 5-year cycles, reauthorizations, or external audits, bring evaluators into prominence briefly, but that prominence then fades away.

Also observing the existence of different consumers of evaluation, speaker Nils Daulaire of the Global Health Council emphasized the importance of having a single M&E system that satisfies multiple sets of needs. For example, if a customer for evaluation is Congress, then the evaluation will emphasize putting on the best possible spin, but that must be balanced with the use of evaluation on a daily basis to help improve program development and results.
One step in achieving a multiuse system is to give evaluators a role in program management and development as opposed to a peripheral role in projects.

Importance of Designing the Evaluation Early

Several speakers emphasized the importance of considering evaluation design early in the implementation process so that the design will be appropriate and so that impacts can be detected early. Speaker Compton urged that evaluations be set up at the beginning of the process, and speaker Bertozzi also spoke about some of the drawbacks of an ex-post evaluation. Speaker Glennerster noted that opportunities to use powerful randomization approaches exist, but they can be used only if the design is included at the beginning of an intervention. Field-Nguer and Bertozzi stressed the importance of baseline assessments, without which the wrong conclusions may sometimes be drawn.
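
The risk of drawing the wrong conclusion without a baseline and a comparison group can be illustrated with a small calculation. The sketch below contrasts a naive before-and-after comparison with a difference-in-differences estimate, a standard technique that was not named at the workshop and that assumes the comparison group follows the same underlying trend as the intervention group. All function names and prevalence figures are hypothetical; they loosely mimic a setting in which a secular trend alone halves an indicator.

```python
def naive_before_after(treated_before, treated_after):
    """Change in the intervention group only, ignoring any background trend."""
    return treated_after - treated_before


def difference_in_differences(treated_before, treated_after,
                              comparison_before, comparison_after):
    """Change in the intervention group minus change in the comparison group.

    Assumes both groups would have followed the same trend without the
    intervention (the parallel-trends assumption).
    """
    treated_change = treated_after - treated_before
    comparison_change = comparison_after - comparison_before
    return treated_change - comparison_change


# Hypothetical malnutrition prevalence (percent), illustrative values only.
treated_before, treated_after = 40.0, 20.0        # intervention communities
comparison_before, comparison_after = 40.0, 20.0  # similar communities, no program

naive = naive_before_after(treated_before, treated_after)
did = difference_in_differences(treated_before, treated_after,
                                comparison_before, comparison_after)
print(f"naive before/after change: {naive:+.1f} points")   # -20.0
print(f"difference-in-differences: {did:+.1f} points")     # +0.0
```

In this made-up case the naive comparison credits the program with a 20-point drop, while the comparison group shows that the entire change would have occurred anyway.
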

Understanding the Limitations of Models and Data

Workshop participants acknowledged the limitations of data and models used in evaluation. Speaker Pacqué-Margolis emphasized that empirical data are often inadequate, lacking, or inaccurate, and speakers Ainsworth and Compton emphasized that poor data quality at the country level is often a serious problem. Speaker Garnett emphasized the existence of data gaps for measuring efficacy in different epidemiological contexts. Age- and sex-specific empirical data are also lacking, noted discussant Fowler. Ainsworth stressed that incentives need to be created to encourage project staff and governments to establish and maintain monitoring efforts. Not all data are of the same quality, participants said. Speaker Glennerster noted that data based on self-reported behavior might have issues regarding reliability.

Models are powerful tools that can help in evaluation, but they also have limitations. Speaker Glennerster pointed out that models need to be validated with empirical data, and variables need to be added to them to make them more accurate predictors. Speaker Garnett also observed that models are less reliable predictors when the spread of HIV infection becomes epidemic.

Value of Multiple Methodologies

Several presenters noted the value of using multiple methodological approaches in evaluation. Speakers Compton and Ainsworth cautioned against relying exclusively on one evaluation methodology, and speaker Field-Nguer pointed out that multiple methods may yield richer results than one or two methodologies. Field-Nguer also noted that lack of a baseline assessment (as was the case in PEPFAR) may increase the importance of using several methodologies, including qualitative measures. Speaker Glenzer reinforced the point with his comment that centrally planned, mixed-method evaluation designs work best.
At the same time, the use of multiple methods should be strategic, noted workshop speaker Glennerster. She noted that currently organizations often conduct a confused mix of process/output and impact evaluations in too many places. Instead, she recommended conducting good process evaluations everywhere and a moderate number of high-quality impact evaluations focusing on a few key questions.

Value of Randomization

Multiple presenters emphasized the value of randomization tools in the conduct of evaluations. Glennerster pointed out that new methods of randomization are now available that can be integrated with evaluation with minimal disruption. In his presentation, Bertozzi also drew on evidence from randomized controlled trials. Speaker Field-Nguer pointed out that nonrandom selection of sites has the potential to limit or weaken a study. Workshop participant De Lay discussed some of the potential problems arising from the impracticality of randomization.

Comparison Across Contexts

Several workshop participants stressed the highly contextual nature of change when comparing across contexts. Evaluations that are centrally coordinated to permit comparison of variables across contexts, while allowing some flexibility in indicator design at the local level, are optimal, suggested speaker Glenzer. Interventions that are successful in one country are not necessarily transferable to another country, noted workshop speaker Stoneburner. Examples provided by Stoneburner and speakers Latkin, Garnett, and Pulerwitz supported this statement. In some cases, factors independent of an explicit program intervention can have an influence on change. In other cases, change in behavior does not always lead to a change in the pattern of the HIV/AIDS epidemic, and changes in the pattern of the epidemic cannot always be translated to a change in behavior. Close engagement of the scientific community in evaluation, urged speaker Latkin, can help to assess the likelihood of transferability of effective programs to other settings.


