6
Implementing Impact Evaluations in the Field

INTRODUCTION

A counterfactual question—how would things have looked in the absence of the U.S. Agency for International Development (USAID) program?—lies at the core of any design for impact evaluations. Chapter 5 made the case that randomized evaluations provide the soundest methodology for generating definitive answers to this question. However, it is one thing to specify what may be optimal theoretically and another thing altogether to implement that methodology on the ground. Practical impediments may make the implementation of randomized evaluation difficult, even impossible, at least in a pure form. For example, factors outside of USAID's control may make it infeasible to gather baseline data, to identify and monitor outcomes in a control group, or to select by lottery the units in which programs should be implemented. Although Chapter 5 provided examples of several successful randomized evaluations, only a handful of these are in the democracy and governance (DG) area, and none of them are examples of evaluations of USAID's own programs. Thus, even if one is willing to accept the desirability in principle of adopting the methodology of randomized evaluation, it is reasonable to wonder how readily it can be applied to the sorts of programs that USAID missions in the field regularly undertake.

To find out, the committee commissioned three expert teams to visit USAID missions overseas in an effort to assess the viability of impact evaluations for past and present DG programming. The key task for each team was to talk with implementers, local partners, and USAID mission



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.





personnel on the ground to assess the feasibility of actually implementing in practice the evaluation methodologies outlined in the previous chapter. The first part of this chapter presents the results of those field visits. The second part provides responses to the most commonly raised objections that the committee and its field teams heard expressed about the use of randomized evaluations in DG programs.

Before turning to the details of what the field teams found, it is important to highlight a clear and consistent message that came through from all three field visits. All three teams concluded, first, that the introduction of randomized evaluations into USAID project evaluation was both feasible and cost-effective in many of the contexts they investigated. They were unanimous that, where possible, adopting such methods would represent an improvement over current practices. Second, they reported that, for projects where randomized evaluations were not possible, other improvements to USAID evaluation—for example, improved measurement, systematic collection of baseline data, and comparisons across treated and untreated units—also have the potential to yield significant improvements in the agency's ability to attribute project impact. These issues are discussed in Chapter 7. Finally, the teams returned from the field energized by their interactions with mission staff and confident that a willingness, and even excitement, exists about improving the quality of project evaluations. The teams were also impressed with some of the work already being done as part of current project monitoring, in particular in the broadening of measurement strategies beyond project outputs to include an assessment of outcomes.
FIELD VISITS TO USAID MISSIONS

As a complement to the deliberations in Washington and extensive engagement with USAID staff and implementers, the committee felt strongly that its recommendations should be informed by a set of extended field visits to USAID missions. The committee therefore identified a set of missions, representing a diversity of regions, that were engaged in substantial programming on DG issues and were in the process of designing large, new projects in one of USAID's core DG areas (rule of law, elections and political processes, civil society, and governance). From the list of missions provided, USAID explored the willingness of the missions to host the team and consider new approaches to project evaluation. After negotiating issues of timing and access, USAID and the committee agreed to send field teams to Albania, Peru, and Uganda.

The field visits were intended to accomplish three main goals:

1. to better understand current strategies used for project evaluation, including approaches to data collection;

2. to explore the feasibility of introducing impact evaluations in the future, including (but not limited to) randomized evaluation; and
3. to obtain the perspectives of mission personnel and USAID implementers regarding the possibilities for, and impediments to, new approaches to evaluation.

The committee encouraged the field teams to explore the range of DG activities currently under way in each mission, assess the adequacy of current evaluation approaches, and provide concrete examples of how existing approaches could be improved. In addition, the field teams were directed to focus particular attention on the development of an impact evaluation design in one specific area in each mission. The teams focused on local government/decentralization in Albania and Peru and support for multiparty democracy in Uganda.1

Each field team was composed of methodological consultants (academic or other experts with relevant experience in research design or program evaluation, DG issues, and country or regional expertise); a Washington-based USAID staff member who was familiar with the mission, the committee's work, and USAID policies and practices; and National Research Council professional staff, who assisted the consultants in meeting the team's objectives and coordinated the logistics of the field visits.

In evaluating the findings of the three field teams, it is important to keep in mind that the field teams visited missions that had expressed an interest in improving their evaluation strategies. The field teams' conclusions about the applicability of impact evaluations, especially their sense that standard objections to these designs can be addressed, thus reflect the experiences gleaned from this (nonrandom) sample of missions.
It is not known if other missions, especially smaller ones with leaner budgets or those in countries experiencing violent conflicts or particularly rapid political change, would be as amenable to new approaches to evaluation: The committee has no control group of non-self-selecting missions with which to compare its findings. Yet the committee believes it unlikely that missions that did not invite the committee to send a field team would have offered novel additional objections. Over the 15 months of the study period, the committee talked with numerous USAID staff and implementers from a variety of areas and with backgrounds and experience with DG programming in a great many countries, and the set of objections that are taken up in the second part of this chapter dominated the responses of everyone with whom the committee spoke.

1 Key results of the field visits are discussed in this chapter and the next. Additional information can be found in Appendix E.

EMPLOYING RANDOMIZED IMPACT EVALUATIONS FOR USAID DG PROJECTS IN THE FIELD

Randomized evaluations are widely considered the best method for determining the causal effects of treatment in a broad range of areas, including public health, education, microfinance, and agriculture. As the Olken (2007) and Gugerty and Kremer (2006) studies described in Chapter 5 show, such methodologies are also beginning to be applied to evaluate the effectiveness of projects in the area of democratic governance. Nonetheless, the committee learned from its consultations with USAID staff and implementers that there is a general feeling that randomized evaluation was not an option for many of the projects that USAID carries out. Even in those cases where randomized evaluations might be possible theoretically, the assumption among USAID staff seemed to be that such approaches would be too difficult to implement in practice, owing to an inability to select treatment groups by lottery, the difficulty of preserving a control group, the difficulty of identifying good indicators for key outcomes, the high cost of the extensive data collection that would be required, or the tension between the flexibility staff believe they need to respond to opportunities and challenges as projects go forward and the need to minimize changes to ensure an effective evaluation.

These are legitimate concerns. To address them, this section discusses how randomized evaluations could be used in current USAID projects, drawing on examples gleaned from the field visits and consultations with practitioners.
We begin with a decentralization project in Peru that has already been implemented, outlining how the project monitoring strategy that was employed could have been adjusted to accommodate a randomized component that would have made it an impact evaluation design and showing how such an adjustment would have permitted the mission to generate much stronger inferences about project impact.2 Then a planned multipronged effort to support multiparty democracy in Uganda is described, emphasizing how pieces of the existing project might be amenable to randomized evaluation and showing how adopting such an evaluation method would improve USAID's ability to assess the project's effects.3 The committee's goal is to use these projects as illustrations of the potential payoffs that could accrue from improved evaluation strategies.

2 The discussion here of decentralization in Peru is drawn from the report of a field team led by Thad Dunning, assistant professor of political science, Yale University.
3 These designs were developed by a team led by Devra Moehler, assistant professor of political science, Cornell University.

Decentralization in Peru

USAID/Peru launched a project in 2002 to support national decentralization policies initiated by the Peruvian government. Over a five-year period, the Pro-Decentralization (PRODES) program was intended to:

• support the implementation of mechanisms for citizen participation with subnational governments (such as "participatory budgeting"),
• strengthen the management skills of subnational governments in selected regions of Peru, and
• increase the capacity of nongovernmental organizations in these same regions to interact with their local government (USAID/Peru 2002).

With the exception of some activities relating to national-level policies, all interventions under the project took place in seven selected subnational regions (also called departments): Ayacucho, Cusco, Huanuco, Junin, Pasco, San Martin, and Ucayali.4 These seven regions contain 61 provinces, which in turn contain 536 districts.5 Workshops on participatory budgeting, training of civil society organizations (CSOs), and other interventions took place at the regional, provincial, and district levels.6

The ultimate goal of the project was to promote "increased responsiveness of subnational elected governments to citizens at the local level in selected regions" (USAID/Peru 2002). This outcome is potentially measurable on different units of observation. For example, government capacity and responsiveness could be measured at the district or provincial level (through expert appraisals or other means), while citizens' perceptions of government responsiveness may be measured at the individual level (through surveys).

The PRODES decentralization project represented an ambitious effort. By all accounts it was a well-executed program; the performance of the local contractor received high marks from mission staff at USAID/Peru.
The questions of interest here do not relate to the performance of the contractor in relation to project outputs or very proximate outcomes, which were the focus of the project monitoring plan used by the implementer.7 Instead, the question is how we could know whether such a project had impacts on targeted policy outcomes, such as the responsiveness of local governments to citizens' demands.

Since the project was not designed with impact evaluation, as defined here, in mind, it suffered from a number of serious deficiencies in that regard. The main deficiencies parallel the general points raised in Chapter 5: the absence of indicators for at least some of the most important policy outcomes, the absence of comparison units, and the absence of treatment randomization. Taken together, these shortcomings present almost insuperable obstacles to an impact evaluation. One important finding of the team was that with foresight some of these deficiencies might have been fairly easily corrected and for not much additional cost. Indeed, some of the changes outlined below would likely yield cost savings.

As mentioned, the decentralization project sought to foster citizen participation, transparency, and accountability at the local level, with the ultimate objective of promoting "increased responsiveness of subnational elected governments to citizens." Though some of these outcomes are potentially, albeit imperfectly, measurable, indicators gathered at the local level related almost exclusively to outputs rather than outcomes.

4 The regions were nonrandomly selected for programs because they share high poverty rates, significant indigenous populations, and narcotics-related activities and because a number of the departments were strongholds for the Shining Path movement in the 1980s.
5 Peru has 24 departments plus one "constitutional province"; the 24 departments in turn comprise 194 provinces and 1,832 districts. Provinces and districts are often both called "municipalities" in Peru and both have mayors. Sometimes two or more districts combine to form a city, however.
6 Relevant subnational authorities include members of regional councils, provincial mayors, and mayors of districts.
For example, the indicators gathered included the percentage of municipalities that signed "participation agreements" with local contractors; the percentage of participating municipalities from which at least two individuals (local authorities or representatives of CSOs) attended a training course in participatory planning and budgeting; the percentage of targeted provincial governments in which at least two CSOs exercised regular oversight of municipal government operations, as measured by participation in at least two public forums during the year; and the percentage of participating local governments that establish technical teams to assist with decentralization efforts (PRODES PMP 2007).

Such indicators are designed to monitor the implementer's performance and perhaps measure very proximate outcomes, such as formal participation in the decentralization process. However, they do little to help discern the impact of interventions on the main outcomes that the project was designed to affect. For purposes of evaluating impact—and even for improved project monitoring—we want to know not how many training courses there were or how many officials attended them but rather whether they led subnational elected governments to be more responsive to their citizens.8

Several indicators gathered through surveys did tap citizens' perceptions of the responsiveness of subnational elected governments in targeted municipalities. Surveys taken in 2003, 2005, and 2006 asked respondents: Are the services provided by the (district, provincial, or regional) government very good, good, average, bad, or very bad? Another question, administered only in the 2003 and 2005 surveys,9 asked: Do you think that the (district, provincial, or regional) government is responsive to what the people want almost always, on the majority of occasions, from time to time, almost never, or never? (PRODES PMP 2006, 2007).

In principle, such survey questions may provide useful proxy measures of the outcomes of interest. In practice, however, a number of issues limited the usefulness of these measures. First, only the first question was asked in a comparable manner across all three surveys, allowing for a very limited time series on the outcome of interest. Second, and perhaps more important, as discussed further below, was the failure to gather measures on control units in all but the 2006 survey.

Finally, a "baseline" assessment of municipal capacity was prepared at the start of the program by a local institution. All district and provincial municipalities in the seven selected regions were coded along several dimensions, including extent of socioeconomic needs and management capacities of district and provincial governments (GRADE 2003). Poverty rates and related indicators played a preponderant role in the local institution's calculations, which may have limited the usefulness of the index for assessing changes in subnational government capacity or responsiveness. In theory, however, repeated assessments of this kind could have provided useful data on municipal capacity, which is an outcome of interest under the decentralization project. As far as the team could determine, the assessment was not repeated.

USAID/Peru's implementer was tasked with carrying out the decentralization project in all 536 districts of the seven selected regions. Once the rollout of interventions in all municipalities had been completed, no untreated municipalities remained available in the selected regions. The absence of appropriate control units (untreated municipalities) is perhaps the biggest problem for effective evaluation of the decentralization project. In addition, since rollout was completed by the second year of the program, there was little opportunity to compare outcomes in treated and untreated units in the seven regions.

7 A description of current USAID project monitoring can be found in Chapter 2.
8 The USAID/Peru team and local contractors were clearly aware of the distinction between measures of contractor performance and measures useful for assessing impact; this distinction is made in some of the relevant program monitoring plans (e.g., PRODES PMP 2006). However, most of the impact measures appear to be fairly proximate outcome measures related to the process of supporting decentralization.
9 The 2003 and 2005 questions were administered as part of the Democratic Indicators Monitoring Survey, whereas for 2006, data came from the Latin American Public Opinion Project.

In principle, comparisons could be made across treated municipalities in the seven selected regions and untreated municipalities outside these regions. Since the seven regions were nonrandomly selected on the basis of characteristics that almost surely covary with municipal capacity and subnational government responsiveness (e.g., high poverty rates, narcotics-related activities, past presence of the Shining Path), however, inferences drawn from such comparisons would be problematic, although not completely uninformative. In practice, however, the data do not exist for such comparisons because virtually no data were gathered on control units. The exception is the 2006 commissioned survey taken as part of the Latin American Public Opinion Project (LAPOP), which administered a questionnaire to a nationwide probability sample of adults, including an oversample of residents in the seven regions in which USAID works (Carrión et al. 2007).10 This survey includes several questions that would be useful measures of the outcome variables (though only one question is comparable to questions asked in the earlier non-LAPOP surveys taken in treated municipalities in 2003 and 2005).11

The 2006 LAPOP national survey, had it been carried out beginning in 2003, could have established a national baseline against which the selected regions could have been measured before the program began.12 The project implementers would then have known, for example, whether, as was hypothesized, satisfaction with local government, participation in local government, corruption in local government, and so forth were more problematic in the targeted regions than in the rest of the country. Since the regions selected were poorer and more rural than the nation as a whole, covariate controls could have been introduced in an analysis-of-variance design to adjust statistically for these differences, making the targeted regions and the national comparison group look more alike.
Then, in each subsequent round of surveys, comparisons could have been made between the nation and the targeted regions, thereby making it possible to observe the rate of change. Had satisfaction with local government nationwide remained unchanged while the targeted areas showed increased satisfaction, project impact could have been established with a reasonable degree of confidence. Indeed, if national satisfaction had declined over the life of the project while the target areas held steady, this, too, could have been an indicator of project success. It is important to stress that since the mission was already regularly conducting national samples of public opinion, there would have been no added data-gathering costs in the hypothetical strategy just proposed. The only cost would have been the minimal expense of analyzing the data.

Outside the LAPOP 2006 survey, no data were gathered on untreated municipalities. The universe of the 2003 and 2005 surveys was limited to residents of the seven regions (and thus only to residents of treated municipalities). Evaluations of municipal capacity (e.g., the GRADE study mentioned above) were conducted only on districts and provinces in the seven selected regions.

Although some data were collected in control municipalities outside the seven regions, the absence of a control group within the regions has serious consequences for evaluation. As just one example, many municipalities in the seven regions had been ravaged by the conflict with the Shining Path during the 1980s and 1990s. Investment and population return have picked up in some areas during the past decade, especially the past five years; at least some of this upturn must be due to the end of the war and other factors.13 Improvements in measured municipal capacity or in citizens' perceptions of local government responsiveness during the life of the program may, therefore, not be readily attributable to USAID support for decentralization. If control municipalities had been selected from the outset at random and the treatment municipalities had outperformed the controls, we would have greater confidence that the project had a positive impact.

10 In addition to 1,500 respondents in the nationwide sample, an oversample of 2,100 (300 per region) was taken from the seven regions (Patricia Zárate, Instituto de Estudios Peruanos, personal communication, June 2007). Inter alia, this survey asked respondents their opinions of the quality of local government services, as noted above.
11 The LAPOP instruments include questions that are comparable across 20 surveyed countries; see Seligson (2006). The committee is grateful to Patricia Zárate, Instituto de Estudios Peruanos, for useful information.
12 Of course, the national sample would need to have had any sample segments lying in the project area removed from it in order for the national "control" group not to have been contaminated by the project inputs.
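The rate-of-change comparison described above amounts to a simple difference-in-differences calculation: the change in the targeted regions minus the change in the rest of the country. A minimal sketch follows; all numbers are invented for illustration and are not actual survey results.

```python
# Hypothetical difference-in-differences sketch; all data are illustrative.
from statistics import mean

def did_estimate(treated_pre, treated_post, national_pre, national_post):
    """Change in the targeted regions minus change in the national sample.
    A positive value is consistent with (but does not by itself prove)
    project impact."""
    return (mean(treated_post) - mean(treated_pre)) - (
        mean(national_post) - mean(national_pre)
    )

# Invented mean satisfaction scores (1 = very bad ... 5 = very good),
# one value per surveyed area; NOT actual PRODES or LAPOP results.
treated_2003 = [2.1, 2.4, 2.0, 2.3]   # targeted regions, baseline
treated_2006 = [2.9, 3.1, 2.8, 3.0]   # targeted regions, endline
nation_2003 = [2.8, 3.0, 2.9, 3.1]    # rest of country, baseline
nation_2006 = [2.9, 3.1, 2.8, 3.2]    # rest of country, endline

effect = did_estimate(treated_2003, treated_2006, nation_2003, nation_2006)
print(round(effect, 2))  # 0.7
```

In this invented example, satisfaction rose by 0.75 points in the targeted regions but only 0.05 points nationally, so the estimate attributes roughly 0.7 points of the change to the project, under the (strong) assumption that the targeted regions would otherwise have tracked the national trend.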
In sum, as discussed further below, if the project had been designed to permit rigorous impact evaluation rather than monitoring, a plan for gathering data on control units would have been created as part of the initial project design. Ideally, one would have compared treated and untreated municipalities inside the seven regions. In the absence of untreated municipalities inside the regions, data could have been gathered on appropriately selected municipalities outside the region.14 Surveys should have included residents of untreated municipalities, and evaluations of municipal capacity (such as the GRADE study) should have included pre- and postmeasures on municipalities with which USAID/Peru's contractor was not assigned to work.

13 Interviews, Ayacucho, June 27, 2007.
14 However, as discussed below, without random assignment, data on controls may also not help with the inferential issues mentioned in the previous paragraph.

An Alternative Evaluation Design

It is possible, looking backward, to describe an ideal randomized impact evaluation design for the decentralization project that could have been implemented in 2002. Assume that the decision to implement the decentralization project in the seven nonrandomly chosen regions was not negotiable; inferences about the effect of the intervention would then be made to the districts and provinces that comprise these regions.

The simplest design would involve randomization of treatment at the district level. Districts in the treatment group would be invited to receive the full bundle of interventions associated with the decentralization project (e.g., training in participatory budgeting, assistance for civil society groups); control districts would receive no interventions. There are two disadvantages to randomizing at the district level, however. One is that some of the relevant interventions in fact take place at the provincial level.15 Another is that district mayors and other actors may more easily become aware of treatments in neighboring districts. For both of these reasons it would be useful to randomize instead at the provincial level. Then all districts in a province that is randomly selected for treatment would be invited to receive the bundle of interventions.

Several different kinds of outcome measures could be gathered. Survey evidence on citizens' perceptions of local government responsiveness would be useful, as would information on participation in local government and evaluations of municipal governance capacity taken across all municipalities in the seven regions (both treated and untreated).
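The province-level lottery described above is simple to implement and audit. The sketch below uses a placeholder list of 20 provinces rather than the actual 61; the names and seed are illustrative assumptions, not part of any actual design.

```python
# Sketch of a province-level randomization lottery; province names and
# count are placeholders, not the actual Peruvian provinces.
import random

def assign_provinces(provinces, seed=0):
    """Randomly split provinces into equal-sized treatment and control arms.
    All districts in a treated province would be invited to receive the
    full bundle of interventions; control provinces would receive none."""
    rng = random.Random(seed)   # fixed seed makes the lottery reproducible
    shuffled = provinces[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return sorted(shuffled[:half]), sorted(shuffled[half:])

provinces = [f"province_{i:02d}" for i in range(1, 21)]  # placeholder list
treatment, control = assign_provinces(provinces)
print(len(treatment), len(control))  # 10 10
```

Recording the seed (or drawing lots publicly) lets outside observers verify that assignment really was random, which matters when authorities in control provinces question the selection.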
A difference in average outcomes across groups at the end of the project—for example, differences in the percentage of residents who say government services are "good" or "very good," or the percentage who say the government responds "almost always" or "on the majority of occasions" to what the people want—could then be reliably attributed to the effect of the bundle of interventions, if the difference is bigger than might reasonably arise by chance.

One feature of this design that may be perceived as disadvantageous is the fact that treated municipalities are subject to a bundle of interventions. Thus, if a difference is observed across treated and untreated groups, it may not be known which particular intervention was responsible (or most responsible) for the difference: Did training in participatory budgeting matter most? Assistance to CSOs? Or some other aspect of the bundle of interventions? This problem arises as well in some medical trials and other experiments involving complex treatments, where it may not be clear exactly what aspect of treatment is responsible for differences in average outcomes across treatment and control groups. Despite this drawback, it seems preferable to design an evaluation plan that would allow USAID to know with some confidence whether a project it financed made any difference. Bundling the interventions may provide the best chance to estimate a causal effect of treatment. Once this question is answered, one might then want to ask what aspect of the bundle of interventions made a difference, using further experimental designs. However, another possibility discussed below is to implement a more complex design in which different municipalities would be randomized to different bundles of interventions.

USAID/Peru is preparing to roll out a second five-year phase of the decentralization project, possibly again in the seven regions in which it typically works. At this point, all municipalities in the seven regions were already treated (or at least targeted for treatment) in the first phase. This may raise some special considerations for the second-phase design. The committee's understanding is that there are several possibilities for the actual implementation of the second phase of the project; which option is chosen will depend on the available budget and other factors. One is that all 536 municipalities are again targeted for treatment. As in the first-phase design, this would not allow the possibility of partitioning municipalities in the seven regions into a treatment group and controls.

In this case the best option for an experimental design may be to randomly assign different treatments—bundles of interventions—to different municipalities. While such an approach would not allow comparison of treated and untreated cases, it would allow us to assess the relative effects of different bundles of interventions.

15 Some interventions also occurred at the regional level, particularly toward the end of the project, yet these interventions constitute a relatively minor part of the project.
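One hypothetical way to carry out such a multi-arm randomization is sketched below. The bundle names are illustrative only, not the project's actual intervention menu, and the balanced round-robin assignment is one of several reasonable schemes.

```python
# Sketch of randomizing different intervention bundles across municipalities
# when no pure control group is possible; bundle names are illustrative.
import random

BUNDLES = ["participatory_budgeting", "cso_training", "both"]

def assign_bundles(municipalities, bundles, seed=0):
    """Assign each municipality exactly one bundle, with arm sizes as
    balanced as possible (round-robin over a random ordering)."""
    rng = random.Random(seed)
    shuffled = municipalities[:]
    rng.shuffle(shuffled)
    return {m: bundles[i % len(bundles)] for i, m in enumerate(shuffled)}

munis = [f"district_{i:03d}" for i in range(536)]  # 536 districts, as in the text
assignment = assign_bundles(munis, BUNDLES)
counts = {b: sum(1 for v in assignment.values() if v == b) for b in BUNDLES}
print(counts)
```

Comparing average outcomes across the arms would then estimate the relative effect of one bundle against another, though not the effect of any bundle against no intervention at all.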
This may be quite useful, particularly for assessing the question raised above about which aspect of a given bundle of interventions has the most impact on outcomes. Do workshops on participatory budgeting matter more than training CSOs? Randomly assigning workshops to some municipalities and training to others would allow us to find out.

A second possibility for the second phase of the project is to reduce the number of municipalities treated, for budgetary reasons. Suppose the number of municipalities were reduced by half. The best option in this case is probably to randomize the control municipalities out of treatment, leaving half of the universe assigned to treatment and the other half as the control. Those municipalities assigned to treatment would be offered the full menu of interventions in the decentralization program.

Of course, randomizing some municipalities out of treatment is sure to displease authorities in control municipalities as well as USAID officials who would want to choose municipalities where they believe they have the greatest chances for success. Yet if the budget only allows for

methods of working with selected local governments and civil society groups at the sub-county levels in the identified districts" (USAID/Uganda 2007a:16). Although the specific interventions were as yet undefined, the Request for Proposals suggested working with elected and appointed leaders, traditional leaders, women, youth, constituents, and CSOs at a subcounty level. Most likely, the program will consist of a bundle of interventions rather than a single activity.

The fact that USAID plans to work with a sample of subcounties (within 10 preselected districts) makes this activity an excellent candidate for randomized evaluation. The number of subcounties within the 10 districts will almost certainly be enough to provide for a large N randomized evaluation. Therefore, in planning interventions at the subcounty level, provision would be made for the random selection of treatment and control subcounties. One approach would be to randomly select half the subcounties within the 10 districts to be in the treatment group and receive the full bundle of interventions. The remainder of the subcounties would receive no interventions and thus serve as a control group. Alternatively, subcounties could be stratified along district boundaries or other criteria, and random selection could take place within strata to facilitate equivalence on important dimensions.

It is difficult to determine the most appropriate measurement tools without a better understanding of the exact interventions and the goals of the program. Regardless of the measurement approach, equivalent data would need to be collected in the subcounties in the control group as well as those in the treatment group. Ideally, baseline data would be collected before implementation of the program and then again during and after.
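Stratified random assignment of the kind described above can be sketched in a few lines. The example below assumes, purely for illustration, 10 districts with 8 subcounties each; half the subcounties within each district (the stratum) are randomized to treatment, which guarantees balance across districts.

```python
import random
from collections import defaultdict

def stratified_assignment(units, stratum_of, seed=7):
    """Randomly assign half the units within each stratum (e.g., each
    district) to treatment, so that treatment and control groups are
    balanced across strata by construction."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for u in units:
        by_stratum[stratum_of(u)].append(u)
    treatment, control = [], []
    for stratum in sorted(by_stratum):
        members = by_stratum[stratum]
        rng.shuffle(members)
        half = len(members) // 2
        treatment.extend(members[:half])
        control.extend(members[half:])
    return treatment, control

# Hypothetical sampling frame: 10 districts x 8 subcounties,
# identified as (district, subcounty) pairs.
subcounties = [(d, s) for d in range(10) for s in range(8)]
treatment, control = stratified_assignment(subcounties,
                                           stratum_of=lambda u: u[0])
```

Each district contributes exactly four subcounties to each arm, so district-level differences cannot confound the comparison.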
USAID could also investigate the possibility of contributing to ongoing data collection efforts by the government or other agencies (such as the yearly school census, the service delivery survey, the Afrobarometer public opinion survey, and public expenditure tracking surveys) in order to provide the necessary funds for oversampling in the 10 selected districts. In most cases, oversampling will be necessary to obtain data that are representative at the subcounty level.

Interparty Debates

In an effort to support multiparty democracy, USAID envisions interventions to "foster discussion and dialogue among the political parties so that difficult decisions can be achieved through compromise and negotiation before they result in conflict and stalemate" (USAID/Uganda 2007b:18). Building on successful interparty dialogues during the campaign before the 2006 presidential elections, USAID is considering sponsoring local-level political debates at the district level and below to engage

citizens in multiparty politics more effectively. In thinking about how to evaluate such activities, it is natural to ask: How does exposure to interparty debates impact citizen knowledge and attitudes about politics, voter turnout, voting outcomes, and political conflict at the local level?

Randomized evaluation offers a powerful tool for assessing the impact of interparty dialogues. Five voting precincts could be randomly selected to be in the treatment group for each of 14 different parliamentary constituencies. Remaining precincts in the 14 constituencies would make up the control group. In each of the 70 treatment precincts, interparty debates would be held between candidates for Parliament in advance of the next election. Specifically, a given group of candidates vying for a single parliamentary seat would participate in interparty debates in five different precincts within their own constituency. This would take place across 14 different groups of candidates in 14 different constituencies.

Many outcomes of interest are already collected by the electoral commission—voter registration, voter turnout, and the percent vote for each candidate. If interparty candidate debates help mobilize candidates, there should be higher registration and turnout rates in treatment precincts. One might imagine also that debates inform citizens about lesser-known candidates and thus increase the vote for nonincumbent candidates or parties. Therefore, if debates create a more informed citizenry, there should be a smaller share of the vote for incumbents in treatment precincts. If, instead, debates remind voters of the greater experience and access to largess possessed by the incumbent, the opposite effect would be evident.
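Because precincts differ in baseline turnout for many reasons unrelated to the debates, comparing the change since the last election in each group, rather than the post-election levels alone, can sharpen the estimate. A minimal sketch of such a difference-in-differences calculation follows; the turnout percentages are hypothetical.

```python
def diff_in_diff(treat_pre, treat_post, ctrl_pre, ctrl_post):
    """Change in the treatment group minus the change in the control
    group; trends common to both groups cancel out."""
    return (treat_post - treat_pre) - (ctrl_post - ctrl_pre)

# Hypothetical average turnout (percent) at the last election and the
# election following the debates.
effect = diff_in_diff(treat_pre=55.0, treat_post=63.0,
                      ctrl_pre=54.0, ctrl_post=58.0)
# (63 - 55) - (58 - 54): a 4-percentage-point effect attributable to
# the debates, under the assumption of common trends across groups.
```

The subtraction of the control group's change is what nets out a general rise in turnout that would have occurred with or without the debates.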
To gain greater power, a difference-in-differences estimation strategy21 could be used to evaluate changes from the last election in turnout and vote outcomes (assuming that the boundaries of the voting precincts are relatively stable since the last election and polling-station-level data are available for the last election). An analysis of the distance of control precincts from treatment precincts can also be performed to account for the fact that citizens in neighboring constituencies in the control group may attend or learn about debates in the treatment precincts.

To assess the impact of interparty debates on local conflict, one could also compare measures of election-day violence and intimidation gathered by DEMGroup, party observers, or outside monitors.

If resources were available to conduct surveys in treatment and control precincts, the evaluation would provide an even richer perspective on citizen knowledge, attitudes, political tolerance, and behaviors, enabling a better understanding of the causal pathways linking debates with registration, turnout, and vote choice. Ideally, pre- and posttreatment panel surveys would be carried out in treatment and control sites. Of course,

21 See Chapter 5 for a description of this evaluation design.

care must be taken to ensure that the population surveyed in the treatment sites is comparable to those surveyed in the control sites. For example, it would be misleading to survey only those individuals who attended the debates in the treatment sites but to survey a random sample of individuals in the control group (including those who would have attended if the debate were held in their area and those who would not have). A random sample of all adult citizens in both treatment and control groups would be more informative.

While the field team in Peru described how a past project might have been designed in a way that permitted rigorous evaluation, the Uganda team focused on a multifaceted set of projects that were just getting started. Working with mission staff, the committee's experts identified a series of planned interventions, each of which could be assessed using tools of randomized evaluation. Although these evaluation models do not cover every planned intervention currently under consideration by the Uganda mission, if implemented, they would provide substantial new evidence about the efficacy of USAID DG programming in Uganda.

CHALLENGES IN APPLYING RANDOMIZED EVALUATION TO DG PROGRAMS

The evaluation designs described above are the basis for the unanimous conclusion of the field teams that randomized evaluations, apart from being valuable where they can be successfully applied, are also feasible designs for measuring the impact of (at least some) ongoing USAID DG projects. Yet demonstrating the feasibility of designing randomized evaluations that do not require significant modifications of "normal" DG projects does not imply that adopting them will not involve at least some trade-offs. Indeed, USAID staff and implementers in all three countries visited raised objections and concerns about some of the problems that randomized evaluations might pose.
While several of these problems do, in fact, constitute real obstacles to program implementation or evaluation, the field teams concluded that alternatives exist in many cases that could help partially or wholly address the concerns that were raised. This section discusses these problems and how randomized evaluations could be designed to minimize them.22 Two important issues that are deferred and discussed separately—the former in the next chapter and the latter in Chapter 9—are the questions of what to do with projects that treat too few units to be suitable for randomized evaluation and problems arising from

22 See Savedoff et al. (2006) for another discussion of objections to rigorous evaluations and ways they can be overcome.

the incentives (or disincentives) that DG staff and implementers have to conduct impact evaluations and their current capabilities to do so.

Randomly selecting units for treatment is simply not workable. Adopting the principle of random assignment runs the risk that certain units that project designers would very much like to include in the treatment group will wind up being excluded from the program. For some USAID staff and implementers with whom the committee spoke, this was a major reason to resist adoption of randomized evaluations. It was pointed out, for example, that in many situations USAID and its implementers can only work with local authorities that accept their help. Moreover, it was suggested that units (municipalities, ministries, groups) that lacked the "political will" to work with USAID to fully implement the programs in question would not be likely to achieve successful outcomes and thus do not merit an investment of resources. It was also suggested that units with exemplary past performance sometimes appeared to be such sure bets for program success that excluding them from participation in the new project appeared wasteful.

These are reasonable objections; however, accepting their merit need not imply jettisoning a randomized design. One option that satisfies the need for randomized selection of treatment units while also recognizing that rolling out a program in some units may not be feasible would be to select the set of units that are eligible for treatment on the basis of political will and other criteria that USAID believes maximize the chances for success and then to assign units randomly to treatment and control groups within this group of eligible units. This approach is also useful for situations where USAID seeks to limit programs to needy or conflict-affected areas, as long as there are more units than USAID can possibly treat.
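The two-step approach just described, screening units on substantive criteria first and then randomizing only within the eligible pool, is easy to operationalize. In the sketch below, the "political will" scores, the eligibility cutoff, and the unit names are all hypothetical.

```python
import random

def randomize_within_eligible(units, is_eligible, seed=11):
    """Screen units on nonrandom criteria first, then randomly split
    the eligible pool into treatment and control halves. Inference is
    valid within the eligible pool, since assignment there is random."""
    rng = random.Random(seed)
    eligible = [u for u in units if is_eligible(u)]
    rng.shuffle(eligible)
    half = len(eligible) // 2
    return eligible[:half], eligible[half:]

# Hypothetical: 60 municipalities scored 0-100 for "political will";
# only those scoring 50 or above are considered workable partners.
municipalities = [{"name": f"muni_{i:02d}", "will": (i * 37) % 101}
                  for i in range(60)]
treatment, control = randomize_within_eligible(
    municipalities, is_eligible=lambda m: m["will"] >= 50)
```

The resulting estimate applies to the kind of unit USAID would actually work with, which is arguably the population of interest anyway.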
Another option, suitable for situations where, for political or other reasons, allocating treatment to one or several units may be nonnegotiable (i.e., the consensus among project designers is that a particular unit or units simply must be included in the treatment group), is to go ahead with random selection of units for treatment but leave aside a certain percentage of the project budget (e.g., 10 to 15 percent) to pay for the implementation of program activities in units that were not selected but that organizers feel must be included. In such a case the evaluation would be based on a comparison of the regular treated group (not including the added units) with the control group. Of course, one can always look as well at outcomes in the non-randomly selected—the "must have"—units. Yet comparing outcomes in such units to nontreated units would be less informative about the causal impact of the USAID intervention than comparing outcomes across the units that were randomly assigned to the treatment group and the control group.

It is unethical or impossible to preserve a control group. Is it ethical to deny treatment to control groups? This issue arises frequently in public health programs but may also be relevant in projects where, as with interventions in the area of DG, the assistance is welfare improving even if not, strictly speaking, life saving. As with public health studies, the standard defense applies: Without an experiment, how do we know whether or not the intervention helps? USAID intervenes to assist DG all over the world. As in the public health field, it behooves us to know with as much confidence as possible what works and what does not. Continuing to channel scarce resources to projects that, once properly evaluated, turn out to have no positive impact is wasteful, particularly when properly executed randomized evaluations could put USAID in a position to identify projects that do work and whose reach and impact could usefully be expanded with a shift in resources from those that have been found to be underperforming.

A second defense of randomized assignments against the criticism that some units will go untreated is that, in any project being implemented across a large number of potential units, there will virtually always be untreated units. In the context of a decentralization project involving dozens of municipalities, it is simply not feasible for USAID to work with all of them; in the context of a project designed to support CSO development, it is simply not possible for USAID to work with every group. Given the impossibility of treating every unit, the only question is how untreated units will be chosen. In many contexts it may be fairest, and most ethically defensible, to choose untreated units by lottery, as would be the case in a randomized evaluation. Finally, even if every unit is to be treated, it may be reasonable to delay treatment for a portion of the units by a randomized rollout.
In this case, while some units (chosen by lottery) will get assistance first, others will have a delay before they receive assistance. Yet for the group that faces delay, this may be more than compensated for by the possibility that the delayed group will either be spared an ineffective treatment or will receive improved assistance, since the initial phase of the rollout provides the basis for learning from a randomized impact study of the treatment's effects.

Isolating control from treatment groups is not feasible in practice. A third objection involves the great difficulty in preventing the effects of treated units from "spilling over" and affecting control units. For example, a project that provides support for CSOs to advocate improved service delivery may impact not only the area in which the CSOs are based but also neighboring areas (either because local governments fear similar mobilization and act to forestall it or because CSOs in neighboring areas become emboldened by the example of what their colleagues are doing

next door and step up their own advocacy). Another example of spillover is when grassroots party activities in one locale yield benefits in other places, either because party contacts extend across administrative boundaries or because changing attitudes are transmitted across familial and social networks. Whenever there are spillover effects (and there often are), the difference between the control and treatment groups is attenuated, and this will bias the evaluation toward a finding of no effect.

Sometimes, design modifications can help minimize the likelihood of spillover. For example, in the context of the Peruvian decentralization project discussed earlier, randomizing at the provincial level might decrease the probability that district mayors are aware of treatments administered to other units. In this case all municipalities in a province would be in either the treatment group or the control group, thereby minimizing the likelihood of spillover from municipality to municipality (except insofar as they happen to be located adjacent to a provincial boundary).

But while problematic for inference, spillover effects may be important to measure in their own right. In their study of deworming programs in Western Kenya, for example, Miguel and Kremer (2004) found that deworming interventions are not cost-effective unless the positive externalities of the program that spill over into neighboring untreated communities are accounted for. Taking advantage of the fact that the treatment is randomly assigned across space, they estimate the size of these spillover effects and then use the estimates to calculate the true effects of the deworming program, which they find to be positive once the spillover effects are accounted for. Their study underscores that not just minimizing but also measuring contamination must be a core aspect of any well-designed randomized evaluation.
A related problem is the possibility that donors from other countries might concentrate their programs in areas in which USAID is not undertaking program activities, thereby, as one program officer put it, "flooding the controls." This may happen intentionally, when donors coordinate and divide up areas of focus to avoid duplication of efforts. Or projects not intended to directly influence democracy, such as programs to create entrepreneurs or regional cooperative associations, may in fact help the spread of democracy in the area being observed. If this occurs, the other donors' interventions become a confounding factor associated with treatment, and this will almost certainly bias inferences about the effect of USAID interventions.23

One possible response to this issue is not to advertise the existence of

23 However, it might be pointed out that, if anything, this is likely to dilute the (it is hoped positive) effect of treatment. If other donors flood the controls and there is still a difference between groups, a causal effect of USAID's intervention can be inferred. (At least, the effect of USAID relative to other donors can be evaluated.)

control units. For example, in the context of a decentralization project it may be known that USAID is working in seven regions, but it need not be made publicly known which particular municipalities it is working with in each region. A second solution is to commit in advance to implement the project in all units (and to make this publicly known) but to roll it out gradually, using untreated units as a comparison group for treated units in the years before they are added to the intervention (as in the second design for the Peru decentralization program described earlier). Another option is to randomize different treatments across all municipalities. In other words, USAID would work with all municipalities in the seven regions (thereby leaving no municipalities to be flooded) but randomly assign different treatments to different municipalities (again, as discussed earlier for Peru). One final possibility is to engage other donors in conceptualizing the evaluation exercise. If multiple donors are implementing similar interventions, all would benefit from an impact evaluation of their projects. In such circumstances it may be possible to coordinate USAID's activities with theirs to preserve a control group.

It is hard to plan an evaluation (or stick to one) because mission objectives and programs change all the time. A common concern the field teams heard was that randomized evaluations are insufficiently flexible to be practical. As a political officer at the U.S. Embassy in Peru commented, the embassy is sometimes compelled to "put out fires." For example, in an experimental evaluation of the impact of municipal-level interventions in mining towns, the embassy might have to intervene if a conflict broke out in a community. This may or may not pose an issue for causal inference. Some "fires" may be independent of treatment assignment—that is, they may be equally likely to occur in treated units as in control units.
However, other "fires" may be products of the treatment. They may reflect, for example, the absence of a desired treatment among controls, which necessarily feel left out. This raises more serious issues. Unanticipated events that require additional interventions in either treatment or control communities must be recorded so that they can be taken into account in the final evaluation. Such events may make interpretation of the results more complicated, but the possibility that they might arise is not an argument to forego randomized evaluations per se.

In addition, missions may wish to adjust programming midstream, either by learning lessons from an early assessment of outcomes or by responding to new developments on the ground. Sometimes this is quite consistent with the purposes of a good evaluation. For example, if there is powerful evidence partway through that a project is working, USAID may wish to extend its reach into communities that were previously in the control group (medical trials are often abandoned early if there is robust

evidence of the benefits, or dangerous consequences, of a treatment). The phrase "if there is powerful evidence" is crucial here. Since the whole purpose of the randomized evaluation is to generate evidence for a project's success or failure, there is no trade-off whatsoever in abandoning it or in tweaking it midstream, if "powerful evidence" for the project's efficacy has already emerged. A real trade-off presents itself only if the evidence for the project's success or failure is still tentative. In such a situation a judgment call would have to be made about the relative importance of confirming what the initial evidence seems to suggest (which would require not altering the design of the randomized evaluation) or moving ahead with the change in course (which might have the benefit of maximizing impact but risks acting on a hunch that may have been ill founded).

The more difficult issue is when, as frequently happens, unforeseen challenges arise in project implementation that USAID thinks require slight adjustments in the interventions or sometimes the replacement of implementers. Changing the treatment part of the way through the process is, of course, not ideal. As long as the adjustments are consistent across the treatment group, however, there is no threat to causal inference (although it should be kept in mind that the ultimate evaluation measures a more complicated treatment). Whatever the source of the midstream correction, responsible officials will need to remember that the benefits of continuing with the rigorous evaluation design accrue agency-wide and are not limited to the particular mission or project. So the advantages of a midcourse correction for a project or mission will need to be balanced against the potential loss of valuable evaluation information that could be usefully applied to programs in other countries.
Randomized evaluations are too complex; USAID does not have the expertise to design and oversee them. Staff both in the field and in Washington consistently raised the objection that USAID is not well equipped to design and implement, or even simply oversee, randomized evaluations. This is a valid concern. While the idea of randomized evaluation is intuitive and easy to understand, the design of high-quality randomized evaluations requires additional academic training, specialized expertise, and good instincts for research design. It is likely that many (or most) USAID DG staff do not have training in research methods and causal inference, thus making it difficult for them to evaluate the quality of proposed impact evaluations or to play a role in their design and implementation.

The committee wants to emphasize that the guidance provided in this report should not be seen as a "cookbook" of ready-made evaluation designs for DG officers. It would be a mistake for USAID to endorse the typology of evaluation designs outlined in Chapter 5 and then require DG

officers to put these new designs into practice without additional training or support. Because the issue of competence and capacity is so central to the prospect of improving evaluation in USAID DG programs, Chapter 9 is dedicated to providing recommendations about how USAID DG could make the necessary investments and provide appropriate incentives to encourage using impact evaluations of its projects where appropriate and feasible.

It will cost too much to conduct randomized evaluations. Perhaps the most important objection the committee encountered in the field is that randomized evaluation will cost too much. In part, this is a question of USAID's priorities. If the agency is committed to knowing whether important projects achieve an impact, it will need to commit the necessary resources to the task. But aside from whether the agency commits to higher quality evaluations, it is legitimate to ask how much more randomized evaluation will cost than the procedures currently employed.

The committee's field teams were tasked with some detective work in an effort to answer this question. As discussed in Chapter 8, the committee discovered that USAID could not provide concrete information about how much it spends on monitoring and evaluation (M&E) every year, even for a subsample of DG programs. The committee therefore encouraged the field teams to explore the cost of current approaches by reviewing project documents and through discussions with mission staff. They, too, encountered insurmountable obstacles; project documents almost never provided line items for M&E, and what was reported was not consistent from one project to another. Based on interviews with implementers, the field teams reported that nontrivial amounts of time were dedicated to the collection of output and outcome indicators and the monitoring of performance, but no team could arrive at any hard numbers related to current expenditures.
The committee thus cannot answer the question of how much more it will cost to introduce baseline measures, data collection for comparison groups, or random assignment, relative to current expenditures on M&E. At best it can be said that in a number of cases the field teams examined, substantial improvements in all of these areas could be obtained for little or no additional cost, but in other cases the costs could be substantial. Much depends on whether data are being collected from third parties or local governments versus being generated by surveys or other primary data collection by implementers, on whether surveys are already being used for the projects or would need to be developed specifically for the project in question, and on the specific outcomes that have to be measured in the treatment and control groups. As noted, in some cases—such as reducing the initial number of units treated in order to preserve a control group—an impact evaluation could

actually save money compared to providing all groups with assistance immediately, before the effects of the project have been tested.

But how much will a randomized evaluation cost? Answering this question requires two different calculations. The first is the straightforward calculation of how much more it will cost to collect the necessary data. This will depend on the number of control and treatment units required for a useful random assignment; the more subtle the expected effects, the larger the number of units that will be required, with a corresponding increase in the cost of data collection. The factor to keep in mind is that, even if data collection is more costly in a randomized evaluation design, the potential benefit is that it would put USAID in a position to assess the impact of the project with much more confidence and to detect subtle improvements that might not be visible without a randomized design.

The second, much trickier, calculation lies in assessing (1) the cost of selecting units at random, which may entail not implementing project activities in units where USAID might have reason to believe that the project will have a large positive impact, and/or (2) going ahead with the implementation of project activities in units where USAID has reason to believe that the project will fail. Here the cost is less a direct expense than an opportunity cost. Again, these costs must be weighed against the potential benefit of being able to conclude whether or not the project worked. Note, however, that the latter type of cost (of directing program funds either to places where staff are convinced the project will not work or away from places where staff are convinced that it will) will be greater the more confident staff members are that an accurate prediction can be made about exactly where a project will be successful and where it will not.
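How quickly the required number of units grows as the expected effect shrinks can be illustrated with a standard sample-size approximation for comparing two proportions. The significance level, power, and illustrative outcome rates below are assumptions for the sake of example, not figures from the committee's work.

```python
from statistics import NormalDist

def n_per_arm(p_control, p_treat, alpha=0.05, power=0.8):
    """Approximate number of units needed in each arm to detect a shift
    from p_control to p_treat in a binary outcome (two-sided z-test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96
    z_power = NormalDist().inv_cdf(power)           # about 0.84
    variance = p_control * (1 - p_control) + p_treat * (1 - p_treat)
    return (z_alpha + z_power) ** 2 * variance / (p_control - p_treat) ** 2

# Detecting a 10-point rise in favorable responses (50% -> 60%) takes
# roughly 385 units per arm; a subtler 5-point rise (50% -> 55%) takes
# roughly 1,562 -- about four times as many.
```

Halving the detectable effect roughly quadruples the required sample, which is the sense in which subtler expected effects drive up data-collection costs.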
If it is already known whether (or where) a project will work, then randomized evaluations are not needed to answer this question. The real peril lies in believing wrongly that the consequences of a program are, in fact, known and allocating resources on that basis when the hypotheses behind a program have not been tested by impact evaluations.

CONCLUSIONS

The committee's consultants believed they had demonstrated that at least some of the types of projects USAID is now undertaking could be subject to the most powerful impact evaluation designs—large N randomized evaluations—within the normal parameters of the project design. For a majority of committee members, this provided a "proof of concept" that the designs would also be feasible in the sense that they would work in practice as well as in theory. However, one committee member

with experience in actually managing DG programs remained skeptical as to whether the complexity and dynamic nature of DG programming would allow random assignment evaluation designs to be implemented successfully. The committee also notes that doing random assignment evaluations in the highly politicized field of democracy assistance will likely be controversial. It is, therefore, recommended in Chapter 9, as part of a broader effort to improve evaluations and learning regarding DG programs at USAID, that USAID begin with a limited but high-visibility initiative to provide a test of the feasibility and value of applying impact evaluation methods to a select number of its DG projects.

REFERENCES

Carrión, J.F., Zárate, P., and Seligson, M.A. 2007. The Political Culture of Democracy in Peru: 2006. Latin American Public Opinion Project (LAPOP), Vanderbilt University and Instituto de Estudios Peruanos, Lima, Peru. Available at http://sitemason.vanderbilt.edu/files/gcfLNu/Peru_English_DIMS%000with0corrections0.pdf. Accessed on April 26, 2008.

Dehn, J., and Svensson, J. 2003. Survey Tools for Assessing Performance in Service Delivery. Working Paper. Development Research Group, The World Bank, Washington, DC.

GRADE. 2003. Grupo de Análisis para el Desarrollo, Linea de Base Rapida: Gobiernos Subnacionales e Indicadores de Desarrollo. Lima, Peru: GRADE.

Gugerty, M.K., and Kremer, M. 2006. Outside Funding and the Dynamics of Participation in Community Associations. Background Papers. Washington, DC: World Bank. Available at http://siteresources.worldbank.org/INTPA/Resources/Training-Materials/OutsideFunding.pdf. Accessed on April 26, 2008.

Miguel, E., and Kremer, M. 2004. Worms: Identifying Impacts on Education and Health in the Presence of Treatment Externalities. Econometrica 72(1):159-217.

Olken, B.A. 2007. Monitoring Corruption: Evidence from a Field Experiment in Indonesia. Journal of Political Economy 115:200-249.

PRODES PMP. 2006.
Pro Decentralization Performance Monitoring Plan, 2003-2006. Lima, Peru: ARD, Inc.

PRODES PMP. 2007. Pro Decentralization Performance Monitoring Plan, Fifth Year Option, February 2007-February 2008. Lima, Peru: ARD, Inc.

Savedoff, W.D., Levine, R., and Birdsall, N. 2006. When Will We Ever Learn? Improving Lives Through Impact Evaluation. Washington, DC: Center for Global Development.

Seligson, M. 2006. The AmericasBarometer, 2006: Background to the Study. Available at http://sitemason.vanderbilt.edu/lapop/americasbarometer00eng. Accessed on February 23, 2008.

USAID/Peru. 2002. Request for Proposals (RFP) No. 527-P-02-019, Strengthening of The Decentralization Process and Selected Sub-National Governments in Peru ("the Pro-Decentralization Program"). Lima, Peru: USAID/Peru.

USAID/Uganda. 2007a. Request for Proposals (RFP): Strengthening Democratic Linkages in Uganda. Kampala, Uganda: USAID/Uganda.

USAID/Uganda. 2007b. Request for Proposals (RFP): Strengthening Multi-Party Democracy. Kampala, Uganda: USAID/Uganda.