7 Additional Impact Evaluation Designs and Essential Tools for Better Project Evaluations

Introduction

The previous chapter explored whether randomized evaluations are more than a theoretically appealing methodology and could feasibly be designed for democracy and governance (DG) projects being implemented in the field by the U.S. Agency for International Development (USAID). This was done by describing a decentralization project in Peru and a series of democracy-strengthening activities in Uganda and by showing how randomized designs could be developed that would suit the implementation of these projects. Also addressed were some of the objections that the committee's field teams heard about the viability of adopting randomized evaluations more generally. While concerns about the impracticality of randomized evaluations must be taken seriously, in principle many of them could be dealt with through creative project design and/or greater flexibility in the selection of units for treatment or the timing of project rollout.

The committee recognizes, however, that randomized designs are not always possible and alternatives need to be considered. This may be because of the costs, complexity, timing, or other details of the DG project. Thus this chapter focuses on other methods of impact evaluation for those cases where randomization is not feasible. Examples are given of ways that USAID can develop sound impact evaluations simply by giving more attention to baseline, outcome, and comparison group measurements.

The chapter begins by addressing two questions regarding choices between the use of randomized designs or the other (comparison-based) impact
IMPROVING DEMOCRACY ASSISTANCE
evaluation designs described in Chapter 5. First, how many of USAID's current projects appear suitable for randomized impact designs? Second, when projects are not suitable for randomized evaluations, what options are available and how should the other methods described in Chapter 5 be chosen and applied?

How Often Are Randomized Evaluations Feasible?

To help answer this question, project staff collected information about the DG activities that the USAID mission in Uganda had undertaken in recent years (see Appendix E for a list of these projects as well as those in Albania and Peru). The projects in Uganda included efforts designed to provide support for the Ugandan Parliament, strengthen political pluralism and the electoral process, and promote political participation, a fairly typical roster of projects and one that parallels those implemented by USAID missions in many countries. A team member then divided these projects into 10 major activities and scored them for (1) amenability of each activity to randomized impact evaluation and (2) where randomized evaluation was not deemed possible, the benefits of adding other impact evaluation techniques (better baseline, outcome, or comparison group measures) to existing monitoring and evaluation (M&E) designs. In doing so, the committee recognizes that current USAID project monitoring plans are largely designed to track an implementer's progress in achieving agreed-upon outputs and outcomes. Our approach, therefore, is not to assess the quality of current monitoring plans but rather to assess and illustrate instances where additional information that could reveal the impact of DG projects is currently not being collected but could readily be acquired.

The first finding of the analysis was that all 10 of the activities examined used M&E plans that omitted collection of crucial information that would be needed if USAID sought to make impact evaluations of those activities.
The committee does not mean to criticize current M&E plans, which focus on acquiring important information for program management and resource allocation. The committee wants to draw attention to the marked difference between the content of the currently mandated and universal M&E components of most DG projects and the information that would need to be acquired to conduct a sound and credible impact evaluation of project effects. [Footnote: This section is based on the work of Mame-Fatou Diagne, University of California, Berkeley.] [Footnote: See Chapter 2 for a discussion of the difference between current USAID project M&E plans and impact evaluations.] The latter is a different task and, as noted,
may require different expertise in designs for project implementation and data collection than are currently part of USAID's routine activities. For example, unless collection of data from a nontreatment comparison group is an explicit part of the project design, there is no need to monitor whether contractors are collecting such data, and it will not normally be part of M&E activities. But without such data (including good baseline data) and a set of policy-relevant outcome measures, a project's actual effects, as opposed to the accomplishment of project tasks, such as the number of judges trained or improved municipal accounting systems established, cannot be determined.

On a scale of 1 to 10, with 10 being the most complete and credible plan for collecting data for impact evaluation, 9 of the 10 activities received a score of 1 and one received a score of 2. Again, this underlines the difference between the character of currently mandated M&E designs and impact evaluation designs. Nonetheless, on the positive side, 5 of the 10 activities were found to be, in principle, amenable to using randomized evaluation designs to determine project impacts; 4 other activities were found to be amenable to collection of baseline or nonrandom comparison group data that would significantly improve USAID's ability to know whether or not the activity in question had a positive impact. Seven of the 10 activities were found to be amenable to changes in how outcomes were measured that by themselves would markedly strengthen the monitoring they were already doing. The measurement changes alone were judged to be capable of bringing the average ability to provide inferences about project outcomes from 1 to 3 on the 10-point scale, while the shift to collecting data for impact evaluation designs was found to be capable of raising the average score for making sound inferences of project effects to over 7.
These are dramatic changes, and they underscore the team's conclusions about the large potential for USAID to more accurately and credibly assess the effects of its DG projects by adding efforts to collect impact evaluation data to its M&E designs, in at least this subset of its ongoing projects. [Footnote: The team's conversations with both the mission and the implementers in Uganda included a number of discussions about the problems of measurement for DG projects.]

While the scoring of these monitoring efforts is necessarily subjective and the ability to generalize from the efforts being implemented by a single mission is obviously limited, analysis of the Uganda mission's DG activities nonetheless offers some useful lessons. First, it suggests that a number of avenues to improve knowledge of project effects are possible, ranging from simple changes in how outcomes are measured to more substantial yet feasible changes in evaluation design. Second, it suggests an answer to the question posed earlier about the frequency with which
randomized evaluations are likely to be feasible. In Uganda at least, randomized evaluation was judged to be a feasible evaluation design strategy for 5 of the 10 activities being undertaken, and an additional 4 out of the 10 were judged to be amenable to nonrandomized yet systematic baseline/control group designs. In effect, then, 9 out of 10 programs in Uganda could have potentially benefited from the approaches presented in this report. This is a much larger share than is commonly assumed by the USAID staff with whom the committee consulted in the course of its investigations.

Critics are right that randomization is often not possible, however, and the team judged that for evaluating the impact of one-half of the activities examined, only other forms of evaluation designs (i.e., the large N nonrandom comparison or small N and single-case comparisons discussed in Chapter 5) would be feasible. Yet the team's finding that one-half of the DG activities it examined were amenable to randomized design is a higher proportion than most critics would expect. This would indicate that claims that randomized impact evaluations are only "rarely" or "hardly ever" possible may be too pessimistic.

Perhaps even more important, fully 9 out of 10 of these activities were found to be suitable for some form of the impact evaluation designs described in Chapter 5. Given that none of these activities in Uganda are currently collecting the kind of information needed for such impact evaluations, but 9 out of 10 could potentially do so, USAID appears to have a great deal of choice and flexibility in deciding how much, and whether, to increase the number of programs and the amount of information it collects to determine the effects of its DG activities.
As noted in Chapter 5, randomized evaluations require that there be a very large number of units across which the projects in question might, at least in principle, be implemented, as well as that program designers be able to choose these units randomly. Many high-priority USAID DG projects (for example, those that focus on strengthening individual ministries, professional associations, or institutions; those that support the creation of vital new legislation or constitutions; or those that build capacity to achieve national-level goals such as more effective election administration) do not meet these criteria. Such projects are critical to achieving the larger goal of improving democratic governance. Precisely because they are important, improving USAID's ability to evaluate the impact of the millions of dollars that it spends each year on implementing such projects should be accorded a high priority.

The next section addresses the question of what to do to carry out impact evaluations in situations of this kind. First, the general issue is discussed, and then the other evaluation techniques highlighted in Chapter 5 are discussed with specific examples from the field. Finally, the discussion
takes up the special, but common, case of how to design the most credible impact evaluations when there is only a single unit of analysis.

Designing Impact Evaluations When Randomization Is Not Possible

As stressed above, all sound and credible impact evaluation designs share three characteristics: (1) they collect reliable and valid measures of the outcome that the project is designed to affect, (2) they collect such outcome measures both before and after the project is implemented, and (3) they compare outcomes in both the units that are treated and an appropriately selected set of units that are not. As long as the number of units (N) to be treated is greater than one, all three of these attributes of impact evaluation are possible.

The major difference between randomized evaluations and other methodologies lies in the degree to which project designers need to concern themselves with the number and selection of control units. In a randomized evaluation the law of large numbers does the job of ensuring that the treatment and control groups will be (within the limits of statistical significance) identical across all the factors that might affect the project impacts being measured. When random assignment is not possible, project designers must pay close attention to the factors that might be associated with inclusion in the control or treatment groups (what social scientists refer to as "selection bias") and the effects of those factors on the differences found between the control and treatment units. These are the approaches referred to as large N and small N comparisons in Chapter 5.

Aside from the fact that the implementer does not select treated units at random, the examples described below are very similar to the randomized designs.
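The contrast between random assignment and purposive selection can be made concrete with a small simulation. The sketch below is only illustrative and is not drawn from the committee's work: the "poverty" covariate, the assignment rules, and the unit counts are all hypothetical. It compares the average value of a background factor in treated and control units under two rules, one that assigns treatment by coin flip and one that favors poorer units.

```python
import random
import statistics


def covariate_gap(n_units, assign, seed=0):
    """Return mean(poverty | treated) - mean(poverty | control).

    `poverty` is a hypothetical background factor drawn uniformly from [0, 1).
    `assign` is a rule that takes (poverty, rng) and returns True if treated.
    """
    rng = random.Random(seed)
    poverty = [rng.random() for _ in range(n_units)]
    treated_flags = [assign(p, rng) for p in poverty]
    treated = [p for p, flag in zip(poverty, treated_flags) if flag]
    control = [p for p, flag in zip(poverty, treated_flags) if not flag]
    return statistics.mean(treated) - statistics.mean(control)


def random_assignment(_poverty, rng):
    # Coin flip: assignment ignores the unit's characteristics.
    return rng.random() < 0.5


def targeted_assignment(poverty, rng):
    # Hypothetical purposive rule: poorer units are more likely to be chosen.
    return rng.random() < poverty


if __name__ == "__main__":
    for n in (50, 500, 50000):
        gap_random = covariate_gap(n, random_assignment)
        gap_targeted = covariate_gap(n, targeted_assignment)
        print(f"N={n}: random gap={gap_random:+.3f}, targeted gap={gap_targeted:+.3f}")
```

Under the coin-flip rule the gap shrinks toward zero as the number of units grows, which is the sense in which the law of large numbers balances the groups. Under the targeted rule the gap persists at any sample size, so any comparison of outcomes would need to account for it.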
In particular, they share the key characteristics that reliable and valid measures of project outcomes still must be collected both before and after project implementation and for treatment and comparison groups. As with randomized designs, the discussion proceeds by providing examples of best practices. All four examples highlight the importance of finding an appropriate way to identify a control group, while the latter two also emphasize creative ways to improve measurement.

National "Barometer" Surveys as a Means to Design Impact Evaluations for Localized USAID Project Interventions

For a variety of reasons, USAID often implements programs at a subnational level, applying its efforts in a selected set of municipalities or departments or regions. Often the selection of these regions is determined by programmatic considerations. For example, USAID might determine
that it wants to focus its resources on the poorest areas of the country or on areas that have suffered the most from civil conflicts or have been hit with natural disasters. In other cases USAID decides to focus on municipalities or regions that look the most promising for the success of a particular intervention. In still other cases, USAID engages with other donors to "divide up the pie" with, for example, the European Union agreeing to work in the north while USAID works in the south. Finally, there may be entirely idiosyncratic reasons for the choice of where to work (and where not to work) related to the preferences of individual host governments or implementers.

In each subnational project the principle of randomized selection is violated and the possible confounding effect of "selection bias" would be an important factor in designing an impact evaluation. The nonrandom selection may bias the impact so that, ceteris paribus, the results may be better than they would have been had randomization been used to select the treatment area, or they could be worse. It is impossible to know beforehand exactly what to expect. The point is that those who wish to study impact will worry that selection bias by itself, rather than the project, could be responsible for any measured "impact."

Consider a project carried out in an exceptionally poor area. One possible outcome is that the area is so poor, and conditions so grim, that short of extraordinary investment, citizens will not really notice a difference. Similarly, in a post-civil war setting, feelings of hatred and distrust may be so deeply ingrained that project investments will be ignored entirely. In these cases, even though the project may have been designed well, any impact is imperceptible.
On the other hand, in both cases, the very low starting point suggests (as noted in the Peru example below) that a "regression to the mean" is inevitable and therefore improvements will occur with or without the project intervention. In such a case a positive impact might mistakenly be attributed to the project when, in fact, the gains are occurring for reasons entirely unrelated to the inputs.

When randomization is not possible, but selection of multiple treatment and control areas is, conditions are ideal for the "second-best" method of large N nonrandomized designs. This sort of design is often referred to as "difference in difference" (DD; Bertrand et al. 2004). The objection to this approach, however, is that USAID would be spending its limited resources to study regions or groups in which it does not have projects and may not plan to have any. The committee believes that this entirely understandable (indeed compelling) reason alone constrains many DG programs and project implementers from considering a design that would be seen as "wasting" money and effort on studies in areas where USAID is not working.

The committee believes that USAID already possesses the ability to
overcome this problem of "wasting money" on seemingly irrelevant controls without significant additional investment of resources. The agency's Latin American and Caribbean bureau, for example, is already applying this methodology in some of its projects in a limited number of instances. The approach to reduce (but not eliminate) the risks of potentially misleading conclusions is to utilize the increasingly prevalent public opinion surveys being carried out in Latin America, Africa, and Asia, collectively known as the Barometer surveys. High-quality nationally representative surveys are regularly being carried out by consortia of universities and research institutions, many with the assistance of USAID but also with the support of other donors, such as the Inter-American Development Bank, the United Nations Development Program, the European Union, and local universities in the United States and abroad.

These surveys provide fairly precise and reliable estimates of the "state of democracy" at the grassroots level, by producing a wide variety of indicators. For example, the surveys reveal the frequency and nature of corruption, victimization, and the level of citizen participation in local government, civil society, and the judicial process. They also produce measures of satisfaction with institutions such as town councils, regional administrations, the national legislature, courts, and political parties and the willingness of citizens to support key democratic principles such as majority rule and tolerance for minority rights. These surveys also allow for disaggregation by factors such as gender, level of urbanization, region, and age cohort.

Given that investments are already being made in the Barometer surveys, they provide a "natural" and no-added-cost control group to studies of project impact.
They provide, in effect, a picture of the "state of the nation" against which special project areas can be measured. In other words, USAID would continue to gather baseline and follow-up surveys in its project towns, municipalities, or regions and thus concentrate its limited funds on collecting detailed impact data for the places or institutions in which it is carrying out its projects. It would not need to carry out interviews of control groups for which it does not have ongoing projects. The national-level control group, however, could be used to show differences between the nation and the project areas in terms of not only poverty, degree of urbanization, and so forth but also many of the project impact measurements that USAID requires to determine project success or failure. [Footnote: The committee believes, but was unable to document, that this method has been utilized in some other programs in Africa.] For example, if a project goal is to increase participation of rural women in local government, comparisons could be made between the baseline and the national averages, and then, following the DD logic,
comparisons would be made over time as the project impact is supposed to be occurring.

There are several recent examples to illustrate this. For many years USAID focused a considerable component of its DG projects in Guatemala on institution building at the national level, especially the legislature. Surveys carried out by the Latin American Public Opinion Project as part of its Americas Barometer studies found a deep distrust in those institutions, despite years of effort and investment. They also found special problems in the highland indigenous areas. In part as a result of those surveys, the DG programs in Guatemala began to shift, focusing more on citizens and less on institutions. As part of that strategy, every two years national samples were carried out, along with targeted special samples (what USAID calls "oversamples") in the highland indigenous municipalities. A finding from those surveys was the low level of political participation among some sectors of the population. In 2006 those surveys were used to focus the "get out the vote" campaign for the 2007 election, a critical one in which a former military officer was a leading candidate.

In Ecuador a series of specialized samples have been drawn in specific municipalities, with the results being systematically compared to national samples, drawn every two years since 2001. CARE, in cooperation with the International Migration Organization, has been working in a series of municipalities along the border with Colombia, a region in which the possible spread of narco-guerrilla activities could have an adverse impact on Ecuador. Thus the municipalities were not selected at random, but national-level survey data have allowed for comparison of starting levels, so that those implementing the project would have far more than anecdotal information about the level of citizen participation in and satisfaction with local government.
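The difference-in-difference logic described above, with a national sample standing in as the comparison group, can be sketched in a few lines. This is a simplified illustration rather than the analysis used in any of the surveys mentioned. The proportions mirror the illustrative figures in this chapter's note on statistical analysis (10 percent attendance at baseline in both groups, rising to 15 percent in the project area and 12 percent in the comparison group); the sample sizes are hypothetical, and the standard error assumes four independent simple random samples, which real Barometer surveys with clustered designs would not satisfy exactly.

```python
import math


def diff_in_diff(p_treat_before, p_treat_after, p_comp_before, p_comp_after):
    """Change in the project area minus change in the comparison group."""
    return (p_treat_after - p_treat_before) - (p_comp_after - p_comp_before)


def dd_standard_error(proportions, sample_sizes):
    """Standard error of the DD estimate for four independent samples.

    Assumes simple random sampling; each proportion contributes p(1-p)/n.
    """
    return math.sqrt(sum(p * (1 - p) / n for p, n in zip(proportions, sample_sizes)))


# Order: project baseline, project follow-up, comparison baseline, comparison follow-up.
props = (0.10, 0.15, 0.10, 0.12)
dd = diff_in_diff(*props)

if __name__ == "__main__":
    for n in (500, 5000):  # hypothetical respondents per survey round and group
        se = dd_standard_error(props, (n,) * 4)
        z = dd / se
        verdict = "distinguishable from zero" if abs(z) > 1.96 else "not distinguishable from zero"
        print(f"n={n}: DD={dd:.3f}, SE={se:.4f}, z={z:.2f} ({verdict} at the 5% level)")
```

The estimated effect is the same 3 percentage points in both runs, but only the larger samples allow it to be distinguished from chance, which is why sample size has to be planned against the size of the change a project is expected to produce.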
The survey data also allow for comparisons over time to see if trends in the project areas are more favorable than in the nation as a whole. Similar efforts have taken and are taking place in Honduras, Nicaragua, Colombia, Peru, and Bolivia.

Surveys have also increasingly been used to measure the impact of anticorruption programs, in some cases by comparing "before" and "after" impacts on a specific sector (e.g., health in Colombia) and in other cases comparing the results for the nation as a whole before and after implementation of an anticorruption program (Seligson 2002, 2006; Seligson and Recanatini 2003). The most recent survey of citizen perceptions of and experience with corruption, supported by USAID/Albania, was released while the committee's team was in Albania (Institute for Development Research Alternatives 2007).

For this approach to be successful, national surveys, as well as specific surveys carried out in project areas, need to be at least minimally coordinated so that the questions asked in both are identical. It is well
known that small differences in question wording or scaling can substantially affect the pattern of responses. If, for example, local government participation is an impact objective of the mission, problems will arise if the national sample asks respondents whether they have attended a local government meeting and the project sample asks how many times in the past 12 months they attended a local government meeting.

There are two potential objections to this approach. The first is cost. Surveys are thought to be expensive, but often the costs appear to be larger than they really are. In many of the countries in which USAID has democratization programs, the cost of a well-administered survey can be quite reasonable.

[Footnote: Costs vary directly by hourly wages in any given country. In low-wage countries, surveys can be quite inexpensive. For example, surveys in many Latin American and African countries can be conducted for $15 to $25 per interview (sometimes less) as an all-inclusive cost (sample design, pretests, training, fieldwork, and data entry). For a typical sample of 1,200 respondents (which would provide sample confidence intervals of ±3.0 percent), total costs to obtain the data would be about $30,000. Of course, that is for one round of interviews; if the typical project involves a baseline survey followed by an end-of-project survey to measure impact, those costs would double.]

[Footnote: Gathering the data is one cost, but analysis is another. The cost of analysis depends entirely on the price of contracting with individuals qualified to analyze such data. At a minimum, such individuals should hold a master's degree in the social sciences, with several courses in statistics. Individuals with such qualifications are often available in target countries, and an extensive analysis of the data could be obtained in many for $20,000 or even less. Unfortunately, many of the studies the committee has seen conducted for USAID limit themselves to reporting percentages and summary statistics. Analysis of that type is rarely useful, since indices of variables normally need to be created, logistic and OLS regression techniques must be applied, and reporting of significance levels and confidence intervals is required. For example, if the consultant's report states that the baseline study finds 10 percent of respondents attending municipal meetings in both the control and experimental areas, and the end-of-project survey finds that the treatment area has risen to 15 percent but the control group has also risen to 12 percent, it would be important to know if the change in the treatment group is statistically significant and if the increase in the control group was also significant. Thus USAID needs to be certain it has hired qualified individuals and obtained an appropriate level of statistical analysis to make the analysis useful for determining project impact.]

A second objection readers may have to the DD approach is that the target (or project) areas are indeed different from the national samples in many of the ways mentioned above. Often they are poorer and more rural and therefore are expected not only to begin at levels below the nation as a whole but also to perhaps exhibit slower progress. One of the strengths of this design is that such differences can be detected and noted when the baseline survey data are collected. To correct for those differences, the survey analysis can then use an analysis-of-variance design, in which the national sample becomes merely one of the groups being compared to the various treatment regions or municipalities. Covariates
can and have been used to statistically remove the impact of the differences between the national sample and the treatment groups. Hence, if the targeted areas are, on average, poorer or exhibit lower average levels of education, those variables can be included as covariates to "remove" their impact, after which the nation and the treatment areas can be more effectively compared. There are certainly possible flaws in this sort of analysis; for example, if there are unmeasured differences that are not known and/or cannot be controlled for statistically, the findings could be deceptive. But when randomized assignment cannot be used, this method can provide a good alternative. Since in many cases missions will not be able to select their treatment areas randomly, the "national control" sample offers a reasonable way of measuring project impact.

Finally, it is important to add that survey samples should not be used when little is known about the expected project impact. Surveys are best used when researchers already have a good idea of how to measure the expected impact. For example, in the illustration mentioned above, it should be relatively easy to specify what increased participation means, by devising questions on frequency of attendance at town meetings, municipal council meetings, district meetings, and the like. But when a project involves less well-researched areas, focus groups should be the instrument of choice until researchers more fully understand what is going on. Focus groups can then lead to more systematic evaluation via surveys.

Strengthening Parties: An Example from Peru

Another example of an impact evaluation design when randomization is not possible comes from Peru, where one of USAID's programmatic goals is to strengthen political parties. An idea that has been considered by the Peru USAID mission that would serve this goal and reinforce the parallel goal of promoting decentralization is to provide assistance to
An idea that has been considered by the Peru USAID mission that would serve this goal and reinforce the parallel goal of promoting decentralization is to provide assistance to
[Footnote: Another factor to consider with respect to the use of surveys is the size and nature of the samples for both the treatment and the control groups. The key factor here is the change that the project is expected to make on the key variables being studied. For example, if, again, the goal of the project is to increase participation in local government, what is the target increase that has been set? If the increase is 3 percent, a sample of 500 respondents will be too small, since a sampling error of about ±4.5 percent would emerge from a sample of that size. This means that the project evaluation would be subject to a Type II error, in which the expected impact did indeed occur, but the sample size was too small to detect it. Ideally, the control group(s) should be of the same size as the treatment group in order to maintain similar confidence intervals for the measurement of project impact/nonimpact.]
[Footnote: This discussion is drawn from the report of a field team led by Thad Dunning, Yale University.]
the major national-level parties in opening or strengthening local offices. Because of the large number of municipalities in which such offices might, in principle, be opened or strengthened, such a program might seem like a good candidate for a randomized evaluation. To set up the ideal conditions for an impact evaluation, USAID or the local implementer would randomly select municipalities in which to establish or strengthen local parties from a set of acceptable municipalities. Local parties would have to accept that USAID or the contractor would select the municipalities.

However, when and where a political party chooses to open (or allocate resources to strengthen the operations of) a municipal office is purely the business of the political party. For USAID to make such decisions would be to go well beyond its mandate of supporting good governance more generally. From a project evaluation standpoint, however, the problem is that if the parties themselves choose where to open (or allocate resources to strengthen the operations of) local offices, the design would be nonrandom. If several years into the project USAID finds political parties to be stronger in the treatment municipalities, was this due to the project or to the fact that the parties selected those local branches that were already in the process of strengthening themselves? Unless the project also provided for some local branches that the national parties did not select for funding, which likely is not feasible, it would not be possible to answer this question.

Moreover, if outcomes are not tracked in municipalities in which USAID partners do not support local party offices (i.e., controls), any inferences may be especially misleading. Suppose measures of local party strength are taken today and again in five years and an increase is found. Is this due to the effect of party-strengthening activities supported by USAID?
Or is it due to some other factor, such as a change from an electoral system with preferential voting to closed party lists, which would tend to strengthen party discipline, including, perhaps, that of local parties? [Footnote: Such a change is currently being considered in Peru. In the current electoral system, there is proportional representation at the department level, and voters vote for party lists but can indicate which candidate on the list they prefer. According to a range of research on the topic, this can create incentives for candidates to cultivate personal reputations and also makes the party label less important to candidates. Under a closed-list system, voters simply vote for the party ticket, and party leaders may decide the order of candidates on the list. This may tend to increase party discipline and cohesion (as well as the internal power of party elites).] With a control group of municipalities, it could be tested whether they too had experienced a growth in party strength (in which case the cause was most likely the law, which affects all municipalities in the country, not the USAID program, which was present only in some). The point is that without data on any comparison group to provide controls,
188 IMPROVING DEMOCRACY ASSISTANCE it will be impossible to separate the effect of USAID local activities from the effect of the law. So at a minimum, collecting data in a set of control municipalities would be highly advantageous. Thus, even if USAID gives political parties full control over which municipalities they choose for party strengthening with USAID assistance, USAID would benefit from seeking a list of those municipalities and choosing to also gather data from a sample of municipalities not on the list, to serve as a (nonrandom) comparison group. When units cannot be randomly assigned to assistance or control groups, the challenge for an evaluator is to identify an appropriate con- trol groupâone that approximates what the treatment group would have looked like in the absence of the intervention. In this context this would mean identifying municipalities that the parties do not select that are in all other ways similar to the municipalities in which the parties elect to work. Statistical proceduresâin particular, propensity score matching estimatorsâhave been developed to assist in the process of carefully matching units to approximate a randomized design. Alternatively, evalu- ators can exploit the discontinuities that exist when treatment is assigned based on a unitâs value on a single continuous measure. For example, if parties elected to work in the top 20 percent of municipalities in terms of their base of support, a comparison could be constructed that exploited the fact that those just above the 20 percent threshold are quite similar to those just below. These procedures require high-quality data on the characteristics of units that were and were not selected, as well as an understanding of the factors that contributed to the selection process. But as discussed in Chapter 5, these approaches have already been employed with impres- sive results in other settings not too dissimilar from some DG activities. 
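The threshold comparison described above can be illustrated with a minimal Python sketch. It is not a full regression discontinuity estimator (which would typically fit separate trends on each side of the cutoff); it only compares mean outcomes for units just above and just below the selection threshold. The function name, the bandwidth parameter, and all data values are hypothetical and chosen purely for illustration.

```python
from statistics import mean


def threshold_comparison(units, cutoff, bandwidth):
    """Difference in mean outcomes just above versus just below a cutoff.

    units: iterable of (score, outcome) pairs, where units with
    score >= cutoff are assumed to have been selected for assistance.
    Only units within `bandwidth` of the cutoff are compared.
    """
    above = [y for score, y in units if cutoff <= score < cutoff + bandwidth]
    below = [y for score, y in units if cutoff - bandwidth <= score < cutoff]
    if not above or not below:
        raise ValueError("need units on both sides of the cutoff within the bandwidth")
    return mean(above) - mean(below)


# Hypothetical data: (percentile of party support, party-strength index).
# Municipalities at or above the 80th percentile (the top 20 percent)
# are assumed to have been chosen by the parties.
example_units = [
    (50, 5.0), (76, 10.0), (79, 12.0),   # not selected
    (81, 15.0), (83, 17.0), (95, 30.0),  # selected
]

# Compare only municipalities within 5 percentile points of the threshold.
estimate = threshold_comparison(example_units, cutoff=80, bandwidth=5)
print(estimate)
```

A narrow bandwidth keeps the two groups similar but leaves few units to compare; a wide one adds units at the cost of comparability. That trade-off is one reason these procedures depend on high-quality data for both selected and unselected units.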
The larger point is that creativity can help overcome some of the potential obstacles to stronger research designs. And as long as they include a control group and sound pre- and postmeasurements, even nonrandomized designs can provide the basis for credible impact evaluations; in principle they can offer considerably more information for assessing project effects than is usually obtained in current DG M&E activities.

Supporting an Inclusive Political System in Uganda

Another example is the project sponsored by USAID's Uganda mission to promote the development of an inclusive political system. A key objective of this effort is to empower women and other marginalized citizens to lobby district and political party leaders on issues of importance to them, such as activities for the disabled. To achieve this objective, small grants are to be provided to a small number of civil society organizations (CSOs) to allow them to carry out programs in this area. The objective is certainly worthy, but it is not amenable to randomized evaluation without a substantial increase in the number of funded CSOs (see Chapter 6). How, then, can it be determined whether the money spent on the small grant program is having the desired effect?

Footnote: This discussion and the following one draw on work by a team led by Devra Moehler, Cornell University.

The current M&E plan for the project involves a participatory evaluation, primarily an analysis of survey data on whether respondents thought the projects "were helpful or very helpful," supplemented by discussions with recipient organizations. A major limitation of this approach is the lack of a comparison group; data were collected only from groups or citizens who received USAID support (i.e., that were "treated") and no effort was made to collect additional data from groups or citizens who did not receive USAID support (i.e., that could serve as a "control"). Any changes identified in the data attributed to the project might just as easily have been caused by confounding trends that happened to be taking place at the same time and that affected all communities (the project was implemented during an election period, so the more general effects of electoral mobilization cannot be ruled out as an alternative explanation for the observed changes in lobbying activism). Even in a small N design, an impact evaluation design (as opposed to the current M&E approach) that tracks trends both before and after a program is implemented and explicitly identifies untreated units for which comparable outcomes could be measured would provide much greater confidence in any inferences about the project's actual effects.
If there are large amounts of data, the techniques described earlier (propensity score matching, regression discontinuity) can be employed. In this context, however, there is no substitute for careful, qualitatively matched comparisons. For example, if three districts were selected in which to implement the program, the evaluator would need to identify three additional districts that are similar on a set of variables believed to be associated with the targeted outcomes (e.g., income, government capacity, infrastructure). More qualitative approaches mirror the logic underlying the quantitative techniques: the goal is to identify a relevant counterfactual in order to distinguish the impact of the program from spatial or temporal trends that, while outside the ambit of the DG assistance program, could influence outcomes in the areas being observed.

The measurement strategy in the existing M&E plan could also be significantly improved. The use of subjective assessments of activities by their participants raises two concerns: (1) the assessments are subjective rather than objective and (2) the satisfaction of participants (parties, CSOs, etc.) is not necessarily the same thing as project success and thus cannot provide reliable information about the project's impact. So one major area where improvement would be possible is providing additional external or objective measurements of program success (e.g., how much more funding for help for the disabled was actually granted to districts where CSOs received USAID assistance than was granted to otherwise comparable districts?).

Building the Capacity of the Parliament in Uganda

Another example is the case of the bundle of USAID-sponsored activities designed to build the capacity of the Ugandan parliament through the sponsorship of field visits, public dialogues, and consultative workshops for members of parliament and parliamentary committee staff regarding specific issues such as corruption, family planning/reproductive health, and people with disabilities. The project sponsored fact-finding monitoring and supervisory field visits to 35 districts, including a number in Northern Uganda, where many members of parliament and parliamentary staff rarely venture. Again, the goals of the project are worthy and the activities appear to be well conceived; however, the project is not amenable to randomized evaluation. How can it be known whether or not the money spent on project activities had any demonstrable positive effect? Did members of parliament who participated in these activities behave differently than those who did not?

As is often the case with such projects, the principal monitoring method for these activities involved the collection of quarterly data on "outputs" (i.e., the number of public meetings attended by parliamentary committee members at the local level, the number of CSOs submitting written comments to parliamentary committee hearings, etc.)
rather than "outcomes" (such as the impact that workshop attendance had on information acquisition, job performance, or other aspects of future behavior). Also, the reports submitted by the implementing contractor do not provide much information on how the locations where the various public meetings took place or the participants who were invited to attend were selected, both of which are crucial for ruling out selection effects. The indicators measured by the contractor as part of the performance measurement plan of the project were used as indicators of project success. However, because of the absence of a control group, it is impossible to disentangle time-varying unobserved trends from the impact of the project. For example, it is difficult to conclude that an increase in the number of parliamentary committees responding to CSOs with briefings and dialogues is an indication of project success. Such a change could reflect other (local) dynamics, the impact of other donor programs, the impact of the
project of interest, or a combination of these. Similarly, in the absence of a counterfactual, the fact that the Persons with Disabilities Act was passed and enacted without executive initiative or support cannot be assumed to reflect the impact of the project.

As with the projects described previously, an evaluation design that furnishes more information for assessing impact than the current M&E approach is possible. First, assessing the impact of these initiatives would require some measurement of outcomes among a control group of members of parliament who were not exposed to the field visits, public dialogues, and consultative workshops. Perhaps with the intervention defined so broadly, identifying a control group is too difficult. By focusing on a narrower set of activities, such as the opportunity for members of parliament to participate in field visits or facilitated consultative meetings between parliamentary committees and their constituencies, envisioning a reasonable control group is more feasible. For example, if not all members of parliament are going to participate in field visits, one simply needs to understand the selection process for members (and the differences that exist between participants and nonparticipants) in order to rule out characteristics correlated with participation in the program that might account for any observed differences after the field visits (e.g., members of parliament already engaged in the conflict elect to take part in a field visit to Northern Uganda). It might be possible to facilitate a series of consultative meetings for one committee at a time and to compare how behavior changes in that committee to other similar committees that had not yet benefited from the program.
In terms of the measurement of impact, one simple improvement could involve interviewing members of parliament about their actions and opinions rather than their perceptions of the usefulness of program activities. For example, instead of (or in addition to) asking, "If you participated or were aware of these activities, how useful were they in helping to generate government action on the problems in Northern Uganda?" (the current questionnaire item), a better approach would be to ask members of parliament at the beginning and after the program about their opinions on the conflict in Northern Uganda and about what they thought should be done and any action they have taken or intend to take. Questions aimed at measuring precisely what actions, if any, members of parliament or parliamentary committees took following the field visits would provide a better sense of the effects on behavior. If these questions were asked of both participants and nonparticipants, analysis of the differences between "treatment" and "control" members of parliament would be possible. Even if these questions were asked only of participants but both before the intervention and afterward, analysis of the changes in the opinions and actions of "treatment" members of parliament would be possible.
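When the same questions are asked of both participants and nonparticipants, before and after the program, the two comparisons just described can be combined into a single difference-in-differences calculation. The sketch below is only an illustration under stated assumptions: the outcome is a hypothetical count of actions each member reports having taken, the group labels and field names are invented for the example, and the numbers are not drawn from any survey.

```python
from statistics import mean


def difference_in_differences(records):
    """Average change among participants minus average change among nonparticipants.

    records: list of dicts with keys 'group' ('participant' or
    'nonparticipant'), 'before', and 'after' (the outcome measured
    at baseline and at follow-up for the same member).
    """
    def average_change(group):
        changes = [r["after"] - r["before"] for r in records if r["group"] == group]
        if not changes:
            raise ValueError(f"no records for group {group!r}")
        return mean(changes)

    return average_change("participant") - average_change("nonparticipant")


# Hypothetical survey responses: number of actions on Northern Uganda
# each member of parliament reports, before and after the field visits.
example_records = [
    {"group": "participant", "before": 2, "after": 5},
    {"group": "participant", "before": 3, "after": 5},
    {"group": "nonparticipant", "before": 2, "after": 3},
    {"group": "nonparticipant", "before": 3, "after": 4},
]

effect = difference_in_differences(example_records)
print(effect)
```

Subtracting the nonparticipants' change removes trends that affected all members over the period. It does not remove differences in who chose to participate, which is why the selection process still has to be understood.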
The advantage of this type of evaluation design is that it permits analysis of changes or differences in members' actual opinions and actions rather than their subjective assessment of the "usefulness" of programs. In addition, collecting information on the basic characteristics of those members who participated and those who did not would allow some statistical matching of the two groups to better determine how much the USAID DG program, as opposed to other prior characteristics of the members, contributed to any observed differences between the two groups in their subsequent actions and opinions.

As with the two other projects described earlier, implementing the proposed changes involves trade-offs, but the team concluded that, if USAID wished to learn more about the precise effectiveness of these programs, there is substantial opportunity to develop impact evaluations on these activities, even without using randomized designs.

What to Do When There Is Only One Unit of Analysis [10]

Many USAID projects involve interventions designed to affect a single unit of analysis. Such interventions are among the most important DG-promoting activities that the agency underwrites. But for the reasons explained in Chapter 5, they are also among the most difficult to evaluate.

For example, a major part of USAID's DG-related activities in Albania involves increasing the effectiveness and fairness of legal-sector institutions. While critically important to the mission's goals, almost none of the rule-of-law activities are amenable to randomized evaluation or other methods that exploit comparisons with untreated units.
This is because they each deal with (1) technical assistance to a single bureaucracy (e.g., Inspectorate of the High Council of Justice, Inspectorate of the Ministry of Justice, High Inspectorate for the Declaration and Audit of Assets, Citizens Advocacy Office, and National Chamber of Advocates); (2) support for the preparation of a particular piece of legislation (e.g., Freedom of Information Act and Administrative Procedures Code, a new conflict-of-interest law, and a new press law); or (3) support for a single activity, such as implementation of an annual corruption survey. For a randomized evaluation of the efficacy of these activities to be possible, they would have to be, in principle, able to be implemented across a large number of units, which these are not. There is only one Inspectorate of the High Council of Justice, only one conflict-of-interest law being prepared, and only one National Chamber of Advocates being supported, so it is not possible to compare the impact of support for these activities both where they are and are not being supported and certainly not across multiple units.

Footnote 10: This section and the next one draw on the work of a team led by Dan Posner, University of California, Los Angeles.

The best way to evaluate the success of these activities is to identify the outcomes they are designed to affect, measure the outcomes both before and after the activities have been undertaken, and compare these measures. Collecting high-quality baseline and follow-up data, the former stretching back as far in time as possible, is the primary tool for impact evaluation in such a situation. When outcome data show a marked shift subsequent to an intervention and examination of other possible events or trends shows that they did not correspond to this shift, a credible case can be made for the intervention's impact.

One problem, however, is that finding appropriate measures of the outcomes that the activities are designed to affect is frequently far from straightforward. For example, the goals of the technical assistance to the Inspectorates of the High Council of Justice and the Ministry of Justice are to improve the transparency and accountability of the judiciary and to increase public confidence in judicial integrity. The latter can be measured fairly easily using public opinion polls administered before and after the period during which technical assistance was offered and then comparing the results. However, measuring the degree to which the judiciary is transparent and accountable is much more difficult. Part of the problem stems from the fact that the true level of transparency and accountability in the judiciary can only be ascertained vis-à-vis an (unknown) set of activities that should be brought to light and an (unknown) level of malfeasance that needs to be addressed.
For example, suppose that, following implementation of a program designed to support the Inspectorate of the High Council of Justice, three judges are brought up on charges of corruption. Should this be taken as a sign that the activities worked in generating greater accountability? Compared to a baseline of no prosecutions, the answer is probably yes, at least to some degree, although one would also want to know whether prosecutions were selective, based on political reasons. But knowing just how effective the activities were depends on whether there were just three corrupt judges who should have been prosecuted or whether there were, in fact, 20, in which case prosecuting the three only scratched the surface of the problem. To be sure, 3 out of 20 is better than none, so the program can be judged to have had a positive impact in at least some sense. But knowing the absolute level of effectiveness of the program may be elusive. Parallel problems affect other rule-of-law initiatives, such as efforts to improve the ability of lawyers to police themselves.

A slightly different evaluation problem arises with respect to activities designed to support the drafting of various pieces of legislation. One fairly straightforward measure of success in this area is simply whether
or not the law was actually drafted and, if so, whether it included language that will demonstrably strengthen the rule of law. But assessing whether or not USAID's support had any impact requires weighing a counterfactual question: Would the legislation have been drafted without USAID's support and what would it have looked like? If the answers to these questions are that the legislation would not have been drafted or that the language in the resulting law would not have been optimal, the support from USAID can be judged to have been successful to the extent that the result observed is better than this counterfactual outcome. The broader problem, however, is that achieving the overarching strategic objective of strengthening the rule of law will involve more than just getting legislation drafted; it will involve getting legislation passed and then having it enforced. The point, echoing a theme from Chapter 3, is that the measurable outcome of the USAID-sponsored activity is several steps removed from the true goals of the intervention, and any assessment of "success" in these areas must be interpreted in this light. Proper measurement of project impact must move beyond proximate questions (were the institutions created?) to more distant and policy-relevant ones (have the outcomes that the existence of the new institutions was hypothesized to affect been altered in a positive way?). Answering the second question requires the existence of high-quality baseline data, preferably stretching back as far in time as possible so as to be able to distinguish general trends from project effects.

Additional Techniques to Aid Project Evaluation When N = 1

In addition to collecting high-quality baseline and follow-up data, two other techniques can aid project evaluators in making sound judgments about project efficacy. The first is to explicitly attempt to identify and rule out alternative explanations.
If what looks like a project effect is identified, evaluators must ask what other factors outside the scope of the project might have caused the observed outcome. Can they be ruled out? For example, suppose it is found that the passage of a new anticorruption law whose drafting was sponsored by USAID corresponds with a drop in corruption, as measured in national surveys. It would be important to think carefully about other factors that might have occurred at the same time as passage of the new legislation which might also account for the drop in measured corruption. Perhaps a crusading anticorruption minister was appointed right after the new legislation was passed. Might her presence at the helm of a key ministry have caused the change? One way to rule out this possibility would be to see whether larger changes in perceived corruption were evidenced in her ministry than in others or whether perceived corruption increased again after she left office, both
of which would be consistent with the argument that her appointment, not the new law, was responsible for the drop in corruption measured in the surveys. The more such competing explanations can be identified and ruled out, the more confidence there can be in the conclusion that the legislation was responsible for the positive outcome.

Evaluators are in a better position to rule out alternative explanations to the extent that USAID or its implementing partners can manipulate the timing of the intervention. An effort can be made even before a program is begun to identify other planned interventions or major events that could affect the outcome of interest and make it hard to disentangle the effect of USAID's program from other possible factors. In this context a decision could be made to delay or speed up implementation of the program to minimize the likelihood that temporal changes in the measurement of program outcomes reflect things other than USAID's program. To make this idea more concrete, imagine an intervention designed to increase the quantity and quality of debate in a parliament. The intervention might involve a series of training sessions on parliamentary business, a change in the rules that ties salary to attendance and participation, or an accountability mechanism that reports to the public on the activities of members of parliament. Regardless of the intervention, the outcome of interest is clear: whether members exhibit higher attendance rates and are more active in parliament after the project is complete. The problem is that many other factors might be responsible for an increase in attendance or participation. For example, if preparations for the budget begin soon after the program is initiated, this may drive up attendance and participation.
If these other factors can be anticipated and avoided in planning the timing of the intervention, even stronger inferences can be drawn from temporal trends in the outcome variables.

A second strategy for improving causal inference in an N = 1 design is to look beyond the narrow outcome that the project was designed to affect and try to identify other outcomes that would be consistent with positive project impact. The example provided earlier from Uganda of using the success of projects targeting the disabled to verify the effectiveness of completely separate projects designed to promote the empowerment of marginalized citizens illustrates this technique. With regard to evaluating the effectiveness of the anticorruption legislation, an example of such a strategy would be to look at changes in applications for business licenses, which might be expected to rise as the requirement that applicants pay bribes diminishes. Again, the greater the number of outcomes consistent with project success that can be identified, the more confidence there can be in inferring that the project was, in fact, successful.

Designing impact evaluations where a large number of units are available and USAID has control over where or with whom it will work is
relatively straightforward, although the actual design requires substantial skill. In principle, all that is needed is a random number generator (or even just a coin to toss) to assign units to treatment or control groups. Then once the project is implemented, all that is needed is to compare average outcomes in the control and treatment groups and test whether the differences are statistically significant. The higher art of impact evaluation comes in situations where randomized evaluations are not possible. Under such circumstances, identifying sound project designs requires flexibility, creativity, understanding of the facts on the ground, and a good sense of the implications of various design decisions for the interpretation of program evaluations. This makes them difficult, both to design and, because of the need to tailor the methodology to the details of the particular project in question, to specify ex ante. However, it does not make them impossible. As the many examples provided in this chapter suggest, there are opportunities to move beyond the current M&E approach to impact evaluations that provide key information for determining program effects, even in the most difficult, and quite common, situation where there is only a single unit being treated. Good designs require skilled, well-trained program designers, the cultivation of which should be a priority for USAID. They also require an organization with the resources and capacity to do the work, issues discussed in Chapters 8 and 9.

Conclusions

For every DG-promoting activity that USAID undertakes, particularly those that are central to its mission or that involve the expenditure of large sums of money, USAID wants to be able to answer two questions: Was doing the activity better than doing nothing at all? If so, how much better?
Generally, although they may serve other management purposes well, the required M&E designs that USAID currently employs are insufficient to do this. Answering these questions requires the use of impact evaluations, which in turn require somewhat different designs. The committee found that the vast majority of USAID staff that it encountered were deeply committed to improving democratic governance around the world and to being able to evaluate the progress they were, or were not, making. The committee also found that many USAID staffers were frustrated by their inability to better answer the basic question: Are we having a positive impact?

The impact evaluation designs described in this report, and the examples presented in the previous two chapters, suggest that in principle there is considerable scope for USAID to improve its ability to answer this question. The committee would neither expect nor recommend that the agency undertake impact evaluations of all of its activities. The committee's specific recommendation is that USAID begin with a modest and focused initiative to examine the feasibility of applying such impact evaluation designs, including those using randomized assignment, to a small number of projects.

At the same time, the committee realizes that undertaking more impact evaluations alone will not provide the broadly based and context-sensitive information that USAID needs to plan its DG programs. Process evaluations, the kinds of case studies discussed in Chapter 4, and more informal lessons from the field obtained by DG staff, implementers, nongovernmental organizations, and independent researchers provide important insights, valuable hypotheses, and illustrations of how programs are received and respond to changing conditions. The committee believes that USAID needs to develop organizational characteristics that will provide both incentives for more varied evaluations of its projects and mechanisms to help agency staff absorb, discuss, and continually learn from a variety of sources about those factors that affect the impact of DG programs.

REFERENCES

Bertrand, M., Duflo, E., and Mullainathan, S. 2004. How Much Should We Trust Difference-in-Difference Estimates? Quarterly Journal of Economics 119(1):249-275.

Institute for Development Research Alternatives. 2007. Corruption in Albania: Perception and Experience: Survey 2007, Summary of Findings. Tirana: Institute for Development Research Alternatives and Casals & Associates.

Seligson, M.A. 2002. The Impact of Corruption on Regime Legitimacy: A Comparative Study of Four Latin American Countries. Journal of Politics 64:408-433.

Seligson, M.A. 2006. The Measurement and Impact of Corruption Victimization: Survey Evidence from Latin America. World Development 34(2):381-404.

Seligson, M.A., and Recanatini, F. 2003. Governance and Corruption. Pp. 411-443 in Ecuador: An Economic and Social Agenda in the New Millennium, V. Fretes-Cibils, M.M. Giugale, and J.R. López-Cálix, eds. Washington, DC: World Bank.