5 Methodologies of Impact Evaluation

Introduction

This chapter presents a guide to impact evaluations as they are currently practiced in the field of foreign assistance. The committee recognizes, as stated before, that the application of impact evaluations to foreign assistance in general, and to democracy and governance (DG) projects in particular, is controversial. The purpose of this chapter is thus to present the range of impact evaluation designs, as a prelude to the results of the committee's field teams' exploration of their potential application as part of the mix of evaluations and assessments undertaken by the U.S. Agency for International Development (USAID) presented in the next two chapters.

The highest standard of credible inference in impact evaluation is achieved when the number of people, villages, neighborhoods, or other groupings is large enough, and the project design flexible enough, to allow randomized assignment to treatment and nontreatment groups. Yet the committee realizes that this method is often not practical for many DG projects. Thus this chapter also examines credible inference designs for cases where randomization is not possible and for projects with a small number of units, or even a single case, involved in the project.

Some of the material in this chapter is somewhat technical, but this is necessary for this chapter to serve, as the committee hopes it will, as a guide to the design of useful and credible impact evaluations for DG missions and implementers. The technical terms used here are defined in the chapter text and also in the Glossary at the end of the report. Also,
examples are provided to show how such designs have already been implemented in the field for various foreign assistance and democracy assistance programs.

Importance of Sound and Credible Impact Evaluations for DG Assistance

As discussed in some detail in Chapter 2, until 1995 USAID required evaluations of all its projects, including those in DG, to assess their effectiveness in meeting program goals. Most of the evaluations, however, were process evaluations: post-hoc assessments by teams of outside experts who sought to examine how a project unfolded and whether (and why) it met anticipated goals. While these were valuable examinations of how projects were implemented and their perceived effects, such evaluations generally could not provide the evidence of impact that would result from sound impact assessments. This was because in most cases they lacked necessary baseline data from before the project was begun and because in almost all cases they did not examine appropriate comparison groups to determine what most likely would have occurred in the absence of the projects (see Bollen et al. for a review of past DG evaluations).

As noted, the number of such evaluations undertaken by USAID has declined in recent years. Evaluations are now optional and are conducted mainly at the discretion of individual missions for specific purposes, such as when a major project is ending and a follow-on is expected or when a DG officer feels that something has "gone wrong" and wants to understand and document the reasons for the problem. Such special evaluations can have substantial value for management purposes, but the committee believes that USAID is overlooking a major opportunity to learn systematically from its experience about project success and failure by not making impact evaluations a significant part of its monitoring and evaluation (M&E) activities where appropriate and feasible.
Such impact evaluations could be particularly useful to provide insights into the effects of its largest-scale and most frequently used projects and to test key development hypotheses that guide its programming.

There are three fundamental elements of sound and credible impact evaluations. First, such evaluations require measures relevant to desired project outcomes, not merely of project activity or outputs. Second, they require good baseline, in-process, and endpoint measures of those outcomes to track the effects of interventions over time. Finally, they require comparison of those who receive assistance with appropriate nontreatment groups to determine whether any observed changes in outcomes are, in fact, due to the intervention.

The committee's discussions with USAID staff, contractors for USAID,
and our own field study of USAID missions have shown that, even within the current structure of project monitoring, USAID is already engaged in pursuing the first and second requirements. While in some cases progress remains to be made on devising appropriate outcome measures and in ensuring the allocation of time and resources to collect baseline data, USAID has generally recognized the importance of these tasks. These efforts do vary from mission to mission, according to their available resources and priorities, so considerable variation remains among missions and projects in these regards.

However, the committee found that there is little or no evidence in current or past USAID evaluation practices that indicates the agency is making regular efforts to meet the third requirement: comparisons. With rare exceptions, USAID evaluations and missions generally do not allocate resources to baseline and follow-up measurements on nonintervention groups. Virtually all of the USAID evaluations of which the committee is aware focus on studies of groups that received USAID DG assistance, and estimates of what would have happened in the absence of such interventions are based on assumptions and subjective judgments, rather than explicit comparisons with groups that did not receive DG assistance. It is this almost total absence of comparisons with nontreated groups, more than any other single factor, that should be addressed in order to draw more credible and powerful conclusions about the impact of USAID DG projects in the future.

To briefly illustrate the importance of conducting baseline and follow-up measurements for both treated and nontreated comparison groups, consider the following two simple examples:

1. A consulting firm claims to have a training program that will make legislators more effective. To demonstrate the program's effectiveness, the firm recruits a dozen legislators and gives them all a year of training.
The firm then measures the number of bills those legislators have introduced in parliament in the year prior to the training and the number of bills introduced in the year following the training and finds that each legislator increased the number of bills he or she had introduced by 30 to 100 percent! Based on this the consultants claim they have demonstrated the efficacy of the program.

Yet to know whether or not the training really was effective, we would need to know how much each legislator's performance would have changed if he or she had not taken the training program. One way of answering this question is to compare the performance of the legislators who were trained to the performance of a comparable set of legislators who were not. When someone points this out to the consultants and they go back and measure the legislative activity of all the legislators for
the prior year, they find that the legislators who were not in the training group introduced, on average, exactly the same number of bills as those who were trained.

What has happened? It is possible that the increase in the number of bills presented by all legislators resulted from greater experience in office, so that everyone introduces more bills in his or her third year in office than in the first year. Or there may have been a rule change, or policy pressures, that resulted in a general increase in legislative activity. Thus it is entirely possible that the observed increase in legislative activity by those trained had nothing to do with the training program at all, and the program's effect might have been zero.

Or it is possible that those legislators who signed up for the program were an unusual group. They might have been those legislators who were already the most active and who wanted to increase their skills. Thus the program might have worked for them but would not have worked for others. Another possibility is that the legislators who signed up were those who were the least active and who wanted the training to enable them to "catch up" with their more active colleagues. In this case the results do show merit to the training program, but again it is not clear how much such a program would help the average legislator improve.

The only way to resolve these various possibilities would be to have taken measures of legislative activity before and after the training program for both those legislators in the program and those not in the program. While it would be most desirable to have randomly assigned legislators to take the training or not, that is not necessary for the before and after comparison measures to still yield valuable and credible information.
For example, even if legislators themselves chose who would receive the training, we would want to know whether the trained group had previously been more active, or less active, than their colleagues not receiving training. We could also then make statistical adjustments to the comparison, reflecting differences in prior legislative activity and experience between those who were trained and those who were not, to help determine what the true impact of the training program was, net of other factors that the training could not affect.

In short, simply knowing that a training program increased the legislative activity of those trained does not allow one to choose between many different hypotheses regarding the true impact of that program, which could be zero or highly effective in providing "catch-up" skills to legislators who need them. The only way to obtain sound and credible judgments of a program's effect is with before and after measurements on both the treatment and the relevant nontreatment groups.

2. The same consulting firm also claims to have a program that will increase integrity and reduce corruption among judges. To test the
program's effectiveness, the firm recruits a dozen judges to receive the program's training for a year. When the consultants examine the rate of perceived bribery and corruption, or count cases thrown out or settled in favor of the higher status plaintiff or defendant, in those courts where the judges were trained, they find that there has been no reduction in those measures of corruption. On this basis the donor might decide that the program did not work.

However, to really reach this conclusion, the donor would have to know whether, and how much, corruption would have changed if those judges had not received the training. When the donor asks for data on perceived bribery and corruption, or counts of cases thrown out or settled in favor of higher status plaintiffs or defendants, in other courts, these measures turn out to be much higher than in the courts where judges did receive the training. Again, the new information forces us to ask: What really happened?

It is possible that opportunities for corruption increased in the country, so that most judges moved to higher levels of corruption. In this case the constant level of corruption observed in the courts whose judges received training indicated a substantially greater ability to resist those opportunities. So, when properly evaluated against a comparison group, it turns out that the program was, in fact, effective. To be sure, however, it would be valuable to also have baseline data on corruption levels in the courts whose judges were not trained; this would confirm the belief that corruption levels increased generally except in those courts whose judges received the program. Without such data it is not known for certain whether this is true or whether the judges who signed up for the training were already those who were struggling against corruption and who started with much lower rates of corruption than other courts.
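
The logic of the two examples above can be sketched as a simple calculation: take the before-to-after change in the treated group and subtract the before-to-after change in the comparison group. This is commonly called a difference-in-differences estimate, a label the chapter itself does not use. The function name and all of the numbers below are hypothetical and chosen only to mirror the two stories; this is a minimal sketch, not a substitute for the statistical adjustments discussed in the text.

```python
from statistics import mean


def difference_in_differences(treated_before, treated_after,
                              comparison_before, comparison_after):
    """Return (treated change, comparison change, difference of the two).

    Each argument is a list of outcome measurements for one group at one time.
    """
    treated_change = mean(treated_after) - mean(treated_before)
    comparison_change = mean(comparison_after) - mean(comparison_before)
    return treated_change, comparison_change, treated_change - comparison_change


# Example 1 (hypothetical bill counts): trained legislators introduce more
# bills after training, but so do untrained legislators.
legislators = difference_in_differences(
    treated_before=[4, 5, 6, 5], treated_after=[6, 8, 9, 7],
    comparison_before=[4, 6, 5, 5], comparison_after=[6, 9, 8, 7],
)

# Example 2 (hypothetical corruption scores): trained courts show no change,
# while untrained courts get worse over the same period.
judges = difference_in_differences(
    treated_before=[10, 12, 11], treated_after=[10, 12, 11],
    comparison_before=[10, 11, 12], comparison_after=[15, 16, 17],
)

print("legislators:", legislators)
print("judges:", judges)
```

With these made-up figures, the legislators' apparent gain disappears once the comparison group's identical gain is subtracted, while the judges' apparent lack of progress becomes a relative reduction in corruption. As the text notes, such a comparison is most credible when groups are randomly assigned or when prior differences between groups are accounted for.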

These examples underscore the vital importance of comparisons with groups not receiving the treatment in order to avoid misleading errors and to accurately evaluate project impacts. From a public policy standpoint, the cost of such errors can be high. In the examples given here, it might have caused aid programs to waste money on training programs that were, in fact, ineffective. Or it might have led to cuts in funding for anticorruption programs that were, in fact, highly valuable in preventing substantial increases in corruption.

This chapter discusses how best to obtain comparisons for evaluating USAID democracy assistance projects. Such comparisons range from the most rigorous possible (comparing randomly chosen treatment and nontreatment groups) to a variety of less exacting but still highly useful comparisons, including multiple and single cases, time series, and matched case designs. It bears repeating: The goal in all of these designs is to evaluate projects by using appropriate comparisons in order to increase confidence in drawing conclusions about cause and effect.

Plan of This Chapter

The chapter begins with a discussion of what methodologists term "internal" and "external" validity. Internal validity is defined as "the approximate truth of inferences regarding cause-effect or causal relationships" (Trochim and Donnelly 2007:G4). The greater the internal validity, the greater the confidence one can have in the conclusions that a given project evaluation reaches. The paramount goal of evaluation design is to maximize internal validity. External validity refers to whether the conclusions of a given evaluation are likely to be applicable to other projects and thereby contribute to understanding in a general sense what works and what does not. Given that USAID implements similar projects in multiple country settings, the external validity of the findings of a given project evaluation is particularly important. This section of the chapter also stresses the importance of what the committee terms "building knowledge."

The second part of the chapter outlines a typology of evaluation methodologies that USAID missions might apply in various circumstances to maximize their ability to assess the efficacy of their programming in the DG area. Large N randomized designs permit the most credible inferences about whether a project worked or not (i.e., the greatest internal validity). By comparison, the post-hoc assessments that are the basis of many current and past USAID evaluations provide perhaps the least reliable basis for inferences about the actual causal impact of DG assistance. Between these two ends of the spectrum lie a number of different evaluation designs that offer increasing levels of confidence in the inferences one can make.

In describing these various evaluation options, the approach taken in this chapter is largely theoretical and academic.
Evaluation strategies are compared and contrasted based on their methodological strengths and weaknesses, not their feasibility in the field. While a first step is taken at the end of the chapter in the direction of exploring whether the most rigorous evaluation design (large N randomized evaluation) is feasible for many DG projects, a more extensive treatment of this key question is reserved for the chapters that follow, when the committee presents the findings of its field studies, in which the feasibility of various impact evaluation designs is explored for current USAID DG programs with mission directors and DG staff.

Points of Clarification

Before plunging into the discussion of evaluation methodologies, a few important points of clarification are needed. First, it should be clear that the committee's focus on impact evaluations is not intended to deny
the need for, or imply the unimportance of, other types of M&E activities. The committee recognizes that monitoring is vital to ensure proper use of funds and that process evaluations are important management tools for investigating the implementation and reception of DG projects. This report focuses on how to develop impact evaluations because the committee believes that at present this is the most underutilized approach in DG program evaluations and that therefore USAID has the most to gain if it is feasible to add sound and credible impact evaluations to its portfolio of M&E activities.

Second, the committee recognizes that not all projects need be, or should be, chosen for the most rigorous forms of impact evaluation. Doing so would likely impose an unacceptably high cost on USAID's DG programming. The committee is therefore recommending that such evaluations initially only be undertaken for a select few of USAID's DG programs, a recommendation emphasized in Chapter 9. The committee does believe, however, that DG officers should be aware of the potential value of obtaining baseline and comparison group information for projects to which they attach great importance, so that they can better decide how to develop the mix of M&E efforts across the various projects that they oversee.

Third, before beginning the task of evaluating a project, precisely what is to be evaluated must be defined. Evaluating a project requires the identification of the specific intervention and a set of relevant and measurable outcomes thought to result from that policy intervention. Even this apparently simple task can pose challenges, since most DG programs are complex (compound) interventions, often combining several activities (e.g., advice, formal training, monetary incentives) and are often expected to produce several desired outcomes.
A project focused on the judiciary, for example, may include a range of different activities intended to bolster the independence and efficiency of the judiciary in a country and might be expected to produce a variety of outcomes, including swifter processing of cases, greater impartiality among plaintiffs and defendants, greater conformity to statutes or precedents, and greater independence vis-à-vis the executive. The evaluator must therefore decide whether to test the whole project or parts of the project or whether it would make sense, as discussed further below, to reconfigure the project to allow for clearer impact evaluation of specific interventions.

As USAID's primary focus will always be on program implementation, rather than evaluation per se, evaluators will need to respond to the challenges posed by often ambitious and multitasked programs.

At this point, a note on terminology is required. As noted above, an "activity" is defined as the most basic sort of action taken in the field, such as a training camp, a conference, advice rendered, money tendered,
and so forth. A "project" is understood to be an aggregation of activities, including all those mentioned in specific USAID contracts with implementers, such as in requests for proposals and in subsequent documents produced in connection with these projects. A project can also be referred to as an "intervention" or "treatment."

The question of what constitutes an appropriate intervention is a critical issue faced by all attempts at project evaluation. A number of factors impinge on this decision.

Lumping activities within a given project together for evaluation often makes sense. If all parts of a program are expected to contribute to common outcomes, and especially if the bundled activities will have a stronger and more readily observed outcome than the separate parts, then treating the set of activities together as a single intervention may be the best way to proceed.

In other cases, trying to separate various activities and measuring their impact may be preferred. The value of disaggregation seems clear from the standpoint of impact evaluation. After all, if only one part of a five-part program is in fact producing 90 percent of the observed results, this would be good to know, so that only that one part continues to be supported. But whether or not such a separation seems worth testing really depends on whether it is viable to offer certain parts of a project and not others.

Sometimes it is possible to test both aggregated and disaggregated components of a project in a single research design. This requires a sufficient number of cases to allow for multiple treatment groups. For example, Group A could receive one part of a program, Group B could receive two parts of a program, Group C could receive three parts of a program, and another group would be required as a control. In this example, three discrete interventions and their combination could be evaluated simultaneously.
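
One way to picture the multiple-group design just described is as a random assignment of units to four arms. The sketch below is illustrative only: the function name, the arm labels, the unit names, and the choice of 40 villages are all hypothetical, and it assumes units can in fact be assigned at random, which the chapter stresses is often not the case for DG projects.

```python
import random
from collections import Counter


def assign_to_arms(units, arms, seed=0):
    """Shuffle units, then deal them round-robin so arm sizes differ by at most one."""
    rng = random.Random(seed)  # fixed seed so the assignment can be reproduced
    shuffled = list(units)
    rng.shuffle(shuffled)
    return {unit: arms[i % len(arms)] for i, unit in enumerate(shuffled)}


# Hypothetical arms matching the Group A / B / C / control example in the text.
arms = ["A: one part", "B: two parts", "C: three parts", "control"]
units = [f"village_{i:02d}" for i in range(40)]

assignment = assign_to_arms(units, arms, seed=2008)
arm_sizes = Counter(assignment.values())
print(arm_sizes)
```

Dealing shuffled units round-robin keeps the arms the same size, which matters when the total number of cases is modest; the more arms a design has, the fewer cases each arm receives.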

Many additional factors may impinge on the crafting of an appropriate design for impact evaluation of a particular intervention. These are reviewed in detail in the subsequent section. The committee understands that there is no magic formula for deciding when an impact evaluation might be desirable or which design is the best trade-off in terms of costs, need for information, and policy demands. What is clear, however, is that since impact evaluations are, in effect, tests of the hypothesis that a given intervention will create different outcomes than would be observed in the absence of that intervention, how well one specifies that hypothesis greatly influences what one will find at the end of the day. The question asked determines the sort of answers that can be received. The committee wants to flag this as a critical issue for USAID policymakers and project implementers to consider; further suggestions are given in Chapters 8 and 9 for how this could be addressed as part of an overall Strategic and
Operational Research Agenda project for learning about DG program effectiveness to guide policy programming.

Internal Validity, External Validity, and Building Knowledge

Internal Validity

A sound and credible impact evaluation has one primary goal: to determine the impact of a particular project in a particular place at a particular time. This is usually understood as a question of internal validity. In a given instance, what causal effect did a specific policy intervention, X, have on a specific outcome, Y? This question may be rephrased as: If X were removed or altered, would Y have changed?

Note that the only way to answer this question with complete certainty is to go back in time to replay history without the project (called "the counterfactual"). Since that cannot be done, we try to come as close as possible to the "time machine" by holding constant any background features that might affect Y (the ceteris paribus conditions) while altering X, the intervention of interest. We thus replay the scenario under slightly different circumstances, observing the result (Y).

It is in determining how best to simulate this counterfactual situation of replaying history without the intervention that the craft of evaluation design comes into play. Indeed, a large literature within the social sciences is devoted to this question, often characterized as a question of causal assessment or research design (e.g., Shadish et al. 2002, Bloom 2005, Duflo et al. 2006b). The following section attempts to reduce this complicated set of issues down to a few key ingredients, recognizing that many issues can be treated only superficially.
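
The counterfactual question above is often written in what is called potential-outcomes notation. The chapter does not use this notation, so the symbols below (other than X and Y) are assumptions introduced here purely as an illustrative sketch.

```latex
% Assumed notation (not from the chapter):
%   X_i = 1 if case i receives the intervention, 0 otherwise
%   Y_i(1) = outcome for case i with the intervention
%   Y_i(0) = outcome for the same case without it (the counterfactual)
\begin{align*}
  \tau_i &= Y_i(1) - Y_i(0)
    && \text{effect of the intervention on case } i \\
  Y_i^{\mathrm{obs}} &= X_i \, Y_i(1) + (1 - X_i) \, Y_i(0)
    && \text{only one of the two outcomes is ever observed} \\
  \hat{\tau} &= \bar{Y}^{\mathrm{obs}}_{X=1} - \bar{Y}^{\mathrm{obs}}_{X=0}
    && \text{comparison of treated and untreated group means}
\end{align*}
```

Because the individual effect can never be observed directly, an evaluation relies on a group comparison such as the last line, which is a credible stand-in for the average effect only to the extent that the treated and untreated groups are equivalent in the respects that might affect Y.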

Consider that certain persistent features of research design may assist us in reaching conclusions about whether X really did cause Y: (1) interventions that are simple, strong, discrete, and measurable; (2) outcomes that are measurable, precise, determinate, immediate, and multiple; (3) a large sample of cases; (4) spatial equivalence between treatment and control groups; and (5) temporal equivalence between pre- and posttests. Each of these is discussed in turn.

1. The intervention: discrete, with immediate causal effects, measurable. A discrete intervention that registers immediate causal effects is easier to test because only one pre- and posttest is necessary (perhaps only a posttest if there is a control group and trends are stable or easily neutralized by the control). That is, information about the desired outcome is collected before and after the intervention. By contrast, an intervention
that takes place gradually, or has only long-term effects, is more difficult to test. A measurable intervention is, of course, easier to test than one that is resistant to operationalization (i.e., must be studied through proxies or impressionistic qualitative analysis).

2. The outcome(s): measurable, precise, determinate, and multiple. The best research designs feature outcomes that are easily observed, that can be readily measured, where the predictions of the hypotheses guiding the intervention are precise and determinate (rather than ambiguous), and where there are multiple outcomes that the theory predicts, some of which may pertain to causal processes rather than final policy outcomes. The latter is important because it provides researchers with further evidence by which to test (confirm or disconfirm) the underlying hypothesis linking the intervention to the outcome and to elucidate its causal mechanisms.

3. Large sample size. N refers here to the number of cases that are available for study in a given setting (i.e., the sample size). A larger N means that one can glean more accurate knowledge about the effectiveness of the intervention, all other things being equal. Of course, the cases within the sample must be similar enough to one another to be compared; that is, the posited causal relationship must exist in roughly the same form for all cases in the sample or any dissimilarities must be amenable to post-hoc modeling. Among the questions to be addressed are: How large is the N? How similar are the units (cases) in respects that might affect the posited causal relationships? If dissimilar, can these heterogeneous elements be neutralized by some feature of the research design (see below)?

4. Spatial equivalence (between treatment and control groups). By pure spatial comparisons what is meant are controls that mirror the treatment group in all ways that might affect the posited causal relationship.
The easiest way to achieve equivalence between these two groups is to choose cases randomly from the population. Sometimes, nonrandomized selection procedures can be achieved, or exist naturally, that provide equivalence, but this is relatively rare. The key question to ask is always: How similar are the treatment and control groups in ways that might affect the intended outcome? This is often referred to as "pretreatment equivalence." Other important questions include: Can the treatment cases be chosen randomly, or through some process that approximates random selection? Can the equivalence initially present at the point of intervention between treatment and control groups be maintained over the life of the study (i.e., over whatever time is relevant to observe the putative causal effects)? This may be referred to as "posttreatment equivalence."

5. Temporal equivalence (between pre- and posttests). Causal attribution works by comparing spatially and/or temporally. This is usually done through pre- and posttreatment tests (i.e., measurements of the outcome before and after the intervention), creating two groups, the preintervention group and the postintervention group. Of course, it is the same case, or set of cases, observed at two points in time. However, such comparisons (in isolation from spatial controls) are useful only when the case(s) are equivalent in all respects that might affect the outcome (except, of course, for the treatment itself). More specifically, this means that (1) the effects of the intervention on the case(s) are not obscured by confounders, which are other factors occurring at roughly the same time as the intervention which might affect the outcome, and (2) the outcome under investigation either is stable or has a stable trend (so that the effect of the intervention, if any, can be observed). Note that when there is a good spatial control these issues are less important. By contrast, when there is no spatial control, they become absolutely essential to the task of causal attribution.

For temporal control the key questions to ask are: Are comparable pre- and posttests possible? Is it possible to collect data for a longer period of time so that, rather than just two data points, one can construct a longer time series? Are there trends in the outcome that must be taken into account? If trends are present, are they fairly stable? Can we anticipate that this stability will be retained over the course of the research (in the absence of any intervention)? Is the intervention correlated (temporally) with other changes that might obscure causal attribution?

External Validity

External validity is the generalizability of the project beyond a single case. To provide policymakers at USAID with relevant information, the results of a project evaluation should be generalizable; that is, they must be true (or plausibly true) beyond the case under study.
Recall that we understand that impact evaluation (as opposed to project monitoring) will most likely be an occasional event applied to a set of the most important and most frequently used projects, not one routinely undertaken for all projects. This means that the value of the evaluation is to be found in the guidance it may offer policymakers in designing projects and allocating funds over the long term and across the whole spectrum of countries in which USAID works.

There will always be questions about how much one can generalize about the impact of a project. The fact that a project worked in one place, at one time, may or may not indicate its possible success in other places and at other times. The committee recognizes that the design of USAID projects and the allocation of funds are a learning process and the political situation and opportunities for intervention in any given country are a moving target. Even so, project officers must build on what they know, and this knowledge is largely based on the experiences of projects that are currently in operation around the world. Some projects are perceived
to work well while others are perceived to work poorly or not at all. It is these general perceptions of "workability" that are the concern here. With a number of sound impact evaluations of a specific type of project in several different settings, USAID would be able to learn more from its interventions, rather than rely solely on the experiences of individuals.

To maximize the utility of such impact evaluations, each aspect of the research design must be carefully considered. Two factors are paramount: realism in evaluation design and careful case selection.

Realism means that the evaluation of a project should conform as closely as possible to existing realities on the ground; otherwise, it is likely to be dismissed as an exercise with little utility for USAID officers in the field. "Realities" refers to the political facts at home and abroad, the structure of USAID programming, and any other contextual features that might be encountered when a project is put into operation. The committee recognizes that some factors on the ground may need to be altered in order to enhance the internal validity of a research design, a matter addressed below. Yet for purposes of external validity in the policymaking world of USAID, these factors should be kept to a minimum.

Case selection refers to how cases (activities or interventions) are chosen for evaluation. Several strategies are available, each with a slightly different purpose. However, all relate to the achievement of external validity.

The most obvious strategy is to choose a typical case, a context that is, so far as one can tell, typical of that project's usual implementation and also one that embodies a typical instance of posited cause-and-effect relationships. Otherwise, it may be difficult to generalize from that project's experience.

A second strategy is known as the least likely (or most difficult) case.
If one is fairly confident of a project's effectiveness, perhaps because other studies have already been conducted on that subject, confidence can be enhanced by choosing a case that would not ordinarily be considered a strong candidate for project success. If the project is successful there, it is likely to be successful anywhere (i.e., in "easier" circumstances). Alternatively, if the project fails in a least-likely setting, then one has established a boundary for the population of cases to which the project may plausibly apply.

A third strategy is known as the most likely case. As implied, this kind of case is the inverse of the previous: It is one where a given intervention is believed most likely to succeed. This kind of case is generally useful only when the intervention, against all odds, is shown by a careful impact evaluation to have little or no effect (otherwise, common wisdom is confirmed). Failure in this setting may be devastating to the received wisdom, for it would have shown that even when conditions are favorable the project still does not attain its expected result.
Other strategies of case selection are available; further strategies and a more extended discussion can be found in Chapter 5 of Gerring (2007). For the purposes of project evaluation at USAID, however, these three appear likely to be the most useful.

Because of the varied contexts in which even "typical" USAID projects are implemented, it would be best to conduct impact evaluations to determine the effects of such projects in several different places. Ideally, USAID could choose a "typical" case, a least likely case, and a most likely case for evaluation to determine whether a project is having its desired impact. Even if this spread is not readily available, choosing two or three different sites to evaluate widely used projects would help address concerns about generalizability more effectively than using only a single site for an impact evaluation.

Building Knowledge

It is important to keep in mind that no single evaluation is likely to be regarded as complete evidence for or against a project, nor should it be. Regardless of how carefully an evaluation is designed, there is always the possibility of random error, that is, factors at work in a country or some sector of a country that cannot be controlled by carefully constructed evaluation designs. More importantly, there is always the possibility that an intervention may work differently in one setting than it does in others. Thus the process of evaluating projects should always involve multiple evaluations of the same basic intervention. This means that strategies of evaluation must take into account the body of extant knowledge on a subject and the knowledge that may arise from future studies (supported by USAID, other agencies, or the academic community). This is the process of building knowledge. The most successful companies in the private sphere tend to be "learning organizations" that constantly build knowledge about their own activities (Senge 2006).
This process may be disaggregated into four generic goals: building internal validity, building external validity, building better project design, and building new knowledge. The first three issues may be understood as various approaches to "replication." If USAID is concerned about the internal validity of an impact evaluation, subsequent evaluations should replicate the original research design as closely as possible. If USAID is concerned about the external validity of an evaluation, then replications should take place in different sites. If USAID is concerned with the specific features of a project, replications should alter those features while keeping other factors constant. The fourth issue departs from the goal of replication; here the goal is to unearth new insights into the process of development and the degree to which it may be affected by USAID policies. In this instance it is no longer so important to replicate features of previous evaluations.
Even so, the committee emphasizes that the important features of a research design (the treatment, the outcomes anticipated to result from the treatment, and the setting) should be standardized as much as possible across each evaluation. Doing so helps ensure that the results of the evaluation will be comparable to evaluations of similar projects, so that knowledge accumulates about that subject. If the treatments and evaluation designs change too much from evaluation to evaluation, less can be learned.

Using impact evaluations in no way reduces the need for sound judgment from experienced DG staff; detailed knowledge of the country and specific conditions is essential for creating a good impact design. More generally, there are often external events that can have consequences for an ongoing project or its evaluation. In such cases an experienced DG officer will need to appraise the effect of these events on the project's process and outcomes. However, an appropriate mix of evaluations offers better information about projects, on which DG staff can base new, more effective policy.

A Typology of Impact Evaluation Designs

A major goal of this chapter is to identify a reasonably comprehensive, yet also concise, typology of research designs that might be used to test the causal impact of projects supported by USAID's DG office. Six basic research designs seem potentially applicable: (1) large N with random assignment of the project; (2) large N comparison without randomized assignment of the project; (3) small N with randomized assignment of the project; (4) small N without randomized assignment of the project; (5) N = 1, where USAID has control of where or when the project is put in place; and (6) N = 1, where USAID has little control over where or when the project is placed. Each option is summarized in Table 5-1.
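The six-way classification above can be read as a simple lookup on two inputs: how many units are available and whether the project's placement can be manipulated. The sketch below is illustrative only. The function name and the numeric threshold separating "small" from "large" N are assumptions introduced here (the chapter gives no cutoff), and the returned labels follow the design names used in Table 5-1.

```python
def suggest_design(n_units: int, manipulable: bool, large_n_threshold: int = 30) -> str:
    """Return the design label suggested by the typology.

    large_n_threshold is a hypothetical cutoff chosen for illustration;
    the chapter does not state where "small" N ends and "large" N begins.
    """
    if n_units < 1:
        raise ValueError("at least one unit is required")
    if n_units == 1:
        # Single case: the only question is whether timing/location can be chosen.
        return "N=1 study (manipulable)" if manipulable else "N=1 comparison"
    size = "Large" if n_units >= large_n_threshold else "Small"
    kind = "randomization" if manipulable else "comparison"
    return f"{size} N {kind}"


# Example: six regions whose treatment order can be randomized.
print(suggest_design(6, manipulable=True))
```

In practice the boundary between small and large N is a judgment about statistical power rather than a fixed number, so the threshold should be treated as a placeholder.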
Each research design shown in the table shares a dedicated effort to collect pre- and posttreatment measures of the policy outcomes of interest. Hitherto, baseline measurements have been an inconsistent part of USAID evaluations (Bollen et al 2005); although baseline data are generally supposed to be collected as part of current program monitoring, the quality may vary substantially. The absence of good baseline data makes it much more difficult to demonstrate a causal effect. No project can be adequately tested without a good measurement of the outcome of interest prior to the policy intervention. Naturally, such a measurement should be paired with a corresponding measurement of the outcome after the policy intervention. (See Chapters 6 and 7 for further discussion of appropriate measures of outcomes, with examples from the committee's field visits.) Together, these provide pre- and posttests of the policy intervention.

Footnote: Randomized assignment of a treatment is often called an experiment in texts on research design (see, e.g., Trochim and Donnelly 2007).

TABLE 5-1 A Typology of Suggested Research Designs

Available Units (N)   Manipulability   Pre-/Posttests   Suggested Research Design
Large                 Yes              Yes              Large N randomization
Large                 No               Yes              Large N comparison
Small                 Yes              Yes              Small N randomization
Small                 No               Yes              Small N comparison
1                     Yes              Yes              N=1 study (manipulable)
1                     No               Yes              N=1 comparison

In the large N randomized assignment design (but only in that case) it is possible to evaluate project outcomes even in the absence of baseline data, as shown, for example, in Hyde (2006), where she evaluated the impact of election monitors from observed differences in the votes received by opposition parties in precincts with and without the randomly assigned monitors. However, this procedure always assumes that the intervention and control groups would show similar outcomes in the absence of any intervention. It is better, wherever possible, to check this assumption with baseline data. This is particularly important when the number of cases is modest and full randomization is not possible, and many other factors besides the intervention can affect outcomes. Even in the case of the large N randomization, baseline data are often useful for checking the assumptions on which programming is based, or for planning or evaluating other projects later.

The six research design options are distinguishable from one another along two key dimensions: (1) the number of units (N) available for analysis and (2) USAID's capacity to manipulate key features of the project's design and implementation.
Usually, the capacity to evaluate projects is enhanced when N is large (i.e., when there are a large number of individuals, organizations, or governments that can be compared to one another) and when the project can be implemented in a randomized way. The large N randomized intervention is thus regarded as the "gold standard" of project evaluation methods (Wholey et al 2004). Each step away from the large N randomized design generally involves a loss in inferential power or, in other words, less confidence in the ability to make inferences about causal impact based on the results of the evaluation.

Even so, this certainly does not imply, and the committee is not arguing, that the large N randomized intervention is the only viable evaluation tool available to USAID. If this were the case, many projects, and the millions of dollars used to fund them, could not be the subject of impact evaluations. It is for this reason that the committee offers a longer list of options than is recognized by many current texts on project evaluation (e.g., Bloom 2005, Duflo et al 2006b). But the results of the committee's visits to USAID offices in the field, review of USAID documents, and discussions with USAID DG officials and implementers suggest that using randomization is feasible at least in theory in many instances, which would greatly enhance the ability to evaluate the impacts of a project.

Footnote: For examples, see the research papers on the webpage of the Poverty Action Lab at MIT: http://www.povertyactionlab.com/papers/.

Of course, no simple classification of types can hope to address all the research design issues raised by the multifaceted programs supported by USAID's DG portfolio of projects. Arguably, every policy intervention is in some respects unique and thus poses different research design issues. Measuring impact is not easy. The committee offers the foregoing typology as a point of departure, a set of categories that capture the most salient features of different policies now supported by the USAID DG office, and the ways in which the causal impact of these policies might be feasibly evaluated. Citations in the text to existing work on these subjects should provide further guidance for project officers and implementers, although the literature on large N randomized treatment research designs is much more developed than the literature on other subjects.

1. Large N Randomized Evaluation

The ideal research design is the randomized impact evaluation.
Because of its technical demands, this approach should be employed where USAID DG officials have a strong interest in finding out the impact of an important project, especially those that are implemented in a reasonably similar form across countries (e.g., decentralization initiatives, civic education projects, election monitoring efforts). Here, a large number of units are divided by random selection into treatment and control groups, a treatment is administered, and any differences in outcomes across the two groups are examined for their significance.

Randomizing the treatment attempts to break a pool of possible treated units into two groups that are similar, indeed indistinguishable, before the treatment. Then, after the treatment, measurement on the desired outcome is taken for both groups. If there is a difference in outcomes between the groups, it can reasonably be inferred that the difference was attributable to the policy. Randomization creates the best comparisons because the two groups, treated and untreated, are more alike than in any other design.

Because randomization, with sufficiently large numbers of units, creates two groups in which all characteristics can be assumed to be equally distributed across the two groups, there is technically no need to have preintervention baseline measures, as these measures are assumed to be the same in each group due to their random assignment. The ability to do without baseline measures in large N randomized assignment designs could actually reduce the expenditure on this type of evaluation, as opposed to the costs incurred in other designs that require gathering data on baseline indicators. As discussed above, in the context of many projects in a country, gathering baseline data would still be valuable for evaluating the intervention in different ways and for measuring other efforts, including activities and outputs.

Footnote: For further discussion of these issues, see Gerring and McDermott (2007).

Another advantage of randomized assignment in large N studies is that it often is perceived as the fairest method of distributing assistance in cases where the ability of USAID to provide DG assistance is limited and cannot cover all available units. Thus, for example, if only a certain fraction of judges or legislators in a country, or a certain fraction of villages in a district, can be served by a given assistance program, having a lottery to determine who gets assistance first is often judged even by participants as the fairest way to allocate resources. Since this method also creates the best impact evaluation design, it is a situation in which the ethics of assigning assistance and the goals of evaluation design are mutually reinforcing.

Common variations on the randomized treatment include "rollout" and "waiting list" protocols. With rollout protocols the treatment is given sequentially to different groups, with the order in which groups receive the treatment determined by random assignment. This solves the problem of how to distribute valued resources in a way that eventually makes them available to all but without destroying the potential for randomized control.
It also offers the possibility of varying the treatment across each cohort, contingent on findings from previous cohorts. With waiting list protocols, the control group is composed of those groups that are otherwise qualified and hence similar to the groups receiving treatment but were placed on a waiting list because of limits on funding. Evaluation is then undertaken on random samples from both the treatment and waiting list (control) groups. These latter groups may (or may not) be treated in subsequent iterations.

There are a number of well-known problems that can undermine the effectiveness of this research design, which can be found in many methodology texts (e.g., see Box 9.2 in Trochim and Donnelly 2007), some of which will be discussed here. Perhaps most noteworthy in the case of many USAID projects is the risk of contamination, in which the treatment of some individuals or groups (e.g., training some judges or legislators) also affects the behavior of those not enrolled in training. In addition,
randomized designs may encounter other problems, such as units refusing to participate in the design or units dropping out in the middle of the intervention. However, if large numbers of cases are available, most of these issues can be reasonably dealt with by amending the research design, so that if recognized and managed, these problems will not fatally undermine the validity of the evaluation.

The committee recognizes that political pressures to work with certain groups or locations, or to "just get the project rolling," can work against the careful design and implementation of randomized assignments. These and other problems are addressed in a more detailed discussion of how to apply randomization to actual USAID DG projects in Chapter 6. The present chapter focuses mainly on the methodological reasons why the efforts needed to carry out randomized assignments for project evaluations can be worthwhile in terms of the increased confidence they provide that genuine causal relationships are being discovered and hence real project impact.

Unfortunately, from the standpoint of making the most credible impact evaluations, the units chosen to receive interventions from USAID are seldom selected at random. For example, nongovernmental organizations (NGOs) chosen for funding are often selected through a competition that results in atypical NGOs getting treatments. Or judges and administrators chosen to attend training workshops are selected based mainly on their willingness to participate. The problem here is that the criteria used for selecting NGOs and judges/administrators for participation in the project are almost certainly associated with a higher propensity to succeed in the project than would be the case for the "typical" NGO or judge, and this makes it impossible to assess project efficacy.
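The selection problem described above can be illustrated with a small simulation. This is a hedged sketch on entirely hypothetical data, not an analysis of any USAID project: the function name, the notion of a latent "capacity" score, and all numeric values are assumptions made for illustration. Each simulated unit has a baseline capacity that raises its outcome whether or not it is treated; the code then compares the naive treated-minus-control difference under competitive selection (the highest-capacity half is treated) with the same difference under a lottery.

```python
import random
import statistics


def simulate_selection_vs_lottery(n_units=2000, true_effect=0.10, seed=0):
    """Return (estimate under competitive selection, estimate under a lottery)."""
    rng = random.Random(seed)
    # Hypothetical latent capacity: a unit's propensity to succeed regardless of treatment.
    capacity = [rng.gauss(0.5, 0.15) for _ in range(n_units)]

    def diff_in_means(treated_flags):
        treated, control = [], []
        for cap, is_treated in zip(capacity, treated_flags):
            # Outcome = capacity + treatment effect (if treated) + noise.
            y = cap + (true_effect if is_treated else 0.0) + rng.gauss(0, 0.05)
            (treated if is_treated else control).append(y)
        return statistics.mean(treated) - statistics.mean(control)

    # Competitive selection: units above the median capacity receive the project.
    cutoff = statistics.median(capacity)
    selected = [cap > cutoff for cap in capacity]

    # Lottery: a randomly chosen half of the units receives the project.
    order = list(range(n_units))
    rng.shuffle(order)
    winners = set(order[: n_units // 2])
    lottery = [i in winners for i in range(n_units)]

    return diff_in_means(selected), diff_in_means(lottery)


selection_estimate, lottery_estimate = simulate_selection_vs_lottery()
print(f"competitive selection: {selection_estimate:.3f}")
print(f"lottery:               {lottery_estimate:.3f}")
```

Under these assumptions the lottery estimate lands close to the simulated true effect, while the competitive-selection estimate is inflated by the capacity gap between the groups, which is the pattern the text warns about.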
If funded NGOs are found to do well or judges/administrators who attended workshops perform better, there is no way to rule out the possibility that the success observed is simply a function of having chosen groups or people who would have succeeded anyway or whose success was much greater than could generally be expected. The only way to avoid this pitfall, and to be in a position to know whether or not the project has had a positive impact, is to choose project participants randomly and then compare their performance to that of units that were not selected to take part in the activity in question.

The bottom line is that if there is a strong commitment to answering the question "Did the resources spent on a given project truly yield positive results?" the best way to reach the most definitive answer is through an impact evaluation that involves the random selection of units for treatment and the collection of data in both treatment and control groups. As discussed in Chapter 6, many USAID DG projects that the committee encountered in the field were quite amenable in principle to randomization without significant changes in their design. And it bears repeating that in some cases of large N randomized treatment, USAID may be able to eliminate the costs of collecting baseline data, which might make this evaluation design more attractive.

Randomized evaluations are useful for determining not only whether or not a given project/activity has had an effect but also where it appears to be most effective. To see this, consider Figure 5-1, which displays hypothetical data collected on outcomes among treatment and control groups for a particular USAID project. In this example, higher scores represent more successful outcomes.

Based on these data, it can be concluded that the treatment was a success since, on average, units (people, municipalities, courts, NGOs, etc.) given the treatment scored better on the outcome of interest than units in the control group. (This would need to be confirmed with a statistical test, but for now assume the two distributions are indeed different.) It is important to point out, however, that not every unit in the treatment group did better than units in the control group. Some units in the control group did better than those in the treatment group, and some in the treatment group did worse than those in the control group. In fact, at least a handful of units in the treatment group did worse than the average unit in the control group. Also, there is quite a bit of variance in the performance of those in the treatment group. By exploring the factors associated with high and low scores among the treatment cohort, inferences can be made about which ones predispose recipients of the treatment to success or failure (or, put slightly differently, where the project works well and not so well). Thus the randomized design allows us to conclude not just whether the project was effective in achieving its goals but also where efforts should be directed in the next phase in order to maximize the impact.

FIGURE 5-1 Hypothetical outcome data from treatment and control groups. [Plot titled "Outcomes in Treatment and Control Groups"; vertical axis: score on outcome of interest; horizontal axis: control group, treatment group.]

2. Large N Comparison

Despite the utility of the large N randomized design, sometimes it is simply not possible to assign units randomly to the treatment group, even when the total number of units is large. The benefits of a large number of units for observing multiple iterations of the treatment, however, can still be exploited if one can overcome the following challenge: identifying and measuring those pretreatment differences between the treatment and control groups that might account for whatever posttreatment differences are observed. In these circumstances there are a variety of statistical procedures (e.g., propensity score matching, instrumental variables) for correcting the potential selection bias that complicates the analysis of causal effects.

The "matching" research design seeks to identify units that are similar to the ones getting treatment and then compares outcomes. For example, Heckman et al (1997) sought to evaluate a jobs training project in the United States, the Job Training Partnership Act (JTPA). The JTPA provides on-the-job training, job search assistance, and classroom training to youth and adults who qualify (see Devine and Heckman for a more detailed analysis of the program). The U.S. Department of Labor commissioned an evaluation of the project to assess the impact of the main U.S. government training project for disadvantaged workers. Evaluators collected longitudinal data on those individuals who went through the JTPA and those who did not.
Since the individuals who received the services were not chosen randomly, the evaluators constructed a nontreated group to compare them with, based on a number of criteria that matched the "in group" along many characteristics, such as location of residence, eligibility for the program, income, and education. Using this matching design, the evaluators were able to estimate the effect of the project by gathering data before and after it started.

Footnote: See Heckman (1997) for a more extensive discussion of the implicit behavioral assumptions that justify the method of matching.

Another technique to use in a large N situation is the regression discontinuity design (Shadish et al 2002:Chap. 7, Hahn et al 2001). Regression discontinuity is used in situations where the assignment of the treatment is based on the characteristics of the group that a policy is designed to affect, and the before and after outcomes of interest are measured for both groups. For example, in a reading program the assignment of a remedial reading project is based on the readers' preproject test scores. At some cutoff point, students are assigned to the project or not. The expectation is that project success would produce a more positive trend after the intervention for those below the cutoff point. The trend before and after the intervention is estimated, and the differences are compared to see if the intervention had any discernible effect.

Angrist and Lavy (1999), for example, used the regression discontinuity design to evaluate the effect of classroom size on student test scores in Israel. They compared classes with greater than and less than 40 students and found that class size was, in fact, linked to test performance.

Yet another design useful for large N samples is the difference-in-difference (DD) approach. A DD design takes two cases, one that received the project and one that did not, and compares the change between their before and after levels on the relevant outcome variable. DD estimation has become a widespread method to estimate causal relationships (Bertrand et al 2004). For example, if the DG project provides assistance to one judge and not another, before and after measures of a particular outcome variable should be taken for both and compared. In a regression that followed this design, the before-and-after difference for each judge and the difference between the two judges are both estimated. The appeal of DD comes from its simplicity as well as its potential to circumvent many of the endogeneity problems that typically arise when making comparisons between heterogeneous individuals (Meyer et al 1995).

In an example of this approach, Duflo (2000) used a DD design to evaluate the effect of school construction on education and wages in Indonesia. Across several regions she compared a region where schools had been built with another where construction had not yet taken place.
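The arithmetic behind a DD estimate can be sketched in a few lines. This is a minimal illustration with made-up scores on a 0 to 100 scale, not data from any of the studies cited; the function and argument names are hypothetical. The estimate is the before-to-after change in the treated group minus the before-to-after change in the comparison group, and it can be read as a causal effect only under the assumption that the two groups would have followed parallel trends without the project.

```python
from statistics import mean


def difference_in_differences(treated_pre, treated_post, control_pre, control_post):
    """DD estimate: change in the treated group minus change in the comparison group."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical outcome scores for two treated and two comparison units.
estimate = difference_in_differences(
    treated_pre=[38, 42],   # treated units before the project (mean 40)
    treated_post=[63, 67],  # treated units after the project (mean 65)
    control_pre=[40, 44],   # comparison units before (mean 42)
    control_post=[50, 54],  # comparison units after (mean 52)
)
print(estimate)  # treated change of 25 minus comparison change of 10
```

A simple before-and-after comparison of the treated units alone would attribute the full change of 25 points to the project; subtracting the comparison group's change removes the part that would likely have occurred anyway.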
As always, baseline data were critical to discovering any effect from the program. This design is useful when there is only one or a few treated units and is better than just a before-and-after analysis of a single unit since it offers a controlled comparison.

Efforts to use statistical methods to approximate randomized designs are only as effective as the evaluator's ability to model the selection process that led some units to be given the treatment while others were not. Attention to gathering a battery of pretreatment measures across cases is critical to an effective large N comparison. With sufficient cases and systematic efforts to measure pre- and posttreatment outcomes, large N comparisons can provide meaningful insights into project impacts even when the treatment cannot be manipulated through randomization by USAID.

3. Small N Randomization

In some instances it is possible to manipulate the policy of interest
(the treatment) but only across a very small set of cases. In this case it is not possible to use probability tests derived from statistical theory to gauge the causal impact of an experiment across groups where the treatment and control groups each have only one or several members or where there is no control whatsoever. However, in other respects the challenges posed by, and advantages accrued from, this sort of analysis are quite similar to the large N randomized design.

Where cross-unit variance is minimal (by reason of the limited number of units at one's disposal), the emphasis of the analysis necessarily shifts from spatial evidence (across units) to evidence garnered from temporal variation (i.e., to a comparison of pre- and posttests in the treated units). Naturally, one wants to maximize the number of treated units and the number of untreated controls. This can be achieved by a modified "rollout" protocol.

Note that in a large N randomized setting (as described above), the purpose of rollout procedures is usually (1) to test a complex treatment (e.g., where multiple treatments or combinations of treatments are being tested in a single research design) or (2) for purposes of distributing a valued good among the population while preserving a control group. The most crucial issue is to maximize useful variation on the available units. This can be achieved by testing each unit in a serial fashion, regarding the remaining (untreated) units as controls.

Consider a treatment that is to be administered across six regions of a country. There are only six regions, so cross-unit variation is extremely limited. To make the most of this evidence-constrained setting, the researcher may choose to implement five iterations of the same manipulated treatment, separated by some period of time (e.g., one year). During all stages of analysis, there remains at least one unit that can be regarded as a control. This style of rollout provides five pre- and posttests and a continual (albeit shrinking) set of controls. As long as contamination effects are not severe, the results from this sort of design may be more easily interpreted than the results from a simple split-sample research design (i.e., treating three regions and retaining the others as a control group). In the latter any observed variation across treatment and control groups may be due to a confounding factor that coincides temporally and correlates spatially with the intervention.

Despite the randomized nature of this intervention, it is still quite possible that other matters beyond the control of the investigator may intercede. It is not always possible to tell whether or not confounding factors are present in one or more of the cases. In a large N setting, we can be more confident that such confounding factors, if present, will be equally distributed across treatment and control groups. Not so for the small N setting. This is all the more reason to try to maximize experimental leverage by setting in motion a rollout procedure that treats each unit
separately through time. Any treatment effects that are fairly consistent across the six cases are unlikely to be the result of confounding factors and are therefore interpretable as causal rather than spurious.

Note that in a small population where all units are being treated, it is likely that there will be significant problems of contamination across units. In the scenario discussed above, for example, it is likely that untreated regions in a country will be aware of interventions implemented in other regions. Thus it is advisable to devise case selection and implementation procedures that minimize potential contamination effects. For example, in the rollout protocol discussed above, one might begin by treating regions that are most isolated, leaving the capital region for last.

Regardless of the procedure for case selection, it will be important for researchers to pay close attention to potential changes before and after the treatment is administered. That is, in small N randomization designs, it is highly advisable to collect baseline data since the comparison groups are less likely to be similar enough to compare directly.

In an example of a small N randomized evaluation, Glewwe et al (2007) used a very modest sample of 25 randomly chosen schools to evaluate the effect of the provision of textbooks on student test scores. A Dutch nonprofit organization provided textbooks to 25 rural Kenyan primary schools chosen randomly from a group of 100 candidate schools. The authors found no evidence that the project increased average scores, reduced grade repetition, or affected dropout rates (although they did find that the project increased the scores of students in the top two quintiles of preintervention academic achievement). Evidently, simply providing the textbooks only helped those who were already the most motivated or accomplished; in the absence of other changes (e.g., better attendance, more prepared or involved teachers), the books alone produced little or no change in average students' achievement. It is important to note that, like other forms of impact evaluation, this study required good baseline data to conduct its evaluation.

4. Small N Comparison

In small N designs USAID may be unable to manipulate the temporal or spatial distribution of the treatment. In this context the evaluator faces the additional hurdle of not having sufficient cases to employ statistical procedures to correct for the biases that make identifying causal effects difficult when treatments cannot be manipulated.

Nonetheless, there are still advantages to identifying units that will not be treated and gathering pre- and posttreatment measures of outcomes in both the treatment and control groups. A control group is useful here for (1) ruling out the possibility that the intervention coincided with a temporal change or trend that might account for observed changes in
the treatment group and (2) ensuring that application of the treatment was not correlated with other characteristics of the treated units that could explain observed differences between the treatment and control groups.

Ideally, the control group in a small N comparison should be matched to the treatment group as precisely as possible. With large amounts of data, propensity score matching techniques can be used to identify a control group that approximates the treated units across a range of observables. When data are not widely available, a control group can be generated qualitatively by identifying untreated units that are similar to those in the treatment group on key dimensions (other than the treatment) that might affect the outcomes of interest.

5. N = 1 Study with USAID Control over Timing and Location of Treatment

Sometimes, there is no possibility of spatial comparison. This is often the case where the unit of concern exists only at a national level (e.g., an electoral administration body), and nearby nation-states do not offer the promise of pre- or posttreatment equivalence. In this case the researcher is forced to reach causal inferences on the basis of a single case. Even so, the possibility of a manipulated treatment offers distinct advantages over the unmanipulated (observed) treatment. The ability to choose the timing of the intervention and plan observations to maximize the likelihood of accurate inferences can provide considerable leverage for credible conclusions. However, these advantages accrue only if very careful attention is paid to the timing of the intervention, the nature of the intervention, its anticipated causal effect, and the pre- and posttreatment evidence that might be available. The challenge here is to overcome the problems already highlighted with regard to simple before-and-after comparisons.
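To see why a simple before-and-after comparison can mislead, and what even a small comparison group adds, consider the following minimal sketch in Python. The outcome scores are invented purely for illustration (they do not come from any USAID project), and the function names are hypothetical. The sketch contrasts the naive before-and-after change in the treated units with a difference-in-differences estimate, which subtracts the change observed over the same period in untreated units.

```python
from statistics import mean


def before_after(pre, post):
    """Change in the average outcome from the pre- to the posttreatment period."""
    return mean(post) - mean(pre)


def difference_in_differences(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated units minus the change in the comparison units."""
    return before_after(treated_pre, treated_post) - before_after(control_pre, control_post)


# Hypothetical outcome scores for three treated and three untreated units.
treated_pre = [40, 42, 38]
treated_post = [52, 55, 49]
control_pre = [41, 39, 43]
control_post = [47, 45, 49]

naive_estimate = before_after(treated_pre, treated_post)
did_estimate = difference_in_differences(
    treated_pre, treated_post, control_pre, control_post
)

print("Naive before-and-after change:", naive_estimate)
print("Difference-in-differences estimate:", did_estimate)
```

In this invented example the treated units improve by 12 points, but the untreated units also improve by 6 points over the same period, so only about 6 points can plausibly be attributed to the project. With so few units, such an estimate remains suggestive rather than conclusive, and it still rests on the assumption that the two groups would have followed similar trends in the absence of the project.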
First, with respect to timing, it is essential that the intervention occur during a period in which no other potentially contaminating factors are at work and in which the outcome factors being observed would be expected to be relatively stable; that is, a constant trend is expected, so that any changes in that trend are easily interpreted. Naturally, these matters lie partly in the future and therefore cannot always be anticipated. Nonetheless, the delicacy of this research design (its extreme sensitivity to any violation of ceteris paribus assumptions) requires the researcher to anticipate what may occur, at least through the duration of the experiment.

Second, with respect to detrending the data, it is helpful if the researcher can gather information on the outcome(s) of interest and any potential confounders for the periods before and after the intervention. The longer the period of observation, the more confident one can be about any causal inference made (Campbell 1968/1988). Thus, if the outcome
factor being studied has been stable for a long time before the intervention, and other factors likely to have an impact on the outcome have been ruled out, one can have more confidence that any observed change in the trend was due to the intervention.

Third, with respect to the intervention itself, it is essential that it be discrete and significant enough to be easily observed. While subtle project effects may be detected in a large N randomized design, usually only very large effects can be confidently observed in a single-case setting.

Fourth, it is helpful if the intervention has more than one observable (and policy-significant) effect. This goes some way toward resolving the ever-present threats of measurement error and confounding causes. If, for example, a given intervention is expected to produce changes in three measurable independent outcomes, and all three factors change in the aftermath of an intervention, it is less likely that the noted association is spurious.

6. N = 1 Comparison

When the unit of concern exists only at the national level and the treatment cannot be manipulated by USAID, discerning causal effects is extraordinarily difficult. Observed differences in outcome measures pre- and posttreatment can be interpreted as causal effects only if the evaluator can make the case that other factors were not important. Some of the strategies described above are applicable in an N = 1 comparison if the treatment can be interpreted "as if" it was manipulated (e.g., Miron 1994). Any demonstration of a large discontinuous change in an outcome of interest following the treatment increases confidence in the causal interpretation of the effect. This requires an effort to measure the outcome(s) of interest prior to, and after, the intervention. In some cases it may be possible to identify units for comparison within the country or outside the country, in order to rule out obvious temporal confounds.
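The idea of looking for a discontinuous change against a stable prior trend can be illustrated with a minimal sketch in Python. It assumes, purely for illustration, an outcome measured at equally spaced periods, a linear pretreatment trend, and invented values; the function names are hypothetical. The sketch fits a straight line to the pretreatment observations, projects that line into the posttreatment periods, and reports the average gap between what was observed and what the prior trend would have predicted.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def shift_from_trend(pre, post):
    """Average gap between posttreatment outcomes and the projected pretreatment trend."""
    slope, intercept = fit_line(list(range(len(pre))), pre)
    post_periods = range(len(pre), len(pre) + len(post))
    gaps = [y - (intercept + slope * t) for t, y in zip(post_periods, post)]
    return sum(gaps) / len(gaps)


# Hypothetical outcome series: six periods before and three after the intervention.
pre_values = [10, 11, 12, 13, 14, 15]
post_values = [21, 22, 23]

naive_change = sum(post_values) / len(post_values) - sum(pre_values) / len(pre_values)
trend_adjusted_change = shift_from_trend(pre_values, post_values)

print("Naive before-and-after change:", naive_change)
print("Change relative to the prior trend:", trend_adjusted_change)
```

Here the naive comparison of averages (9.5) overstates the shift, because the outcome was already rising by one unit per period; relative to the projected trend the jump is about 5. A longer pretreatment series makes the projected trend more trustworthy, which is the reason for gathering as many observations as possible before the intervention.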
Take the example of an anticorruption effort funded in a specific ministry. If it can be shown that corruption levels remained unchanged in untreated ministries while shifting dramatically in a treated ministry, we gain confidence that a government-wide anticorruption effort cannot account for the effects observed in the treated ministry. But the possibility cannot be ruled out that other developments in the treated ministry (such as good leadership) are more important than the intervention in accounting for the outcome.

Or take the example of a national anticorruption effort that is rolled out in one country but not in adjacent countries or at different times in adjacent countries. Changes in outcome variables in the other countries could be tracked to seek the effects of the program; if reductions in corruption occur to a greater degree, or in a timed sequence that corresponds to the timing of roll-outs in different
countries, one can have confidence that it is not regional or global trends that were driving the reductions in corruption. On the other hand, as in the previous example, the possibility could not be ruled out that other factors, such as freer media or stronger leadership, were the key causal factors in reducing corruption rather than the specific USAID project, unless there were also measures of those possible confounding factors.

Not all USAID DG programs need to be subjected to rigorous impact evaluation. For example, if USAID is working to help a country pass a new constitution with certain human rights provisions, and several other NGOs and foreign countries are also working to that end, it may not matter how much USAID's specific activities contributed to a successful outcome; success is what matters and credit can be shared among all who contributed. (On the other hand, a subsequent impact evaluation of whether the new constitution actually resulted in an improvement in human rights may be worthwhile; this would be an N = 1 comparison designed to plot changes in human rights violations over time and look for sharp reductions following adoption of the new constitution.)

In particular, the random assignment mode of impact evaluation is probably best used only where the fair assignment of assistance naturally results in a randomized assignment of aid or where USAID uses a project in so many places, or invests so much in a project, that it is of great importance to be confident of that project's effectiveness. In most settings, worthwhile insights into project impacts can be derived from designs that include small N comparisons, as long as good baseline, outcome, and comparison group data are collected.

Examples of the Use of Randomized Evaluations in Impact Evaluations of Development Assistance (Including DG Projects)

Randomized designs have a high degree of internal validity.
By permitting a comparison of outcomes in a treatment group and a control group that can be considered identical to one another, they do a better job than any other evaluation technique of permitting evaluators to identify the impact of a given intervention. It is no surprise, therefore, that randomized evaluation is the industry standard for the assessment of new medications. It is inconceivable that a pharmaceutical company would be permitted to introduce a new medication into the market unless evidence from a randomized evaluation proved its benefits. Yet as discussed in Chapter 2, for the assessment of DG assistance programs, impact evaluations have rarely been employed. This leaves USAID in the difficult position of spending hundreds of millions of dollars on assistance programs without proven effects.
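As a rough illustration of why randomized assignment supports this kind of comparison, the following Python sketch simulates a hypothetical project with a known effect. Every number and name in it is an assumption made for the example: each unit has an unobserved "capacity" that influences its outcome, and the project adds a fixed effect of 5 points. The sketch compares the difference in mean outcomes under random assignment with the difference obtained when the project is targeted at the highest-capacity units.

```python
import random
from statistics import mean

TRUE_EFFECT = 5.0  # hypothetical project effect, fixed for the simulation


def random_assignment(capacity, rng):
    """Treat a randomly chosen half of the units."""
    half = len(capacity) // 2
    flags = [True] * half + [False] * (len(capacity) - half)
    rng.shuffle(flags)
    return flags


def targeted_assignment(capacity, rng):
    """Treat the units with above-median capacity (nonrandom selection)."""
    cutoff = sorted(capacity)[len(capacity) // 2]
    return [c >= cutoff for c in capacity]


def estimated_effect(assign, n_units=4000, seed=0):
    """Difference in mean outcomes between treated and untreated units."""
    rng = random.Random(seed)
    capacity = [rng.gauss(50, 10) for _ in range(n_units)]
    treated = assign(capacity, rng)
    outcomes = [
        c + (TRUE_EFFECT if t else 0.0) + rng.gauss(0, 5)
        for c, t in zip(capacity, treated)
    ]
    treated_mean = mean(y for y, t in zip(outcomes, treated) if t)
    control_mean = mean(y for y, t in zip(outcomes, treated) if not t)
    return treated_mean - control_mean


randomized_estimate = estimated_effect(random_assignment)
targeted_estimate = estimated_effect(targeted_assignment)

print("True effect:", TRUE_EFFECT)
print("Estimate under random assignment:", round(randomized_estimate, 2))
print("Estimate under targeted assignment:", round(targeted_estimate, 2))
```

Under random assignment the two groups have similar capacity on average, so the difference in means lands close to the true effect. Under targeted assignment the treated units were stronger to begin with, and the same calculation credits the project with that preexisting advantage.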
There is a small but important number of large N randomized impact evaluations that have been carried out to test the effects of assistance programs. Classic evaluations, such as the RAND health insurance study and the evaluation of the Job Training Partnership Act (JTPA), stand out as exemplars of large-scale assessments of social assistance programs (Wilson 1998, Gueron and Hamilton 2002, Newhouse 2004). A few have been done in developing countries; the evaluation of Mexico's conditional cash transfer program, Progresa/Oportunidades, continues to shape the design of similar programs in other contexts (Morley and Coady 2003).

The number of such evaluations is growing. In fields as diverse as public health, education, microfinance, and agricultural development, randomized evaluations are increasingly employed to assess project effectiveness. Examples abound in the field of public health: Studies have assessed the efficacy of male circumcision in combating HIV (Auvert et al 2005), the impact of HIV prevention programs on sexual behavior (Dupas 2007), the effectiveness of bed nets for reducing the incidence of malaria (Nevill et al 1996), the impact of deworming drugs on health and educational achievement (Miguel and Kremer 2004), and the role of investments in clean water technologies on health outcomes (Kremer et al 2006). In education, randomized evaluations have been used to explore the efficacy of conditional cash transfers (Schultz 2004), school meals (Vermeersch and Kremer 2004), and school uniforms and textbooks (Kremer 2003) on school enrollment; the effectiveness of additional inputs, such as teacher aides, on school performance (Banerjee and Kremer 2002); and the impact of school reforms, such as voucher programs, on academic achievement (Angrist et al 2006).
In microfinance, attention has focused on the impact of programs on household welfare (Murdoch 2005); randomized evaluations in agricultural development are exploring the benefits and impediments to the adoption of new technologies, such as hybrid seeds and fertilizer (Duflo et al 2006a).

Thus far, however, these approaches have not been applied to the evaluation of DG programs. A significant part of the explanation for this is that it is often more difficult to measure outcomes in the area of democratic governance. Most successful randomized evaluations have been conducted in areas such as health and education, where it is much more straightforward to measure outcomes. For example, the presence of intestinal parasites can be measured quite easily and accurately via stool samples (as in Miguel and Kremer 2004); water quality can be assessed via a test for E. coli content (as in Kremer et al 2006); nutritional improvements can be traced quite readily via height and weight measures; school performance or learning can be tracked easily via test scores (as in Banerjee et al 2007); and teacher absenteeism can be measured with attendance records (as in Banerjee and Duflo 2006). Developing valid and reliable measures
of the outcomes targeted by DG programs is much more difficult and stands as an important challenge for project evaluation in this area. The challenge is not insurmountable; there have been tremendous improvements over the past decade in the measurement of political participation and attitudes (Bratton et al 2005), social capital and trust (Grootaert et al 2004), and corruption (Bertrand et al 2007, Olken 2007). And as discussed in Chapter 2, USAID has made significant efforts to develop outcome indicators to support its project M&E work.

This chapter closes with two examples of impact evaluations using randomized designs applied to DG subjects that tested commonly held programming assumptions. The first addresses the issue of corruption. USAID invests significant resources every year in anticorruption initiatives, but questions remain about the efficacy of such investments. Which programs yield the biggest impact in terms of reducing corruption? Some have argued that corruption can be reduced with the right combination of monitoring and incentives provided from above (Becker and Stigler 1974). Of course, the challenge with top-down monitoring is that higher level officials may themselves be corruptible. An alternative approach has emphasized local-level monitoring (World Bank 2004). The argument is that community members have the strongest incentives to police the behavior of local officials, as they stand to benefit the most from local public goods provision. Yet this strategy also has its drawbacks: Individuals may not want to bear the costs of providing oversight, preferring to leave that to others, or community members may be easily bought off by those engaged in corrupt practices.

Which strategy most effectively reduces corruption? Olken (2007) set out to answer this question in Indonesia through a unique partnership with the World Bank.
As a nationwide village-level road-building project was rolled out, Olken randomly selected one set of villages to be subject to an external audit by the central government, a second set in which extensive efforts were made to mobilize villagers to participate in oversight and accountability meetings, a third set in which the accountability meetings were complemented by an anonymous mechanism for raising complaints about corruption in the project, and a fourth set reserved as a control group. To measure the efficacy of these different strategies, Olken constructed a direct measure of corruption: He assembled a team of engineers and surveyors who, after the projects were completed, dug core samples in each road to estimate the quantity of materials used, interviewed villagers to determine the wages paid, and surveyed suppliers to estimate local prices to construct an independent estimate of the cost of the project. The difference between the reported expenditures by the village and this independent estimate provides a direct measure of corruption. His findings strongly suggest the efficacy of
external audits: Missing expenditures were eight percentage points lower in villages subject to external monitoring. The results were less impressive for grassroots monitoring. While community members did participate in the accountability meetings in higher numbers in villages where special mobilization efforts were undertaken and they did discuss corruption-related problems (and even took action at times), no significant reductions in the level of corruption were observed. If one had relied on only the output measures or observation that characterizes many USAID M&E efforts (e.g., number of participants in community events supported by USAID programs), it might have mistakenly been concluded from the level of community participation that grassroots monitoring was making a substantial difference. But Olken's more careful methodology led him to the opposite conclusion. While there are undoubtedly benefits to mobilizing community participation for a variety of other purposes, it appears that if the goal is to reduce local corruption, supporting more external audits is considerably more effective.

Another example is the question of how best to promote a robust and vibrant civil society. USAID regularly makes substantial investments in civil society organizations (CSOs) and local NGOs with the hope of empowering the disadvantaged, building trust, enhancing cooperation, and supporting the flourishing of democratic institutions (Putnam 1993, 2000). Yet some skeptics have warned that outside support for CSOs might be counterproductive: It may produce more professionally run organizations that no longer have strong ties to their grassroots base (Skocpol 2003) and may actually change the leadership of such organizations, disempowering the disadvantaged (Igoe 2003). Knowing whether outside assistance helps or harms CSOs is a question of vital importance, and randomized evaluations have begun to offer some preliminary evidence.
Gugerty and Kremer (2006) conducted a randomized evaluation in which a sample of women's self-help associations in rural Western Kenya were randomly selected to receive a package of assistance that included organizational and management training as well as a set of valuable agricultural inputs such as tools, seeds, and fertilizer. Forty groups received assistance in the first year, while an additional 40 eligible groups served as the control group (although they were given the same assistance, just two years later).

The results are disturbing for advocates of outside funding to community groups. While members of the funded groups reported higher levels of satisfaction with their group leadership, there is little evidence that objective measures of group activity improved. Moreover, Gugerty and Kremer found that outside funding changed the nature of the group and its leadership. Younger, more educated women and women from the formal sector increasingly joined the group, and these new entrants tended to assume leadership positions and to displace older women.
Compared to their unfunded counterparts, funded groups experienced a two-thirds increase in the exit rate of older women, a troubling finding given the program's underlying objective of empowering the disempowered. Whereas an analysis of group members' satisfaction would have led project evaluators to conclude that the project was a success, the careful randomized design led Gugerty and Kremer to the opposite conclusion (and generated significant evidence that the skeptics may be right about the sometimes counterproductive impact of donor funding to CSOs).

These two examples serve to support a broader point: It is both possible and important to conduct randomized impact evaluations of projects designed to support DG. In both cases the randomized evaluations effectively measured a project's impact, but they also provided new evidence about implicit hypotheses that guide programming more broadly. In the case of corruption, the implicit hypothesis was that community empowerment is an antidote to local-level corruption; in the case of civil society support, the hypothesis was that donors can spur the growth of a vibrant civil society that empowers the disadvantaged through outside support. The evidence casts some doubt on both hypotheses and should encourage further evaluations to see if these results hold more broadly and perhaps fuel the search for alternative methods to support DG goals.

The larger point, however, is not so much the findings of these studies as the fact that they were successfully conducted on DG projects. The next chapter describes the findings of the committee's field studies and discusses how these designs could be applied to the evaluation of several of USAID's own current DG projects. It also explicitly addresses some of the common objections to using randomized evaluations more widely.

REFERENCES

Angrist, J.D., and Lavy, V. 1999.
Using Maimonides' Rule to Estimate the Effect of Class Size on Scholastic Achievement. Quarterly Journal of Economics 114:533-575.
Angrist, J.D., Bettinger, E., and Kremer, M. 2006. Long-Term Consequences of Secondary School Vouchers: Evidence from Administrative Records in Colombia. American Economic Review 96:847-862.
Auvert, B., Taljaard, D., Lagarde, E., Sobngwi-Tambekou, J., Sitta, R., and Puren, A. 2005. Randomized, Controlled Intervention Trial of Male Circumcision for Reduction of HIV Infection Risk: The ANRS 1265 Trial. PLOS Medicine. Available at: http://medicine.plosjournals.org/perlserv/?request=get-document&doi=10.1371/journal.pmed.0020298&ct=1. Accessed on February 23, 2008.
Banerjee, A.V., and Duflo, E. 2006. Addressing Absence. Journal of Economic Perspectives 20(1):117-132.
Banerjee, A.V., and Kremer, M. 2002. Teacher-Student Ratios and School Performance in Udaipur, India: A Prospective Evaluation. Washington, DC: Brookings Institution.
Banerjee, A.V., Cole, S., Duflo, E., and Linden, L. 2007. Remedying Education: Evidence from Two Randomized Experiments in India. Quarterly Journal of Economics 122(3):1235-1264.
Becker, G.S., and Stigler, G.J. 1974. Law Enforcement, Malfeasance, and the Compensation of Enforcers. Journal of Legal Studies 3:1-19.
Bertrand, M., Duflo, E., and Mullainathan, S. 2004. How Much Should We Trust Difference-In-Difference Estimates? Quarterly Journal of Economics 119(1):249-275.
Bertrand, M., Djankov, S., Hanna, R., and Mullainathan, S. 2007. Obtaining a Driving License in India: An Experimental Approach to Studying Corruption. Quarterly Journal of Economics 122:1639-1676.
Bloom, H.S., ed. 2005. Learning More from Social Experiments: Evolving Analytic Approaches. New York: Russell Sage Foundation.
Bollen, K., Paxton, P., and Morishima, R. 2005. Assessing International Evaluations: An Example from USAID's Democracy and Governance Programs. American Journal of Evaluation 26:189-203.
Bratton, M., Mattes, R., and Gyimah-Boadi, E. 2005. Public Opinion, Democracy, and Market Reform in Africa. New York: Cambridge University Press.
Campbell, D.T. 1968/1988. The Connecticut Crackdown on Speeding: Time-Series Data in Quasi-Experimental Analysis. Pp. 222-238 in Methodology and Epistemology for Social Science, E.S. Overman, ed. Chicago: University of Chicago Press.
Devine, T.J., and Heckman, J.J. 1996. The Economics of Eligibility Rules for a Social Program: A Study of the Job Training Partnership Act (JTPA) - A Summary Report. Canadian Journal of Economics 29(Special Issue: Part 1):S99-S104.
Duflo, E. 2000. Schooling and Labor Market Consequences for School Construction in Indonesia. Cambridge, MA: MIT Dept. of Economics Working Paper No. 00-06.
Duflo, E., Kremer, M., and Robinson, J. 2006a. Understanding Technology Adoption: Fertilizer in Western Kenya. Evidence from Field Experiments. Available at: http://www.econ.berkeley.edu/users/webfac/saez/e231_s06/esther.pdf. Accessed February 23, 2008.
Duflo, E., Glennerster, R., and Kremer, M. 2006b. Using Randomization in Development Economics Research: A Toolkit.
Unpublished paper. Massachusetts Institute of Technology and Abdul Latif Jameel Poverty Action Lab. Cambridge, MA.
Dupas, P. 2007. Relative Risks and the Market for Sex: Teenagers, Sugar Daddies, and HIV in Kenya. Available at: http://www.dartmouth.edu/~pascaline/. Accessed on February 23, 2008.
Gerring, J. 2007. Case Study Research: Principles and Practices. Cambridge: Cambridge University Press.
Gerring, J., and McDermott, R. 2007. An Experimental Template for Case-Study Research. American Journal of Political Science 51:688-701.
Glewwe, P., Kremer, M., and Moulin, S. 2007. Many Children Left Behind? Textbooks and Test Scores in Kenya. NBER Working Paper No. 13300. Available at: http://papers/nber.org/papers/w13300. Accessed on April 26, 2008.
Gueron, J.M., and Hamilton, G. 2002. The Role of Education and Training in Welfare Reform. Policy Brief No. 20. Washington, DC: Brookings Institution.
Grootaert, C., Narayan, D., Jones, V.N., and Woolcock, M. 2004. Measuring Social Capital: An Integrated Questionnaire. World Bank Working Paper No. 18. Washington, DC: The World Bank.
Gugerty, M.K., and Kremer, M. 2006. Outside Funding and the Dynamics of Participation in Community Associations. Background Paper. Washington, DC: World Bank. Available at http://siteresources.worldbank.org/INTPA/Resources/Training-Materials/OutsideFunding.pdf. Accessed on April 26, 2008.
Hahn, J., Todd, P., and Van der Klaauw, W. 2001. Identification and Estimation of Treatment Effects with a Regression-Discontinuity Design. Econometrica 69(1):201-209.
Heckman, J. 1997. Instrumental Variables: A Study of Implicit Behavioral Assumptions Used in Making Program Evaluations. Journal of Human Resources 32(3):441-462.
Heckman, J., Ichimura, H., and Todd, P. 1997. Matching as an Econometric Evaluation Estimator: Evidence from Evaluating a Job Training Program. Review of Economic Studies 64:605-654.
Hyde, S. 2006. The Observer Effect in International Politics: Evidence from a Natural Experiment. Unpublished paper. Yale University.
Igoe, J. 2003. Scaling Up Civil Society: Donor Money, NGOs and the Pastoralist Land Rights Movement in Tanzania. Development and Change 34(5):863-885.
Kremer, M. 2003. Randomized Evaluations of Educational Programs in Developing Countries: Some Lessons. American Economic Review 93(2):102-106.
Kremer, M., Leino, J., Miguel, E., and Zwane, A. 2006. Spring Cleaning: A Randomized Evaluation of Source Water Improvement. Available at http://economics.harvard.edu/faculty/Kremer/files/springclean.pdf. Accessed on April 26, 2008.
Meyer, B.D., Viscusi, W.K., and Durbin, D.L. 1995. Workers' Compensation and Injury Duration: Evidence from a Natural Experiment. American Economic Review 85(3):322-340.
Miguel, E., and Kremer, M. 2004. Worms: Identifying Impacts on Education and Health in the Presence of Treatment Externalities. Econometrica 72(1):159-217.
Miron, J.A. 1994. Empirical Methodology in Macroeconomics: Explaining the Success of Friedman and Schwartz's "A Monetary History of the United States, 1867-1960." Journal of Monetary Economics 34:17-25.
Morley, S., and Coady, D. 2003. Targeted Education Subsidies in Developing Countries: A Review of Recent Experiences. Washington, DC: Center for Global Development.
Murdoch, J. 2005. The Economics of Microfinance. Cambridge, MA: MIT Press.
Nevill, C.G., Some, E.S., Mung'ala, V.O., Mutemi, W., New, L., Marsh, K., Lengeler, C., and Snow, R.W. 1996. Insecticide-Treated Bednets Reduce Mortality and Severe Morbidity from Malaria Among Children on the Kenyan Coast. Tropical Medicine and International Health 1:139-146.
Newhouse, J.P. 2004.
Consumer-Directed Health Plans and the RAND Health Insurance Experiment. Health Affairs 23(6):107-113.
Olken, B.A. 2007. Monitoring Corruption: Evidence from a Field Experiment in Indonesia. Journal of Political Economy 115:200-249.
Putnam, R. 1993. Making Democracy Work: Civic Traditions in Modern Italy. Princeton, NJ: Princeton University Press.
Putnam, R. 2000. Bowling Alone. New York: Simon and Schuster.
Schultz, T.P. 2004. School Subsidies for the Poor: Evaluating the Mexican PROGRESA Poverty Program. Journal of Development Economics 74(1):199-250.
Senge, P. 2006. The Fifth Discipline: The Art and Practice of the Learning Organization, Rev. Ed. New York: Doubleday.
Skocpol, T. 2003. Diminished Democracy: From Membership to Management in American Civic Life. Julian T. Rothbaum Lecture Series, Vol. 8. Norman: University of Oklahoma Press.
Shadish, W.R., Cook, T.D., and Campbell, D.T. 2002. Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Boston: Houghton-Mifflin.
Trochim, W., and Donnelly, J.P. 2007. The Research Methods Knowledge Base, 3rd ed. Cincinnati, OH: Atomic Dog Publishing.
Vermeersch, C., and Kremer, M. 2004. School Meals, Educational Achievement, and School Competition: Evidence from a Randomized Evaluation. World Bank Policy Research Working Paper No. 3523. Washington, DC: The World Bank.
Wholey, J.S., Hatry, H.P., and Newcomer, K.E., eds. 2004. Handbook of Practical Program Evaluation, 2nd ed. San Francisco: Jossey-Bass.
Wilson, E.O. 1998. Consilience: The Unity of Knowledge. New York: Knopf.
World Bank. 2004. Monitoring & Evaluation: Some Tools, Methods & Approaches. Washington, DC: International Bank for Reconstruction, The World Bank.