5
Methodologies of Impact Evaluation

INTRODUCTION

This chapter presents a guide to impact evaluations as they are currently practiced in the field of foreign assistance. The committee recognizes, as stated before, that the application of impact evaluations to foreign assistance in general, and to democracy and governance (DG) projects in particular, is controversial. The purpose of this chapter is thus to present the range of impact evaluation designs, as a prelude to the results of the committee’s field teams’ exploration of their potential application as part of the mix of evaluations and assessments undertaken by the U.S. Agency for International Development (USAID) presented in the next two chapters.

The highest standard of credible inference in impact evaluation is achieved when the number of people, villages, neighborhoods, or other groupings is large enough, and the project design flexible enough, to allow randomized assignment to treatment and nontreatment groups. Yet the committee realizes that this method is often not practical for many DG projects. Thus this chapter also examines credible inference designs for cases where randomization is not possible and for projects with a small number of units—or even a single case—involved in the project.

Some of the material in this chapter is somewhat technical, but this is necessary if the chapter is to serve, as the committee hopes it will, as a guide to the design of useful and credible impact evaluations for DG missions and implementers. The technical terms used here are defined in the chapter text and also in the Glossary at the end of the report. Also,
examples are provided to show how such designs have already been implemented in the field for various foreign assistance and democracy assistance programs.

IMPORTANCE OF SOUND AND CREDIBLE IMPACT EVALUATIONS FOR DG ASSISTANCE

As discussed in some detail in Chapter 2, until 1995 USAID required evaluations of all its projects, including those in DG, to assess their effectiveness in meeting program goals. Most of the evaluations, however, were process evaluations: post-hoc assessments by teams of outside experts who sought to examine how a project unfolded and whether (and why) it met anticipated goals. While these were valuable examinations of how projects were implemented and their perceived effects, such evaluations generally could not provide the evidence of impact that would result from sound impact assessments. This was because in most cases they lacked necessary baseline data from before the project was begun and because in almost all cases they did not examine appropriate comparison groups to determine what most likely would have occurred in the absence of the projects (see Bollen et al [2005] for a review of past DG evaluations).

As noted, the number of such evaluations undertaken by USAID has declined in recent years. Evaluations are now optional and are conducted mainly at the discretion of individual missions for specific purposes, such as when a major project is ending and a follow-on is expected or when a DG officer feels that something has “gone wrong” and wants to understand and document the reasons for the problem. Such special evaluations can have substantial value for management purposes, but the committee believes that USAID is overlooking a major opportunity to learn systematically from its experience about project success and failure by not making impact evaluations a significant part of its monitoring and evaluation (M&E) activities where appropriate and feasible. Such impact evaluations could be particularly useful to provide insights into the effects of its largest-scale and most frequently used projects and to test key development hypotheses that guide its programming.

There are three fundamental elements of sound and credible impact evaluations. First, such evaluations require measures relevant to desired project outcomes, not merely of project activity or outputs. Second, they require good baseline, in-process, and endpoint measures of those outcomes to track the effects of interventions over time. Finally, they require comparison of those who receive assistance with appropriate nontreatment groups to determine whether any observed changes in outcomes are, in fact, due to the intervention.

The committee’s discussions with USAID staff, contractors for USAID,
and our own field study of USAID missions have shown that, even within the current structure of project monitoring, USAID is already engaged in pursuing the first and second requirements. While in some cases progress remains to be made on devising appropriate outcome measures and in ensuring the allocation of time and resources to collect baseline data, USAID has generally recognized the importance of these tasks. These efforts do vary from mission to mission, according to their available resources and priorities, so considerable variation remains among missions and projects in these regards.

However, the committee found that there is little or no evidence in current or past USAID evaluation practices that indicates the agency is making regular efforts to meet the third requirement—comparisons. With rare exceptions, USAID evaluations and missions generally do not allocate resources to baseline and follow-up measurements on nonintervention groups. Virtually all of the USAID evaluations of which the committee is aware focus on studies of groups that received USAID DG assistance, and estimates of what would have happened in the absence of such interventions are based on assumptions and subjective judgments, rather than explicit comparisons with groups that did not receive DG assistance. It is this almost total absence of comparisons with nontreated groups, more than any other single factor, that should be addressed in order to draw more credible and powerful conclusions about the impact of USAID DG projects in the future.

To briefly illustrate the importance of conducting baseline and follow-up measurements for both treated and nontreated comparison groups, consider the following two simple examples:

1. A consulting firm claims to have a training program that will make legislators more effective. To demonstrate the program’s effectiveness, the firm recruits a dozen legislators and gives them all a year of training. The firm then measures the number of bills those legislators have introduced in parliament in the year prior to the training and the number of bills introduced in the year following the training and finds that each legislator increased the number of bills he or she had introduced by 30 to 100 percent! Based on this the consultants claim they have demonstrated the efficacy of the program.

Yet to know whether or not the training really was effective, we would need to know how much each legislator’s performance would have changed if he or she had not taken the training program. One way of answering this question is to compare the performance of the legislators who were trained to the performance of a comparable set of legislators who were not. When someone points this out to the consultants and they go back and measure the legislative activity of all the legislators for
the prior year, they find that the legislators who were not in the training group introduced, on average, exactly the same number of bills as those who were trained.

What has happened? It is possible that the increase in the number of bills presented by all legislators resulted from greater experience in office, so that everyone introduces more bills in his or her third year in office than in the first year. Or there may have been a rule change, or policy pressures, that resulted in a general increase in legislative activity. Thus it is entirely possible that the observed increase in legislative activity by those trained had nothing to do with the training program at all, and the program’s effect might have been zero. Or it is possible that those legislators who signed up for the program were an unusual group. They might have been those legislators who were already the most active and who wanted to increase their skills. Thus the program might have worked for them but would not have worked for others. Another possibility is that the legislators who signed up were those who were the least active and who wanted the training to enable them to “catch up” with their more active colleagues. In this case the results do show merit to the training program, but again it is not clear how much such a program would help the average legislator improve.

The only way to resolve these various possibilities would be to have taken measures of legislative activity before and after the training program for both those legislators in the program and those not in the program. While it would be most desirable to have randomly assigned legislators to take the training or not, that is not necessary for the before and after comparison measures to still yield valuable and credible information. For example, even if legislators themselves chose who would receive the training, we would want to know whether the trained group had previously been more active, or less active, than their colleagues not receiving training. We could also then make statistical adjustments to the comparison, reflecting differences in prior legislative activity and experience between those who were trained and those who were not, to help determine what the true impact of the training program was, net of other factors that the training could not affect.

In short, simply knowing that a training program increased the legislative activity of those trained does not allow one to choose between many different hypotheses regarding the true impact of that program, which could be zero or highly effective in providing “catch-up” skills to legislators who need them. The only way to obtain sound and credible judgments of a program’s effect is with before and after measurements on both the treatment and the relevant nontreatment groups.
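
To make this comparison logic concrete, here is a minimal sketch, in Python, of the difference-in-differences arithmetic implied by the legislator example. All figures are invented for illustration and are not drawn from any USAID evaluation.

```python
# Hypothetical illustration of a before/after comparison with a nontreated group
# (difference-in-differences logic). All figures are invented.

# Bills introduced per legislator in the year before and the year after training.
trained = {"before": [4, 5, 3, 6, 4, 5, 4, 3, 5, 6, 4, 5],
           "after":  [6, 8, 5, 9, 6, 7, 6, 5, 8, 9, 6, 7]}
untrained = {"before": [4, 5, 4, 6, 3, 5, 4, 4, 5, 6, 4, 5],
             "after":  [6, 7, 6, 9, 5, 7, 6, 6, 7, 9, 6, 7]}

def mean(xs):
    return sum(xs) / len(xs)

# Naive claim: look only at the trained group's change over time.
naive_change = mean(trained["after"]) - mean(trained["before"])

# Comparison-group logic: how much did untrained legislators change anyway?
background_change = mean(untrained["after"]) - mean(untrained["before"])

# Difference-in-differences: change among the trained minus change among the untrained.
program_effect = naive_change - background_change

print(f"Change among trained legislators:   {naive_change:+.2f} bills")
print(f"Change among untrained legislators: {background_change:+.2f} bills")
print(f"Estimated program effect:           {program_effect:+.2f} bills")
```

In this invented data the naive before/after change looks impressive, but subtracting the change observed among untrained legislators leaves an estimated effect near zero, which is precisely the point of the example.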
2. The same consulting firm also claims to have a program that will increase integrity and reduce corruption among judges. To test the program’s effectiveness, the firm recruits a dozen judges to receive the program’s training for a year. When the consultants examine the rate of perceived bribery and corruption, or count cases thrown out or settled in favor of the higher status plaintiff or defendant, in those courts where the judges were trained, they find that there has been no reduction in those measures of corruption. On this basis the donor might decide that the program did not work. However, to really reach this conclusion, the donor would have to know whether, and how much, corruption would have changed if those judges had not received the training. When the donor asks for data on perceived bribery and corruption, or counts of cases thrown out or settled in favor of higher status plaintiffs or defendants, in other courts, it turns out to be much higher than in the courts where judges did receive the training.

Again, the new information forces us to ask: What really happened? It is possible that opportunities for corruption increased in the country, so that most judges moved to higher levels of corruption. In this case the constant level of corruption observed in the courts whose judges received training indicated a substantially greater ability to resist those opportunities. So, when properly evaluated against a comparison group, it turns out that the program was, in fact, effective. To be sure, however, it would be valuable to also have baseline data on corruption levels in the courts whose judges were not trained; this would confirm the belief that corruption levels increased generally except in those courts whose judges received the program. Without such data it is not known for certain whether this is true or whether the judges who signed up for the training were already those who were struggling against corruption and who started with much lower rates of corruption than other courts.

These examples underscore the vital importance of comparisons with groups not receiving the treatment in order to avoid misleading errors and to accurately evaluate project impacts. From a public policy standpoint, the cost of such errors can be high. In the examples given here, such errors might have caused aid programs to waste money on training programs that were, in fact, ineffective. Or they might have led to cuts in funding for anticorruption programs that were, in fact, highly valuable in preventing substantial increases in corruption.

This chapter discusses how best to obtain comparisons for evaluating USAID democracy assistance projects. Such comparisons range from the most rigorous possible—comparing randomly chosen treatment and nontreatment groups—to a variety of less exacting but still highly useful comparisons, including multiple and single cases, time series, and matched case designs. It bears repeating: The goal in all of these designs is to evaluate projects by using appropriate comparisons in order to increase confidence in drawing conclusions about cause and effect.

PLAN OF THIS CHAPTER

The chapter begins with a discussion of what methodologists term “internal” and “external” validity. Internal validity is defined as “the approximate truth of inferences regarding cause-effect or causal relationships” (Trochim and Donnelly 2007:G4). The greater the internal validity, the greater the confidence one can have in the conclusions that a given project evaluation reaches. The paramount goal of evaluation design is to maximize internal validity. External validity refers to whether the conclusions of a given evaluation are likely to be applicable to other projects and thereby contribute to understanding in a general sense what works and what does not. Given that USAID implements similar projects in multiple country settings, the external validity of the findings of a given project evaluation is particularly important. This section of the chapter also stresses the importance of what the committee terms “building knowledge.”

The second part of the chapter outlines a typology of evaluation methodologies that USAID missions might apply in various circumstances to maximize their ability to assess the efficacy of their programming in the DG area. Large N randomized designs permit the most credible inferences about whether a project worked or not (i.e., the greatest internal validity). By comparison, the post-hoc assessments that are the basis of many current and past USAID evaluations provide perhaps the least reliable basis for inferences about the actual causal impact of DG assistance. Between these two ends of the spectrum lie a number of different evaluation designs that offer increasing levels of confidence in the inferences one can make.

In describing these various evaluation options, the approach taken in this chapter is largely theoretical and academic. Evaluation strategies are compared and contrasted based on their methodological strengths and weaknesses, not their feasibility in the field. While a first step is taken at the end of the chapter in the direction of exploring whether the most rigorous evaluation design—large N randomized evaluation—is feasible for many DG projects, a more extensive treatment of this key question is reserved for the chapters that follow, when the committee presents the findings of its field studies, in which the feasibility of various impact evaluation designs is explored for current USAID DG programs with mission directors and DG staff.

POINTS OF CLARIFICATION

Before plunging into the discussion of evaluation methodologies, a few important points of clarification are needed. First, it should be clear that the committee’s focus on impact evaluations is not intended to deny
the need for, or imply the unimportance of, other types of M&E activities. The committee recognizes that monitoring is vital to ensure proper use of funds and that process evaluations are important management tools for investigating the implementation and reception of DG projects. This report focuses on how to develop impact evaluations because the committee believes that at present this is the most underutilized approach in DG program evaluations and that therefore USAID has the most to gain if it is feasible to add sound and credible impact evaluations to its portfolio of M&E activities.

Second, the committee recognizes that not all projects need be, or should be, chosen for the most rigorous forms of impact evaluation. Doing so would likely impose an unacceptably high cost on USAID’s DG programming. The committee is therefore recommending that such evaluations initially be undertaken only for a select few of USAID’s DG programs, a recommendation emphasized in Chapter 9. The committee does believe, however, that DG officers should be aware of the potential value of obtaining baseline and comparison group information for projects to which they attach great importance, so that they can better decide how to develop the mix of M&E efforts across the various projects that they oversee.

Third, before beginning the task of evaluating a project, precisely what is to be evaluated must be defined. Evaluating a project requires the identification of the specific intervention and a set of relevant and measurable outcomes thought to result from that policy intervention. Even this apparently simple task can pose challenges, since most DG programs are complex (compound) interventions, often combining several activities (e.g., advice, formal training, monetary incentives) and often expected to produce several desired outcomes. A project focused on the judiciary, for example, may include a range of different activities intended to bolster the independence and efficiency of the judiciary in a country and might be expected to produce a variety of outcomes, including swifter processing of cases, greater impartiality among plaintiffs and defendants, greater conformity to statutes or precedents, and greater independence vis-à-vis the executive. The evaluator must therefore decide whether to test the whole project or parts of the project or whether it would make sense, as discussed further below, to reconfigure the project to allow for clearer impact evaluation of specific interventions. As USAID’s primary focus will always be on program implementation, rather than evaluation per se, evaluators will need to respond to the challenges posed by often ambitious and multitasked programs.

At this point, a note on terminology is required. As noted above, an “activity” is defined as the most basic sort of action taken in the field, such as a training camp, a conference, advice rendered, money tendered,
and so forth. A “project” is understood to be an aggregation of activities, including all those mentioned in specific USAID contracts with implementers, such as in requests for proposals and in subsequent documents produced in connection with these projects. A project can also be referred to as an “intervention” or “treatment.”

The question of what constitutes an appropriate intervention is a critical issue faced by all attempts at project evaluation. A number of factors impinge on this decision. Lumping activities within a given project together for evaluation often makes sense. If all parts of a program are expected to contribute to common outcomes, and especially if the bundled activities will have a stronger and more readily observed outcome than the separate parts, then treating the set of activities together as a single intervention may be the best way to proceed.

In other cases, trying to separate various activities and measuring their impact may be preferred. The value of disaggregation seems clear from the standpoint of impact evaluation. After all, if only one part of a five-part program is in fact producing 90 percent of the observed results, this would be good to know, so that only that one part continues to be supported. But whether or not such a separation seems worth testing really depends on whether it is viable to offer certain parts of a project and not others.

Sometimes it is possible to test both aggregated and disaggregated components of a project in a single research design. This requires a sufficient number of cases to allow for multiple treatment groups. For example, Group A could receive one part of a program, Group B could receive two parts of a program, Group C could receive three parts of a program, and another group would be required as a control. In this example, three discrete interventions and their combination could be evaluated simultaneously.
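
As a concrete illustration of this kind of multi-arm design, the following is a minimal Python sketch of random assignment to the Group A/B/C/control scheme just described. The unit names, the number of units, and the assumption that the program components are bundled cumulatively are all invented for illustration.

```python
# Minimal sketch of random assignment to multiple treatment arms, following the
# Group A / Group B / Group C / control scheme described above. Unit names and
# the cumulative bundling of components are illustrative assumptions.
import random

units = [f"district_{i:02d}" for i in range(1, 41)]  # 40 hypothetical units

arms = {
    "control": [],                          # no components
    "A": ["training"],                      # one part of the program
    "B": ["training", "advice"],            # two parts
    "C": ["training", "advice", "grants"],  # all three parts
}

random.seed(42)          # fixed seed so the assignment is reproducible/auditable
random.shuffle(units)

# Split the shuffled list into four equal groups, one per arm.
assignment = {}
group_size = len(units) // len(arms)
for i, arm in enumerate(arms):
    for unit in units[i * group_size:(i + 1) * group_size]:
        assignment[unit] = arm

for arm in arms:
    n = sum(1 for a in assignment.values() if a == arm)
    print(f"{arm:>7}: {n} units, components = {arms[arm]}")
```

Because the groups are formed by chance, differences in outcomes across the arms can more plausibly be attributed to the added components rather than to preexisting differences among the units.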

Many additional factors may impinge on the crafting of an appropriate design for impact evaluation of a particular intervention. These are reviewed in detail in the subsequent section. The committee understands that there is no magic formula for deciding when an impact evaluation might be desirable or which design is the best trade-off in terms of costs, need for information, and policy demands. What is clear, however, is that since impact evaluations are, in effect, tests of the hypothesis that a given intervention will create different outcomes than would be observed in the absence of that intervention, how well one specifies that hypothesis greatly influences what one will find at the end of the day. The question asked determines the sort of answers that can be received. The committee wants to flag this as a critical issue for USAID policymakers and project implementers to consider; further suggestions are given in Chapters 8 and 9 for how this could be addressed as part of an overall Strategic and Operational Research Agenda project for learning about DG program effectiveness to guide policy programming.

INTERNAL VALIDITY, EXTERNAL VALIDITY, AND BUILDING KNOWLEDGE

Internal Validity

A sound and credible impact evaluation has one primary goal: to determine the impact of a particular project in a particular place at a particular time. This is usually understood as a question of internal validity. In a given instance, what causal effect did a specific policy intervention, X, have on a specific outcome, Y? This question may be rephrased as: If X were removed or altered, would Y have changed?

Note that the only way to answer this question with complete certainty is to go back in time to replay history without the project (called the “counterfactual”). Since that cannot be done, we try to come as close as possible to the “time machine” by holding constant any background features that might affect Y (the ceteris paribus conditions) while altering X, the intervention of interest. We thus replay the scenario under slightly different circumstances, observing the result (Y). It is in determining how best to simulate this counterfactual situation of replaying history without the intervention that the craft of evaluation design comes into play. Indeed, a large literature within the social sciences is devoted to this question—often characterized as a question of causal assessment or research design (e.g., Shadish et al 2002, Bloom 2005, Duflo et al 2006b). The following section attempts to reduce this complicated set of issues down to a few key ingredients, recognizing that many issues can be treated only superficially.

Consider that certain persistent features of research design may assist us in reaching conclusions about whether X really did cause Y: (1) interventions that are simple, strong, discrete, and measurable; (2) outcomes that are measurable, precise, determinate, immediate, and multiple; (3) a large sample of cases; (4) spatial equivalence between treatment and control groups; and (5) temporal equivalence between pre- and posttests. Each of these is discussed in turn.

1. The intervention: discrete, with immediate causal effects, measurable. A discrete intervention that registers immediate causal effects is easier to test because only one pre- and posttest is necessary (perhaps only a posttest if there is a control group and trends are stable or easily neutralized by the control). That is, information about the desired outcome is collected before and after the intervention. By contrast, an intervention
that takes place gradually, or has only long-term effects, is more difficult to test. A measurable intervention is, of course, easier to test than one that is resistant to operationalization (i.e., must be studied through proxies or impressionistic qualitative analysis).

2. The outcome(s): measurable, precise, determinate, and multiple. The best research designs feature outcomes that are easily observed, that can be readily measured, where the predictions of the hypotheses guiding the intervention are precise and determinate (rather than ambiguous), and where there are multiple outcomes that the theory predicts, some of which may pertain to causal processes rather than final policy outcomes. The latter is important because it provides researchers with further evidence by which to test (confirm or disconfirm) the underlying hypothesis linking the intervention to the outcome and to elucidate its causal mechanisms.

3. Large sample size. N refers here to the number of cases that are available for study in a given setting (i.e., the sample size). A larger N means that one can glean more accurate knowledge about the effectiveness of the intervention, all other things being equal. Of course, the cases within the sample must be similar enough to one another to be compared; that is, the posited causal relationship must exist in roughly the same form for all cases in the sample or any dissimilarities must be amenable to post-hoc modeling. Among the questions to be addressed are: How large is the N? How similar are the units (cases) in respects that might affect the posited causal relationships? If dissimilar, can these heterogeneous elements be neutralized by some feature of the research design (see below)?

4. Spatial equivalence (between treatment and control groups). By pure spatial comparisons what is meant are controls that mirror the treatment group in all ways that might affect the posited causal relationship. The easiest way to achieve equivalence between these two groups is to choose cases randomly from the population. Sometimes, nonrandomized selection procedures can be achieved, or exist naturally, that provide equivalence, but this is relatively rare. The key question to ask is always: How similar are the treatment and control groups in ways that might affect the intended outcome? This is often referred to as “pretreatment equivalence.” Other important questions include: Can the treatment cases be chosen randomly, or through some process that approximates random selection? Can the equivalence initially present at the point of intervention between treatment and control groups be maintained over the life of the study (i.e., over whatever time is relevant to observe the putative causal effects)? This may be referred to as “posttreatment equivalence.”

5. Temporal equivalence (between pre- and posttests). Causal attribution works by comparing spatially and/or temporally. This is usually done through pre- and posttreatment tests (i.e., measurements of the outcome before and after the intervention, creating two groups, the
preintervention group and the postintervention group). Of course, it is the same case, or set of cases, observed at two points in time. However, such comparisons (in isolation from spatial controls) are useful only when the case(s) are equivalent in all respects that might affect the outcome (except, of course, insofar as the treatment itself). More specifically, this means that (1) the effects of the intervention on the case(s) are not obscured by confounders, which are other factors occurring at roughly the same time as the intervention which might affect the outcome, and (2) the outcome under investigation either is stable or has a stable trend (so that the effect of the intervention, if any, can be observed). Note that when there is a good spatial control these issues are less important. By contrast, when there is no spatial control, they become absolutely essential to the task of causal attribution. For temporal control the key questions to ask are: Are comparable pre- and posttests possible? Is it possible to collect data for a longer period of time so that, rather than just two data points, one can construct a longer time series? Are there trends in the outcome that must be taken into account? If trends are present, are they fairly stable? Can we anticipate that this stability will be retained over the course of the research (in the absence of any intervention)? Is the intervention correlated (temporally) with other changes that might obscure causal attribution?

External Validity

External validity is the generalizability of the project beyond a single case. To provide policymakers at USAID with relevant information, the results of a project evaluation should be generalizable; that is, they must be true (or plausibly true) beyond the case under study. Recall that impact evaluation (as opposed to project monitoring) will most likely be an occasional event applied to a set of the most important and most frequently used projects, not one routinely undertaken for all projects. This means that the value of the evaluation is to be found in the guidance it may offer policymakers in designing projects and allocating funds over the long term and across the whole spectrum of countries in which USAID works.

There will always be questions about how much one can generalize about the impact of a project. The fact that a project worked in one place, at one time, may or may not indicate its possible success in other places and at other times. The committee recognizes that the design of USAID projects and the allocation of funds are a learning process and the political situation and opportunities for intervention in any given country are a moving target. Even so, project officers must build on what they know, and this knowledge is largely based on the experiences of projects that are currently in operation around the world. Some projects are perceived
[...]
(the treatment) but only across a very small set of cases. In this case it is not possible to use probability tests derived from statistical theory to gauge the causal impact of an experiment across groups where the treatment and control groups each have only one or several members or where there is no control whatsoever. However, in other respects the challenges posed by, and advantages accrued from, this sort of analysis are quite similar to the large N randomized design.

Where cross-unit variance is minimal (by reason of the limited number of units at one’s disposal), the emphasis of the analysis necessarily shifts from spatial evidence (across units) to evidence garnered from temporal variation (i.e., to a comparison of pre- and posttests in the treated units). Naturally, one wants to maximize the number of treated units and the number of untreated controls. This can be achieved by a modified “rollout” protocol. Note that in a large N randomized setting (as described above), the purpose of rollout procedures is usually (1) to test a complex treatment (e.g., where multiple treatments or combinations of treatments are being tested in a single research design) or (2) for purposes of distributing a valued good among the population while preserving a control group. The most crucial issue is to maximize useful variation on the available units. This can be achieved by testing each unit in a serial fashion, regarding the remaining (untreated) units as controls.

Consider a treatment that is to be administered across six regions of a country. There are only six regions, so cross-unit variation is extremely limited. To make the most of this evidence-constrained setting, the researcher may choose to implement five iterations of the same manipulated treatment, separated by some period of time (e.g., one year). During all stages of analysis, there remains at least one unit that can be regarded as a control. This style of rollout provides five pre- and posttests and a continual (albeit shrinking) set of controls. As long as contamination effects are not severe, the results from this sort of design may be more easily interpreted than the results from a simple split-sample research design (i.e., treating three regions and retaining the others as a control group). In the latter, any observed variation across treatment and control groups may be due to a confounding factor that coincides temporally and correlates spatially with the intervention. Despite the randomized nature of this intervention, it is still quite possible that other matters beyond the control of the investigator may intercede. It is not always possible to tell whether or not confounding factors are present in one or more of the cases. In a large N setting, we can be more confident that such confounding factors, if present, will be equally distributed across treatment and control groups. Not so for the small N setting. This is all the more reason to try to maximize experimental leverage by setting in motion a rollout procedure that treats each unit
separately through time. Any treatment effects that are fairly consistent across the six cases are unlikely to be the result of confounding factors and are therefore interpretable as causal rather than spurious.

Note that in a small population where all units are being treated, it is likely that there will be significant problems of contamination across units. In the scenario discussed above, for example, it is likely that untreated regions in a country will be aware of interventions implemented in other regions. Thus it is advisable to devise case selection and implementation procedures that minimize potential contamination effects. For example, in the rollout protocol discussed above, one might begin by treating regions that are most isolated, leaving the capital region for last. Regardless of the procedure for case selection, it will be important for researchers to pay close attention to potential changes before and after the treatment is administered. That is, in small N randomization designs, it is highly advisable to collect baseline data since the comparison groups are less likely to be similar enough to compare directly.
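
The scheduling logic of such a rollout can be made concrete with a minimal sketch in Python. The region names, the one-year spacing, and the choice to hold the capital out as the persistent control are illustrative assumptions, not a prescription from the committee.

```python
# Minimal sketch of the rollout protocol described above: six regions, one new
# region treated each year, with not-yet-treated regions serving as controls and
# the capital deliberately treated last (here, held out) to limit contamination.
# Region names and ordering are hypothetical.
regions = ["remote_north", "remote_south", "coastal_east",
           "coastal_west", "central_plain", "capital"]

# Five iterations of the treatment, one per year; the capital is not treated
# during the study and anchors the control series throughout.
schedule = {year: region for year, region in enumerate(regions[:-1], start=1)}

for year, newly_treated in schedule.items():
    treated_so_far = list(schedule.values())[:year]
    controls = [r for r in regions if r not in treated_so_far]
    print(f"Year {year}: treat {newly_treated:>13} | controls remaining: {controls}")
```

Each year supplies a fresh pre/post comparison for the newly treated region, while the not-yet-treated regions provide the contemporaneous controls described above.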

In an example of a small N randomized evaluation, Glewwe et al (2007) used a very modest sample of 25 randomly chosen schools to evaluate the effect of the provision of textbooks on student test scores. A Dutch nonprofit organization provided textbooks to 25 rural Kenyan primary schools chosen randomly from a group of 100 candidate schools. The authors found no evidence that the project increased average scores, reduced grade repetition, or affected dropout rates (although they did find that the project increased the scores of the top two quintiles of those with the highest preintervention academic achievement). Evidently, simply providing the textbooks only helped those who were already the most motivated or accomplished; in the absence of other changes (e.g., better attendance, more prepared or involved teachers), the books alone produced little or no change in average students’ achievement. It is important to note that, like other forms of impact evaluation, this study required good baseline data.

4. Small N Comparison

In small N designs USAID may be unable to manipulate the temporal or spatial distribution of the treatment. In this context the evaluator faces the additional hurdle of not having sufficient cases to employ statistical procedures to correct for the biases that make identifying causal effects difficult when treatments cannot be manipulated. Nonetheless, there are still advantages to identifying units that will not be treated and gathering pre- and posttreatment measures of outcomes in both the treatment and control groups. A control group is useful here for (1) ruling out the possibility that the intervention coincided with a temporal change or trend that might account for observed changes in the treatment group and (2) ensuring that application of the treatment was not correlated with other characteristics of the treated units that could explain observed differences between the treatment and control groups.

Ideally, the control group in a small N comparison should be matched to the treatment group as precisely as possible. With large amounts of data, propensity score matching techniques can be used to identify a control group that approximates the treated units across a range of observables. When data are not widely available, a control group can be generated qualitatively by identifying untreated units that are similar to those in the treatment group on key dimensions (other than the treatment) that might affect the outcomes of interest.
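
Where richer data are available, the matching step mentioned above might look something like the following minimal Python sketch. The covariates, sample size, and treatment-assignment rule are invented; this illustrates the general propensity score matching technique rather than any specific USAID procedure.

```python
# Minimal sketch of propensity score matching: estimate each unit's probability
# of treatment from observed characteristics, then pair each treated unit with
# the untreated unit whose score is closest. All data are invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
# Hypothetical covariates for 200 municipalities: log population and baseline turnout.
X = np.column_stack([rng.normal(10, 1, n), rng.uniform(0.3, 0.8, n)])
# Treatment uptake is (artificially) more likely in larger, higher-turnout places,
# which is exactly the kind of selection a matched control group must address.
p_true = 1 / (1 + np.exp(-(X[:, 0] - 10 + 2 * (X[:, 1] - 0.55))))
treated = rng.random(n) < p_true

# Step 1: estimate each unit's propensity score from its observed characteristics.
scores = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]

# Step 2: pair each treated unit with the untreated unit whose score is closest.
treated_idx = np.where(treated)[0]
control_idx = np.where(~treated)[0]
matches = {
    int(t): int(control_idx[np.argmin(np.abs(scores[control_idx] - scores[t]))])
    for t in treated_idx
}

# The matched controls, rather than all untreated units, form the comparison group.
matched_controls = sorted(set(matches.values()))
print(f"{len(treated_idx)} treated units matched to "
      f"{len(matched_controls)} distinct control units")
```

Outcomes for the treated units would then be compared with outcomes for their matched controls, with the usual caveat that matching can only balance characteristics that are actually observed.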

5. N = 1 Study with USAID Control over Timing and Location of Treatment

Sometimes, there is no possibility of spatial comparison. This is often the case where the unit of concern exists only at a national level (e.g., an electoral administration body), and nearby nation-states do not offer the promise of pre- or posttreatment equivalence. In this case the researcher is forced to reach causal inferences on the basis of a single case. Even so, the possibility of a manipulated treatment offers distinct advantages over the unmanipulated (observed) treatment. The ability to choose the timing of the intervention and plan observations to maximize the likelihood of accurate inferences can provide considerable leverage for credible conclusions. However, these advantages accrue only if very careful attention is paid to the timing of the intervention, the nature of the intervention, its anticipated causal effect, and the pre- and posttreatment evidence that might be available. The challenge is to overcome the problems already highlighted with regard to simple before and after comparisons.

First, with respect to timing, it is essential that the intervention occur during a period in which no other potentially contaminating factors are at work and in which the outcome factors being observed would be expected to be relatively stable; that is, a constant trend is expected, so that any changes in that trend are easily interpreted. Naturally, these matters lie partly in the future and therefore cannot always be anticipated. Nonetheless, the delicacy of this research design—its extreme sensitivity to any violation of ceteris paribus assumptions—requires the researcher to anticipate what may occur, at least through the duration of the experiment.

Second, with respect to detrending the data, it is helpful if the researcher can gather information on the outcome(s) of interest and any potential confounders for the periods before and after the intervention. The longer the period of observation, the more confident one can be about any causal inference made (Campbell 1968/1988). Thus, if the outcome factor being studied has been stable for a long time before the intervention, and other factors likely to have an impact on the outcome have been ruled out, one can have more confidence that any observed change in the trend was due to the intervention.

Third, with respect to the intervention itself, it is essential that it be discrete and significant enough to be easily observed. While subtle project effects may be detected in a large N randomized design, usually only very large effects can be confidently observed in a single-case setting.

Fourth, it is helpful if the intervention has more than one observable (and policy-significant) effect. This goes some way toward resolving the ever-present threats of measurement error and confounding causes. If, for example, a given intervention is expected to produce changes in three measurable independent outcomes, and all three factors change in the aftermath of an intervention, it is less likely that the noted association is spurious.
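
A minimal Python sketch of this single-case, before-and-after logic follows. The outcome series (monthly case-processing times for a hypothetical national court) and the timing of the intervention are invented; the point is simply that a long, stable baseline makes a post-intervention shift easier to read.

```python
# Minimal sketch of the single-case logic described above: a long, stable
# pre-intervention series makes a post-intervention shift easier to attribute
# to the intervention. The outcome series is invented for illustration.
# Monthly case-processing times (days) for one national court: 24 months of
# baseline followed by 12 months after a hypothetical intervention in month 25.
baseline = [92, 95, 93, 94, 96, 92, 95, 94, 93, 95, 94, 96,
            93, 95, 94, 92, 96, 95, 93, 94, 95, 93, 94, 95]
post = [84, 82, 80, 79, 78, 77, 78, 76, 77, 75, 76, 75]

def mean(xs):
    return sum(xs) / len(xs)

pre_mean, post_mean = mean(baseline), mean(post)
# Spread of the baseline series: if the post-intervention change is large
# relative to this ordinary month-to-month variation, a causal reading is
# more plausible (absent confounders occurring at the same time).
baseline_spread = max(baseline) - min(baseline)

print(f"Baseline mean:  {pre_mean:.1f} days (month-to-month range {baseline_spread} days)")
print(f"Post mean:      {post_mean:.1f} days")
print(f"Observed shift: {post_mean - pre_mean:+.1f} days")
```

If the shift is large relative to the ordinary month-to-month variation in the baseline, and no plausible confounder occurred at the same time, a causal reading becomes more credible; a short or noisy baseline would not support the same inference.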

6. N = 1 Comparison

When the unit of concern exists only at the national level and the treatment cannot be manipulated by USAID, discerning causal effects is extraordinarily difficult. Observed differences in outcome measures pre- and posttreatment can be interpreted as causal effects only if the evaluator can make the case that other factors were not important. Some of the strategies described above are applicable in an N = 1 comparison if the treatment can be interpreted “as if” it was manipulated (e.g., Miron 1994). Any demonstration of a large discontinuous change in an outcome of interest following the treatment increases confidence in the causal interpretation of the effect. This requires an effort to measure the outcome(s) of interest prior to, and after, the intervention. In some cases it may be possible to identify units for comparison within the country or outside the country, in order to rule out obvious temporal confounds.

Take the example of an anticorruption effort funded in a specific ministry. If it can be shown that corruption levels remained unchanged in untreated ministries while shifting dramatically in a treated ministry, we gain confidence that a government-wide anticorruption effort cannot account for the effects observed in the treated ministry. But the possibility cannot be ruled out that other developments in the treated ministry (such as good leadership) are more important than the intervention in accounting for the outcome. Or take the example of a national anticorruption effort that is rolled out in one country but not in adjacent countries or at different times in adjacent countries. Changes in outcome variables in the other countries could be tracked to seek the effects of the program; if reductions in corruption occur to a greater degree, or in a timed sequence that corresponds to the timing of rollouts in different countries, one can have confidence that it is not regional or global trends that were driving the reductions in corruption. On the other hand, as in the previous example, the possibility could not be ruled out that other factors, such as freer media or stronger leadership, were the key causal factors in reducing corruption rather than the specific USAID project, unless there were also measures of those possible confounding factors.

Not all USAID DG programs need to be subjected to rigorous impact evaluation. For example, if USAID is working to help a country pass a new constitution with certain human rights provisions, and several other NGOs and foreign countries are also working to that end, it may not matter how much USAID’s specific activities contributed to a successful outcome; success is what matters and credit can be shared among all who contributed. (On the other hand, a subsequent impact evaluation of whether the new constitution actually resulted in an improvement in human rights—an N = 1 comparison designed to plot changes in human rights violations over time and look for sharp reductions following adoption of the new constitution—may be worthwhile.)

In particular, the random assignment mode of impact evaluation is probably best used only where the fair assignment of assistance naturally results in a randomized assignment of aid or where USAID uses a project in so many places, or invests so much in a project, that it is of great importance to be confident of that project’s effectiveness. In most settings, worthwhile insights into project impacts can be derived from designs that include small N comparisons, as long as good baseline, outcome, and comparison group data are collected.

EXAMPLES OF THE USE OF RANDOMIZED EVALUATIONS IN IMPACT EVALUATIONS OF DEVELOPMENT ASSISTANCE (INCLUDING DG PROJECTS)

Randomized designs have a high degree of internal validity. By permitting a comparison of outcomes in a treatment group and a control group that can be considered identical to one another, they do a better job than any other evaluation technique of permitting evaluators to identify the impact of a given intervention. It is no surprise, therefore, that randomized evaluation is the industry standard for the assessment of new medications. It is inconceivable that a pharmaceutical company would be permitted to introduce a new medication into the market unless evidence from a randomized evaluation proved its benefits. Yet as discussed in Chapter 2, for the assessment of DG assistance programs, impact evaluations have rarely been employed. This leaves USAID in the difficult position of spending hundreds of millions of dollars on assistance programs without proven effects.
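
For readers who want to see the core arithmetic such a design rests on, the following is a minimal Python sketch of the comparison at the heart of a large N randomized evaluation: a difference in mean outcomes between randomly formed groups, with a rough standard error. The outcome variable and all numbers are invented.

```python
# Minimal sketch of the core comparison in a large N randomized evaluation:
# difference in mean outcomes between randomly assigned treatment and control
# groups, with a rough standard error. All outcome data are invented.
import math
import random

random.seed(1)
# Hypothetical outcome (e.g., an index of citizen contact with local officials)
# for 150 treated and 150 control communities.
treatment = [random.gauss(0.55, 0.20) for _ in range(150)]
control = [random.gauss(0.50, 0.20) for _ in range(150)]

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

effect = mean(treatment) - mean(control)
# Standard error of a difference in means for two independent samples.
se = math.sqrt(variance(treatment) / len(treatment) + variance(control) / len(control))

print(f"Estimated effect: {effect:+.3f}")
print(f"Approx. 95% interval: [{effect - 1.96 * se:+.3f}, {effect + 1.96 * se:+.3f}]")
```

Because assignment is random, the two groups are expected to be alike in every other respect, so an interval that excludes zero is evidence that the program itself moved the outcome.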

There are a small but important number of large N randomized impact evaluations that have been carried out to test the effects of assistance programs. Classic evaluations, such as the RAND health insurance study and the evaluation of the Job Training Partnership Act (JTPA), stand out as exemplars of large-scale assessments of social assistance programs (Wilson 1998, Gueron and Hamilton 2002, Newhouse 2004). A few have been done in developing countries; the evaluation of Mexico’s conditional cash transfer program, Progresa/Oportunidades, continues to shape the design of similar programs in other contexts (Morley and Coady 2003).

The number of such evaluations is growing. In fields as diverse as public health, education, microfinance, and agricultural development, randomized evaluations are increasingly employed to assess project effectiveness. Examples abound in the field of public health: Studies have assessed the efficacy of male circumcision in combating HIV (Auvert et al 2005), the impact of HIV prevention programs on sexual behavior (Dupas 2007), the effectiveness of bed nets for reducing the incidence of malaria (Nevill et al 1996), the impact of deworming drugs on health and educational achievement (Miguel and Kremer 2004), and the role of investments in clean water technologies on health outcomes (Kremer et al 2006). In education, randomized evaluations have been used to explore the efficacy of conditional cash transfers (Schultz 2004), school meals (Vermeersch and Kremer 2004), and school uniforms and textbooks (Kremer 2003) on school enrollment; the effectiveness of additional inputs, such as teacher aides, on school performance (Banerjee and Kremer 2002); and the impact of school reforms, such as voucher programs, on academic achievement (Angrist et al 2006). In microfinance, attention has focused on the impact of programs on household welfare (Murdoch 2005); randomized evaluations in agricultural development are exploring the benefits and impediments to the adoption of new technologies, such as hybrid seeds and fertilizer (Duflo et al 2006a).

Thus far, however, these approaches have not been applied to the evaluation of DG programs. A significant part of the explanation for this is that it is often more difficult to measure outcomes in the area of democratic governance. Most successful randomized evaluations have been conducted in areas such as health and education, where it is much more straightforward to measure outcomes. For example, the presence of intestinal parasites can be measured quite easily and accurately via stool samples (as in Miguel and Kremer 2004); water quality can be assessed via a test for E. coli content (as in Kremer et al 2006); nutritional improvements can be traced quite readily via height and weight measures; school performance or learning can be tracked easily via test scores (as in Banerjee et al 2007); and teacher absenteeism can be measured with attendance records (as in Banerjee and Duflo 2006). Developing valid and reliable measures
of the outcomes targeted by DG programs is much more difficult and stands as an important challenge for project evaluation in this area. The challenge is not insurmountable; there have been tremendous improvements over the past decade in the measurement of political participation and attitudes (Bratton et al 2005), social capital and trust (Grootaert et al 2004), and corruption (Bertrand et al 2007, Olken 2007). And as discussed in Chapter 2, USAID has made significant efforts to develop outcome indicators to support its project M&E work.

This chapter closes with two examples of impact evaluations using randomized designs applied to DG subjects that tested commonly held programming assumptions. The first addresses the issue of corruption. USAID invests significant resources every year in anticorruption initiatives, but questions remain about the efficacy of such investments. Which programs yield the biggest impact in terms of reducing corruption? Some have argued that corruption can be reduced with the right combination of monitoring and incentives provided from above (Becker and Stigler 1974). Of course, the challenge with top-down monitoring is that higher level officials may themselves be corruptible. An alternative approach has emphasized local-level monitoring (World Bank 2004). The argument is that community members have the strongest incentives to police the behavior of local officials, as they stand to benefit the most from local public goods provision. Yet this strategy also has its drawbacks: Individuals may not want to bear the costs of providing oversight, preferring to leave that to others, or community members may be easily bought off by those engaged in corrupt practices.

Which strategy most effectively reduces corruption? Olken (2007) set out to answer this question in Indonesia through a unique partnership with the World Bank. As a nationwide village-level road-building project was rolled out, Olken randomly selected one set of villages to be subject to an external audit by the central government, a second set in which extensive efforts were made to mobilize villagers to participate in oversight and accountability meetings, a third set in which the accountability meetings were complemented by an anonymous mechanism for raising complaints about corruption in the project, and a fourth set reserved as a control group. To measure the efficacy of these different strategies, Olken constructed a direct measure of corruption: He assembled a team of engineers and surveyors who, after the projects were completed, dug core samples in each road to estimate the quantity of materials used, interviewed villagers to determine the wages paid, and surveyed suppliers to estimate local prices to construct an independent estimate of the cost of the project. The difference between the reported expenditures by the village and this independent estimate provides a direct measure of corruption.
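
In code, the bookkeeping behind such an outcome measure is simple; the Python sketch below uses invented village records (not Olken's data) to show how "missing expenditures" would be computed and compared across randomly assigned study arms.

```python
# Stylized sketch of the corruption measure described above: "missing
# expenditures" are the gap between what a village reported spending and an
# independent engineering estimate of what the road actually cost. The village
# records below are invented and are not Olken's data.
villages = [
    # (study arm, reported spending, independent cost estimate), in local currency
    ("audit",   100_000, 96_000),
    ("audit",   120_000, 113_000),
    ("control", 100_000, 88_000),
    ("control", 110_000, 95_000),
]

def missing_share(reported, estimated):
    """Share of reported spending not accounted for by the independent estimate."""
    return (reported - estimated) / reported

by_arm = {}
for arm, reported, estimated in villages:
    by_arm.setdefault(arm, []).append(missing_share(reported, estimated))

for arm, shares in by_arm.items():
    avg = sum(shares) / len(shares)
    print(f"{arm:>7}: average missing expenditures = {avg:.1%}")
```

The evaluation then asks whether the average share of missing expenditures differs systematically between the randomly assigned arms.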

His findings strongly suggest the efficacy of external audits: Missing expenditures were eight percentage points lower in villages subject to external monitoring. The results were less impressive for grassroots monitoring. While community members did participate in the accountability meetings in higher numbers in villages where special mobilization efforts were undertaken and they did discuss corruption-related problems (and even took action at times), no significant reductions in the level of corruption were observed. If one had relied only on the output measures or observations that characterize many USAID M&E efforts (e.g., number of participants in community events supported by USAID programs), it might have mistakenly been concluded from the level of community participation that grassroots monitoring was making a substantial difference. But Olken’s more careful methodology led him to the opposite conclusion. While there are undoubtedly benefits to mobilizing community participation for a variety of other purposes, it appears that if the goal is to reduce local corruption, supporting more external audits is considerably more effective.

Another example is the question of how best to promote a robust and vibrant civil society. USAID regularly makes substantial investments in civil society organizations (CSOs) and local NGOs with the hope of empowering the disadvantaged, building trust, enhancing cooperation, and supporting the flourishing of democratic institutions (Putnam 1993, 2000). Yet some skeptics have warned that outside support for CSOs might be counterproductive: It may produce more professionally run organizations that no longer have strong ties to their grassroots base (Skocpol 2003) and may actually change the leadership of such organizations, disempowering the disadvantaged (Igoe 2003). Knowing whether outside assistance helps or harms CSOs is a question of vital importance, and randomized evaluations have begun to offer some preliminary evidence.

Gugerty and Kremer (2006) conducted a randomized evaluation in which a sample of women’s self-help associations in rural Western Kenya was randomly selected to receive a package of assistance that included organizational and management training as well as a set of valuable agricultural inputs such as tools, seeds, and fertilizer. Forty groups received assistance in the first year, while an additional 40 eligible groups served as the control group (although they were given the same assistance, just two years later). The results are disturbing for advocates of outside funding to community groups. While members of the funded groups reported higher levels of satisfaction with their group leadership, there is little evidence that objective measures of group activity improved. Moreover, Gugerty and Kremer found that outside funding changed the nature of the group and its leadership. Younger, more educated women and women from the formal sector increasingly joined the group, and these new entrants tended to assume leadership positions and to displace older women.

Another example is the question of how best to promote a robust and vibrant civil society. USAID regularly makes substantial investments in civil society organizations (CSOs) and local NGOs with the hope of empowering the disadvantaged, building trust, enhancing cooperation, and supporting the flourishing of democratic institutions (Putnam 1993, 2000). Yet some skeptics have warned that outside support for CSOs might be counterproductive: It may produce more professionally run organizations that no longer have strong ties to their grassroots base (Skocpol 2003) and may actually change the leadership of such organizations, disempowering the disadvantaged (Igoe 2003). Knowing whether outside assistance helps or harms CSOs is a question of vital importance, and randomized evaluations have begun to offer some preliminary evidence.

Gugerty and Kremer (2006) conducted a randomized evaluation in which a sample of women's self-help associations in rural Western Kenya was randomly selected to receive a package of assistance that included organizational and management training as well as a set of valuable agricultural inputs such as tools, seeds, and fertilizer. Forty groups received assistance in the first year, while an additional 40 eligible groups served as the control group (although they were given the same assistance, just two years later).

The results are disturbing for advocates of outside funding to community groups. While members of the funded groups reported higher levels of satisfaction with their group leadership, there is little evidence that objective measures of group activity improved. Moreover, Gugerty and Kremer found that outside funding changed the nature of the group and its leadership. Younger, more educated women and women from the formal sector increasingly joined the group, and these new entrants tended to assume leadership positions and to displace older women. Compared to their unfunded counterparts, funded groups experienced a two-thirds increase in the exit rate of older women—a troubling finding given the program's underlying objective of empowering the disempowered. Whereas an analysis of group members' satisfaction would have led project evaluators to conclude that the project was a success, the careful randomized design led Gugerty and Kremer to the opposite conclusion (and generated significant evidence that the skeptics may be right about the sometimes counterproductive impact of donor funding to CSOs).
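As a rough illustration of the comparison that underlies a phase-in design of this kind, the sketch below (hypothetical group names and invented aggregate figures, not Gugerty and Kremer's data) shows random assignment of eligible groups to immediate versus delayed funding and a comparison of an objective outcome, here the exit rate of older members, between the two waves.

import random

def phase_in_assignment(group_ids, n_first_wave, seed=0):
    """Randomly pick which eligible groups are funded now; the rest wait and serve as controls."""
    rng = random.Random(seed)
    first_wave = set(rng.sample(list(group_ids), n_first_wave))
    return {gid: ("funded_now" if gid in first_wave else "funded_later") for gid in group_ids}

def exit_rate(older_members_at_start, older_members_who_left):
    """Share of older members who left the group over the study period."""
    return older_members_who_left / older_members_at_start

# 80 hypothetical eligible groups, 40 funded in the first wave.
assignment = phase_in_assignment(["group_%02d" % i for i in range(80)], n_first_wave=40)

# Invented figures, chosen only to mirror the direction of the finding described above.
rate_funded_now = exit_rate(older_members_at_start=300, older_members_who_left=50)    # about 0.17
rate_funded_later = exit_rate(older_members_at_start=300, older_members_who_left=30)  # 0.10
print("exit rate, funded now: %.2f; funded later: %.2f" % (rate_funded_now, rate_funded_later))
print("relative increase: %.0f%%" % (100 * (rate_funded_now / rate_funded_later - 1)))  # roughly two-thirds

Because the not-yet-funded groups were selected by lottery from the same eligible pool, they provide a credible benchmark for what would have happened to the funded groups in the absence of assistance.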

These two examples serve to support a broader point: It is both possible and important to conduct randomized impact evaluations of projects designed to support DG. In both cases the randomized evaluations effectively measured a project's impact, but they also provided new evidence about implicit hypotheses that guide programming more broadly. In the case of corruption, the implicit hypothesis was that community empowerment is an antidote to local-level corruption; in the case of civil society support, the hypothesis was that donors can spur the growth of a vibrant civil society that empowers the disadvantaged through outside support. The evidence casts some doubt on both hypotheses and should encourage further evaluations to see if these results hold more broadly and perhaps fuel the search for alternative methods to support DG goals.

The larger point, however, is not so much the findings of these studies as the fact that they were successfully conducted on DG projects. The next chapter describes the findings of the committee's field studies and discusses how these designs could be applied to the evaluation of several of USAID's own current DG projects. It also explicitly addresses some of the common objections to using randomized evaluations more widely.

REFERENCES

Angrist, J.D., and Lavy, V. 1999. Using Maimonides' Rule to Estimate the Effect of Class Size on Scholastic Achievement. Quarterly Journal of Economics 114:533-575.
Angrist, J.D., Bettinger, E., and Kremer, M. 2006. Long-Term Consequences of Secondary School Vouchers: Evidence from Administrative Records in Colombia. American Economic Review 96:847-862.
Auvert, B., Taljaard, D., Lagarde, E., Sobngwi-Tambekou, J., Sitta, R., and Puren, A. 2005. Randomized, Controlled Intervention Trial of Male Circumcision for Reduction of HIV Infection Risk: The ANRS 1265 Trial. PLoS Medicine. Available at: http://medicine.plosjournals.org/perlser/?request=get-document&doi=0./journal.pmed.000&ct=. Accessed on February 23, 2008.
Banerjee, A.V., and Duflo, E. 2006. Addressing Absence. Journal of Economic Perspectives 20(1):117-132.
Banerjee, A.V., and Kremer, M. 2002. Teacher-Student Ratios and School Performance in Udaipur, India: A Prospective Evaluation. Washington, DC: Brookings Institution.
Banerjee, A.V., Cole, S., Duflo, E., and Linden, L. 2007. Remedying Education: Evidence from Two Randomized Experiments in India. Quarterly Journal of Economics 122(3):1235-1264.
Becker, G.S., and Stigler, G.J. 1974. Law Enforcement, Malfeasance, and the Compensation of Enforcers. Journal of Legal Studies 3:1-19.
Bertrand, M., Duflo, E., and Mullainathan, S. 2004. How Much Should We Trust Differences-in-Differences Estimates? Quarterly Journal of Economics 119(1):249-275.
Bertrand, M., Djankov, S., Hanna, R., and Mullainathan, S. 2007. Obtaining a Driving License in India: An Experimental Approach to Studying Corruption. Quarterly Journal of Economics 122:1639-1676.
Bloom, H.S., ed. 2005. Learning More from Social Experiments: Evolving Analytic Approaches. New York: Russell Sage Foundation.
Bollen, K., Paxton, P., and Morishima, R. 2005. Assessing International Evaluations: An Example from USAID's Democracy and Governance Programs. American Journal of Evaluation 26:189-203.
Bratton, M., Mattes, R., and Gyimah-Boadi, E. 2005. Public Opinion, Democracy, and Market Reform in Africa. New York: Cambridge University Press.
Campbell, D.T. 1968/1988. The Connecticut Crackdown on Speeding: Time-Series Data in Quasi-Experimental Analysis. Pp. 222-238 in Methodology and Epistemology for Social Science, E.S. Overman, ed. Chicago: University of Chicago Press.
Devine, T.J., and Heckman, J.J. 1996. The Economics of Eligibility Rules for a Social Program: A Study of the Job Training Partnership Act (JTPA)—A Summary Report. Canadian Journal of Economics 29(Special Issue: Part 1):S99-S104.
Duflo, E. 2000. Schooling and Labor Market Consequences of School Construction in Indonesia. Cambridge, MA: MIT Dept. of Economics Working Paper No. 00-06.
Duflo, E., Kremer, M., and Robinson, J. 2006a. Understanding Technology Adoption: Fertilizer in Western Kenya. Evidence from Field Experiments. Available at: http://www.econ.berkeley.edu/users/webfac/saez/e_s0/esther.pdf. Accessed on February 23, 2008.
Duflo, E., Glennerster, R., and Kremer, M. 2006b. Using Randomization in Development Economics Research: A Toolkit. Unpublished paper. Massachusetts Institute of Technology and Abdul Latif Jameel Poverty Action Lab, Cambridge, MA.
Dupas, P. 2007. Relative Risks and the Market for Sex: Teenagers, Sugar Daddies, and HIV in Kenya. Available at: http://www.dartmouth.edu/~pascaline/. Accessed on February 23, 2008.
Gerring, J. 2007. Case Study Research: Principles and Practices. Cambridge: Cambridge University Press.
Gerring, J., and McDermott, R. 2007. An Experimental Template for Case-Study Research. American Journal of Political Science 51:688-701.
Glewwe, P., Kremer, M., and Moulin, S. 2007. Many Children Left Behind? Textbooks and Test Scores in Kenya. NBER Working Paper No. 13300. Available at: http://papers.nber.org/papers/w00. Accessed on April 26, 2008.
Grootaert, C., Narayan, D., Jones, V.N., and Woolcock, M. 2004. Measuring Social Capital: An Integrated Questionnaire. World Bank Working Paper No. 18. Washington, DC: The World Bank.
Gueron, J.M., and Hamilton, G. 2002. The Role of Education and Training in Welfare Reform. Policy Brief No. 20. Washington, DC: Brookings Institution.
Gugerty, M.K., and Kremer, M. 2006. Outside Funding and the Dynamics of Participation in Community Associations. Background Paper. Washington, DC: The World Bank. Available at: http://siteresources.worldbank.org/INTPA/Resources/Training-Materials/OutsideFunding.pdf. Accessed on April 26, 2008.
Hahn, J., Todd, P., and Van der Klaauw, W. 2001. Identification and Estimation of Treatment Effects with a Regression-Discontinuity Design. Econometrica 69(1):201-209.
Heckman, J. 1997. Instrumental Variables: A Study of Implicit Behavioral Assumptions Used in Making Program Evaluations. Journal of Human Resources 32(3):441-462.
Heckman, J., Ichimura, H., and Todd, P. 1997. Matching as an Econometric Evaluation Estimator: Evidence from Evaluating a Job Training Program. Review of Economic Studies 64:605-654.
Hyde, S. 2006. The Observer Effect in International Politics: Evidence from a Natural Experiment. Unpublished paper. Yale University.
Igoe, J. 2003. Scaling Up Civil Society: Donor Money, NGOs and the Pastoralist Land Rights Movement in Tanzania. Development and Change 34(5):863-885.
Kremer, M. 2003. Randomized Evaluations of Educational Programs in Developing Countries: Some Lessons. American Economic Review 93(2):102-106.
Kremer, M., Leino, J., Miguel, E., and Zwane, A. 2006. Spring Cleaning: A Randomized Evaluation of Source Water Improvement. Available at: http://economics.harvard.edu/faculty/Kremer/files/springclean.pdf. Accessed on April 26, 2008.
Meyer, B.D., Viscusi, W.K., and Durbin, D.L. 1995. Workers' Compensation and Injury Duration: Evidence from a Natural Experiment. American Economic Review 85(3):322-340.
Miguel, E., and Kremer, M. 2004. Worms: Identifying Impacts on Education and Health in the Presence of Treatment Externalities. Econometrica 72(1):159-217.
Miron, J.A. 1994. Empirical Methodology in Macroeconomics: Explaining the Success of Friedman and Schwartz's "A Monetary History of the United States, 1867-1960." Journal of Monetary Economics 34:17-25.
Morley, S., and Coady, D. 2003. Targeted Education Subsidies in Developing Countries: A Review of Recent Experiences. Washington, DC: Center for Global Development.
Murdoch, J. 2005. The Economics of Microfinance. Cambridge, MA: MIT Press.
Nevill, C.G., Some, E.S., Mung'ala, V.O., Mutemi, W., New, L., Marsh, K., Lengeler, C., and Snow, R.W. 1996. Insecticide-Treated Bednets Reduce Mortality and Severe Morbidity from Malaria Among Children on the Kenyan Coast. Tropical Medicine and International Health 1:139-146.
Newhouse, J.P. 2004. Consumer-Directed Health Plans and the RAND Health Insurance Experiment. Health Affairs 23(6):107-113.
Olken, B.A. 2007. Monitoring Corruption: Evidence from a Field Experiment in Indonesia. Journal of Political Economy 115:200-249.
Putnam, R. 1993. Making Democracy Work: Civic Traditions in Modern Italy. Princeton, NJ: Princeton University Press.
Putnam, R. 2000. Bowling Alone. New York: Simon and Schuster.
Schultz, T.P. 2004. School Subsidies for the Poor: Evaluating the Mexican PROGRESA Poverty Program. Journal of Development Economics 74(1):199-250.
Senge, P. 2006. The Fifth Discipline: The Art and Practice of the Learning Organization, Rev. Ed. New York: Doubleday.
Shadish, W.R., Cook, T.D., and Campbell, D.T. 2002. Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Boston: Houghton-Mifflin.
Skocpol, T. 2003. Diminished Democracy: From Membership to Management in American Civic Life. Julian T. Rothbaum Lecture Series, Vol. 8. Norman: University of Oklahoma Press.
Trochim, W., and Donnelly, J.P. 2007. The Research Methods Knowledge Base, 3rd ed. Cincinnati, OH: Atomic Dog Publishing.
Vermeersch, C., and Kremer, M. 2004. School Meals, Educational Achievement, and School Competition: Evidence from a Randomized Evaluation. World Bank Policy Research Working Paper No. 3523. Washington, DC: The World Bank.
Wholey, J.S., Hatry, H.P., and Newcomer, K.E., eds. 2004. Handbook of Practical Program Evaluation, 2nd ed. San Francisco: Jossey-Bass.
Wilson, E.O. 1998. Consilience: The Unity of Knowledge. New York: Knopf.
World Bank. 2004. Monitoring & Evaluation: Some Tools, Methods & Approaches. Washington, DC: International Bank for Reconstruction and Development, The World Bank.