Improving Evaluation of Anticrime Programs

4 How Should an Impact Evaluation Be Designed?

Assuming that a criminal justice program is evaluable and an impact evaluation is feasible, an appropriate research design must be developed. The basic idea of an impact evaluation is simple. Program outcomes are measured and compared to the outcomes that would have resulted in the absence of the program. In practice, however, it is difficult to design a credible evaluation study in which such a comparison can be made. The fundamental difficulty is that whereas the program being evaluated is operational and its outcomes are observable, at least in principle, the outcomes in the absence of the program are counterfactual and not observable. This situation requires that the design provide some basis for constructing a credible estimate of the outcomes for the counterfactual conditions.

Another fundamental characteristic of impact evaluation is that the design must be tailored to the circumstances of the particular program being evaluated, the nature of its target population, the outcomes of interest, the data available, and the constraints on collecting new data. As a result, it is difficult to define a “best” design for impact evaluation a priori. Rather, the issue is one of determining the best design for a particular program under the particular conditions presented to the researcher when the evaluation is undertaken. This feature of impact evaluation has significant implications for how such research should be designed and also for how the quality of the design should be evaluated.
THE REPERTOIRE OF RELEVANT RESEARCH DESIGNS

Establishing credible estimates of what the outcomes would have been without the program, all else equal, is the most demanding part of impact evaluation, but also the most critical. When those estimates are convincing, the effects found in the evaluation can be attributed to the program rather than to any of the many other possible influences on the outcome variables. In this case, the evaluation is considered to have high internal validity. For example, a simple comparison of recidivism rates for those sentenced to prison and those not sentenced would have low internal validity for estimating the effect of prison on reoffending. Any differences in recidivism outcomes could easily be due to preexisting differences between the groups. Judges are more likely to sentence offenders who have serious prior records to prison. Prisoners’ greater recidivism rates may not be the result of their prison experience but, rather, the fact that they are more serious offenders in the first place. The job of a good impact evaluation design is to neutralize or rule out such threats to the internal validity of a study.

Although numerous research designs are used to assess program effects, it is useful to classify them into three broad categories: randomized experiments, quasi-experiments, and observational designs. Each, under optimal circumstances, can provide a valid answer to the question of whether a program has an effect upon the outcomes of interest. However, these designs differ in the assumptions they make, the nature of the problems that undermine those assumptions, the degree of control the researcher must have over program exposure, the way in which they are implemented, the issues encountered in statistical analysis, and in many other ways as well.
As a result, it is difficult to make simplistic generalizations about which is the best method for obtaining a valid estimate of the effect of any given intervention. We return to this issue later but first provide an overview of the nature of each of these types of designs.

RANDOMIZED EXPERIMENTS

In randomized experiments, the units toward which program services are directed (usually people or places) are randomly assigned to receive the program or not (intervention and control conditions, respectively). For example, in the Minneapolis Hot Spots Experiment (Sherman and Weisburd, 1995), 110 crime hot spots were randomly allocated to an experimental condition that received high levels of preventive patrol and a control condition with a lower “business as usual” level of patrol. The researchers found a moderate, statistically significant program effect on crime rates. Because the hot spots were assigned by a chance process that
took no account of their individual characteristics, the researchers could assume that there were no systematic differences between them other than the level of policing. The differences found on the outcome measures, therefore, could be convincingly interpreted as intervention effects.

The main threat to the internal validity of the randomized experiment is attrition prior to outcome measurement that degrades the randomized groups. In the randomized experiment reported by Berk (2003), offenders were randomly assigned to one of several correctional facilities that used different inmate classification systems. The internal validity of this study would have been compromised if a relatively large proportion of those offenders then left those facilities too quickly to establish the misconduct records that provided the outcome measures, e.g., through unexpected early release or transfers to other facilities. Such attrition cannot automatically be assumed to be random or unrelated to the characteristics of the respective facilities; thus it degrades the statistical equivalence between the groups that was established by the initial randomization. In the prison settings studied by Berk, low rates of attrition were achieved, but this is not always the case. In many randomized experiments conducted in criminal justice research, attrition is a significant problem.

QUASI-EXPERIMENTS

Quasi-experiments are approximations to randomized experiments that compare selected cases receiving an intervention with selected cases not receiving it, but without random assignment to those conditions (Cook and Campbell, 1979). Quasi-experiments generally fall into three classes. In the most common type, an intervention group is compared with a control group that has been selected on the basis of similarity to the intervention group, a specific selection variable, or perhaps simply convenience.
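The construction of such a matched comparison group can be sketched in code. The offender records, field names, and matching variables below are hypothetical, and this greedy routine is a minimal illustration rather than a full matching procedure:

```python
# Sketch: greedy 1:1 matching of comparison cases to intervention cases
# (hypothetical offender records). Exact match on gender; nearest
# neighbor on age and number of prior offenses.

def match_controls(treated, pool):
    matches = []
    available = list(pool)
    for t in treated:
        candidates = [c for c in available if c["gender"] == t["gender"]]
        if not candidates:
            continue  # no eligible comparison case for this offender
        best = min(candidates,
                   key=lambda c: abs(c["age"] - t["age"])
                               + abs(c["priors"] - t["priors"]))
        matches.append((t["id"], best["id"]))
        available.remove(best)  # match without replacement
    return matches

intensive = [{"id": 1, "gender": "M", "age": 24, "priors": 3},
             {"id": 2, "gender": "F", "age": 31, "priors": 1}]
regular   = [{"id": 10, "gender": "M", "age": 25, "priors": 3},
             {"id": 11, "gender": "M", "age": 40, "priors": 0},
             {"id": 12, "gender": "F", "age": 30, "priors": 1}]

print(match_controls(intensive, regular))  # [(1, 10), (2, 12)]
```

Outcomes (e.g., rearrest within two years) would then be compared across the matched pairs; the credibility of the design rests on the matching variables capturing the relevant preexisting differences between the groups.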
For example, researchers might compare offenders receiving intensive probation supervision with offenders receiving regular probation supervision who are matched on prior offense history, gender, and age. The design of this type that is least vulnerable to internal validity threats is the regression-discontinuity or cutting-point design (Shadish, Cook, and Campbell, 2002). In this design, assignment to intervention and control conditions is made on the basis of scores on an initial measure, e.g., a pretest or risk variable. For example, drug offenders might be assigned to probation if their score on a risk assessment was below a set cut point and to drug court if it was above that cut point. The effects of drug court on subsequent substance use will appear as a discontinuity in the statistical relationship between the risk score and the substance use outcome variable.

A second type of quasi-experiment is the time-series design. This design uses a series of observations on the outcome measure made before the program begins that is then compared with another series made afterward. Thus, researchers might compare traffic accidents per month for the year before a speeding crackdown and the year afterward. Because of the requirement for repeated measures prior to the onset of the intervention, time-series designs are most often used when the outcome variables of interest are available from data archives or public records.

The third type of quasi-experiment combines nonrandomized comparison groups with time-series observations, contrasting time series for conditions with and without the program. In this design the researcher might compare traffic accidents before and after a speeding crackdown with comparable time-series data from a similar area in which there was no crackdown. This kind of comparison is sometimes referred to as the difference-in-difference method since the pre-post differences in outcomes for the intervention condition are compared to the pre-post differences in the comparison condition. Ludwig and Cook (2000), for instance, evaluated the impact of the 1994 Brady Act by comparing homicide and suicide rates from 1985 to 1997 in 32 states directly affected by the act with those in 19 states that had equivalent legislation already in place.

Quasi-experimental designs are more vulnerable than randomized designs to influences from sources other than the program that can bias the estimates of effects. The better versions of these designs attempt to statistically account for such extraneous influences. To do that, however, requires that the influences be recognized and understood and that data relevant to dealing with them statistically be available.
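The difference-in-difference arithmetic itself is simple and worth making concrete. The accident counts below are hypothetical:

```python
# Sketch: difference-in-differences with hypothetical mean monthly
# traffic-accident counts. The comparison area's pre-post change stands
# in for what would have happened in the crackdown area without it.

def diff_in_diff(treated_pre, treated_post, comparison_pre, comparison_post):
    return (treated_post - treated_pre) - (comparison_post - comparison_pre)

# Crackdown area fell from 50 to 38 accidents per month; a similar
# area without a crackdown fell from 49 to 45 over the same period.
effect = diff_in_diff(50, 38, 49, 45)
print(effect)  # -8, the estimated program effect
```

Of the 12-accident drop in the crackdown area, 4 are attributed to the shared trend and 8 to the crackdown itself, on the assumption that the two areas would otherwise have moved in parallel.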
The greatest threat to the internal validity of quasi-experimental designs, therefore, is usually uncontrolled extraneous influences that have differential effects on the outcome variables that are confounded with the true program effects. Simply stated, the equivalence that one can assume from random allocation of subjects into intervention and control conditions cannot be assumed when allocation into groups is not random. Moreover, these designs, like experimental designs, are vulnerable to attrition after the intervention has begun.

OBSERVATIONAL DESIGNS

The third type of design used for evaluation of crime and justice programs is an observational design. Strictly speaking, all quasi-experiments are observational designs, but we will use this category to differentiate studies that observe natural variation in exposure to the program and model its relationship to variation in the outcome measures with other influences statistically controlled. For example, Ayres and Levitt (1998) examined the effects of Lojack, a device used to retrieve stolen vehicles,
on city auto theft rates. They drew their data from official records in cities that varied in the prevalence of Lojack users. Because many factors besides use of Lojack influence auto theft, they attempted to account for these potential threats to validity by controlling for them in a statistical model. This type of structural model has been used to study the effects of law enforcement on cocaine consumption (Rydell and Everingham, 1994), racial discrimination in policing (Todd, 2003), and other criminal justice interventions.

The major threat to the internal validity of observational designs used for impact evaluation is failure to adequately model the processes influencing variation in the program and the outcomes. This problem is of particular concern in criminal justice evaluations because theoretical development in criminology is less advanced than in disciplines, like economics, that rely heavily on observational modeling (Weisburd, 2003). Observational methods require that the researcher have sufficient understanding of the processes underlying intervention outcomes, and the other influences on those outcomes, to develop an adequate statistical model. Concern about the validity of the strong assumptions often needed to identify intervention effects with such modeling approaches has led to the development of methods for imposing weak assumptions that yield bounds on the estimates of the program effect (Manski, 1995; Manski and Nagin, 1998). An example of this technique is presented below.

Manski and Nagin (1998) illustrated the use of bounding methods in observational models in a study of the impact of sentencing options on the recidivism of juvenile offenders. Exploiting the rich data on juvenile offenders collected by the state of Utah, they assessed the two main sentencing options available to judges: residential and nonresidential sentences.
Although offenders sentenced to residential treatment are more likely to recidivate, this association may only reflect the tendency of judges to sentence different types of offenders to residential placements than to non-residential ones. Several sets of findings clearly revealed how conclusions about sentencing policy vary depending on the assumptions made. Two alternative models of judicial decisions were considered. The outcome optimization model assumes that judges make sentencing decisions that minimize the chance of recidivism. The skimming model assumes that judges sentence high-risk offenders to residential confinement. In the worst-case analysis where nothing was assumed about sentencing rules or outcomes, only weak conclusions could be drawn about the recidivism implications of the two sentencing options. However, much stronger conclusions were drawn under the judicial decision-making model. If one believes that judges optimize outcomes—that is, choose sentences in an effort to minimize recidivism—the empirical results indicate
that residential confinement increases recidivism. If one believes that judges skim—that is, assign high-risk offenders to residential treatment—the results suggest the opposite conclusion, namely that residential confinement reduces recidivism.

SELECTING THE DESIGN FOR AN IMPACT EVALUATION

Because high internal validity can be gained in a well-implemented randomized experiment, it is viewed by many researchers as the best method for impact evaluation (Shadish, Cook, and Campbell, 2002). This is also why randomized designs are generally ranked at the top of a hierarchy of designs in crime and justice reviews of “what works” (e.g., Sherman et al., 2002) and why they have been referred to as the “gold standard” for establishing the effects of interventions in fields such as medicine, public health, and psychology. For the evaluation of criminal justice programs, randomized designs have a long history but, nonetheless, have been used much less frequently than observational and quasi-experimental designs.

Whether a hierarchy of methods with randomized designs at the pinnacle should be defined at the outset for evaluation in criminal justice, however, is a contentious issue. The different views on this point do not derive so much from disagreements on the basic properties of the various designs as from different assessments of the trade-offs associated with their application. Different designs are more or less difficult to implement well in different situations and may provide different kinds of information about program effects. Well-implemented randomized experiments can be expected to yield results with more certain internal validity than quasi-experimental and observational studies. However, randomized experiments require that the program environment be subject to a certain amount of control by the researcher.
This may not be permitted in all sites and, as a result, randomized designs are often implemented in selected sites and situations that may not be representative of the full scope of the program being evaluated. In some cases, randomization is not acceptable for political or ethical reasons. There is, for instance, little prospect of random allocation of sentences for serious offenders or legislative actions such as imposition of the death penalty. Randomized designs are also most easily applied to programs that provide services to units such as individuals or groups that are small enough to be assigned in adequate numbers to experimental conditions. For programs implemented in places or jurisdictions rather than with individuals or groups, assigning sufficient numbers of these larger units to experimental conditions may not be feasible. This is not always the case, however. Wagenaar (1999), for instance, randomly assigned 15
midwestern communities to either a community organizing initiative aimed at changing policies and practices related to youth alcohol access or a control condition.

The advantages of randomized designs are such that it is quite justifiable to favor them for impact evaluation when they are appropriate to the questions at issue and there is a reasonable prospect that they can be implemented well enough to provide credible and useful answers to those questions. In situations where they are not, or cannot be, implemented well, however, they may not be the best choice (Eck, 2002; Pawson and Tilley, 1997) and another design may be more appropriate.

Quasi-experimental and observational designs have particular advantages for investigating program effects in realistic situations and for estimating the effects of other influences on outcomes relative to those produced by the program. For example, the influence of a drug treatment program on drug use may be compared to the effects of marital status or employment. Observational studies are generally less expensive per respondent (Garner and Visher, 2003) and do not require manipulation of experimental conditions. They thus may be able to use larger and more representative samples of the respective target population than those used in randomized designs. Observational studies, therefore, often have strong external validity. When they can also demonstrate good internal validity through plausible modeling assumptions and convincing statistical controls, they have distinct advantages for many evaluation situations. For some situations, such as evaluation of the effects of large-scale policy changes, they are often the only feasible alternative. In criminal justice, however, essential data are often not available and theory is often underdeveloped, which limits the utility of quasi-experimental and observational designs for evaluation purposes.
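The logic of statistically controlling for an extraneous influence can be shown in miniature. The city data and the single stratifying variable below are hypothetical; real observational models control for many influences at once, typically in a regression framework:

```python
# Sketch: statistical control by stratification (hypothetical city data).
# High-coverage cities happen to be larger, and large cities have more
# theft, so the raw comparison is confounded by city size.
from collections import defaultdict

cities = [
    {"coverage": "high", "size": "large", "theft_rate": 8.0},
    {"coverage": "high", "size": "large", "theft_rate": 9.0},
    {"coverage": "high", "size": "small", "theft_rate": 3.0},
    {"coverage": "low",  "size": "large", "theft_rate": 11.0},
    {"coverage": "low",  "size": "small", "theft_rate": 5.0},
    {"coverage": "low",  "size": "small", "theft_rate": 4.0},
]

def mean(xs):
    return sum(xs) / len(xs)

def raw_difference(rows):
    by = defaultdict(list)
    for r in rows:
        by[r["coverage"]].append(r["theft_rate"])
    return mean(by["high"]) - mean(by["low"])

def adjusted_difference(rows, stratum_key):
    strata = defaultdict(lambda: defaultdict(list))
    for r in rows:
        strata[r[stratum_key]][r["coverage"]].append(r["theft_rate"])
    diffs = [mean(g["high"]) - mean(g["low"])
             for g in strata.values() if g["high"] and g["low"]]
    return mean(diffs)

print(raw_difference(cities))               # 0.0  -- effect hidden
print(adjusted_difference(cities, "size"))  # -2.0 -- effect within strata
```

Stratification makes the logic visible; in practice the same adjustment is usually carried out with regression models, and it works only for influences that are recognized and measured, which is the core vulnerability of observational designs noted above.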
As this discussion suggests, the choice of a research design for impact evaluation is a complex one that must be based in each case on a careful assessment of the program circumstances, the evaluation questions at issue, practical constraints on the implementation of the research, and the degree to which the assumptions and data requirements of any design can be met. There are often many factors to be weighed in this choice and there are always trade-offs associated with the selection of any approach to conducting an impact evaluation in the real world of criminal justice programs. These circumstances require careful deliberation about which evaluation design is likely to yield the most useful and relevant information for a given situation rather than generalizations about the relative superiority of one method over another. The best guidance, therefore, is not an a priori hierarchy of presumptively better and worse designs, but a process of thoughtful deliberation by knowledgeable and methodologically sophisticated evaluation researchers that takes into account the particulars of the situation and the resources available.

GENERALIZABILITY OF RESULTS

As mentioned in the discussion above, one important aspect of an impact evaluation design may be the extent to which the results can be generalized beyond the particular cases and circumstances actually investigated in the study. External validity is concerned with the extent to which such generalizations are defensible. The highest levels of external validity are gained by selecting the units that will participate in the research on the basis of probability samples from the population of such units. For example, in studies of sentencing behavior, the researcher may select cases randomly from a database of all offenders who were convicted during a given period. Often in criminal justice evaluations, all available cases are examined for a specific period of time. In the Inmate Classification Experiment conducted by Berk (2003), 20,000 inmates admitted during a six-month period were randomly assigned to an innovative or traditional classification system.

There are often substantial difficulties in defining the target population, either because a complete census of its members is unavailable or because the specific members are unknown. For example, in the Multidimensional Treatment Foster Care study mentioned above, the researchers could not identify the population of juveniles eligible for foster care but rather drew their sample from youth awaiting placement. The researchers might reasonably assume that those youth are representative of the broader population, but they cannot be sure that the particular group selected during that particular study period is not different in some important way.
To the extent that the researcher cannot assure that each member of a population has a known probability of being selected for the research sample used in the impact evaluation, external validity is threatened.

Considerations of external validity also apply to the sites in a multisite program. When criminal justice evaluations are limited to specific sites, they may or may not be representative of the population of sites in which the program is, or could be, implemented. Berk’s (2003) study of a prison classification system assessed impact at several correctional facilities in California, but not all of them. The representativeness of the sites studied will depend on how they are selected and can be assured only if they are a random sample of the whole population of sites. It is important not to confuse the level at which an inference can be made; for example, a researcher may select a sample of subjects from a single prison but interpret the results as if they generalized to the population of prisons. In the absence of additional information, the only strictly valid statistical generalization is to the prisoners from which the subject sample was drawn. An assumption that the program would work equally well in a prison with different characteristics and a different offender population may be questionable.

STATISTICAL POWER

Another important design consideration for impact evaluations is statistical power, that is, the ability of the research design to detect a program effect of a given magnitude at a stipulated level of statistical significance. If a study has low statistical power, it is likely to yield a statistically nonsignificant finding even if there is a meaningful program impact. Such studies are “designed for failure”—an effective program has no reasonable chance of showing a statistically significant effect. Statistical power is a function of the nature and number of units on which outcome data are collected (sample size), as well as the variability and measurement of the data and the magnitude of the program effect (if any) to be detected. It is common for criminal justice evaluations to ignore statistical power and equally common for them to lack adequate power to provide a sensitive test of the effectiveness of the treatments they evaluate (Brown, 1989; Weisburd, Petrosino, and Mason, 1993). An underpowered evaluation that does not find significant program effects cannot be correctly interpreted as a failure of the program, though that is often the conclusion implied (Weisburd, Lum, and Yang, 2003).
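The kind of power calculation that should precede such a study can be sketched with a normal approximation for comparing two proportions (the numbers below are planning illustrations; exact binomial methods would refine the answer slightly):

```python
# Sketch: approximate power of a two-sided, two-sample test of
# proportions at alpha = .05, using the normal approximation.
import math

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def power_two_proportions(p1, p2, n_per_group):
    p_bar = (p1 + p2) / 2.0
    se_null = math.sqrt(2.0 * p_bar * (1.0 - p_bar) / n_per_group)
    se_alt = math.sqrt(p1 * (1.0 - p1) / n_per_group
                       + p2 * (1.0 - p2) / n_per_group)
    z_crit = 1.96  # two-sided test at the .05 level
    return norm_cdf((abs(p1 - p2) - z_crit * se_null) / se_alt)

# Recidivism rates of .65 (control) vs. .40 (intervention), 30 per group:
print(round(power_two_proportions(0.65, 0.40, 30), 2))  # about 0.49
```

Doubling the sample to 60 per group raises the power to roughly 80 percent, which is the conventional planning target.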
For example, if a randomized experiment included only 30 cases each for the intervention and control conditions, and the effect of the intervention was a .40 recidivism rate for the intervention group compared to .65 for the control group, the likelihood that it would be found statistically significant at the p < .05 level in any one study is only about 50 percent, though it is rather clearly a large effect in practical terms.

Even when statistical power is examined in criminal justice evaluations, the approach is frequently superficial. For example, it is common for criminal justice evaluators to estimate statistical power for program effects defined as “moderate” in size on the basis of Cohen’s (1988) general suggestions. Effect sizes in crime and justice are often much smaller than that, but this does not mean that they do not have practical significance (Lipsey, 2000). In the recidivism example used above, a “small” effect size as defined by Cohen would correspond to the difference between a .40 recidivism rate for the intervention group and .50 for the control group. A reduction of this magnitude for a large criminal population, however, would produce a very large societal benefit. It is important for evaluators to define at the outset the effect that is meaningful for the specific program and outcome that is examined.

The design components of a study are often interrelated so that manipulation of one component to increase statistical power may adversely affect another component. In a review of criminal justice experiments in sanctions, Weisburd et al. (1993) found that increasing sample size (which is the most common method for increasing statistical power) often affects the intensity of dosage in a study or the heterogeneity of the participants examined. For example, in the RAND Intensive Probation experiments (Petersilia and Turner, 1993), the researchers relaxed admissions requirements to the program in order to gain more cases. This led to the inclusion of participants who were less likely to be affected by the treatment, and thus made it more difficult to identify a treatment impact. Accordingly, estimation of statistical power, like other decisions that a researcher makes in designing a project, must be made in the context of the specific program and practices examined.

AVOIDING THE BLACK BOX OF TREATMENT

Whether a program succeeds or fails in producing the intended effects, it is important to policy makers and practitioners to know exactly what the program was that had those outcomes. Many criminal justice evaluations suffer from the “black box” problem—a great deal of attention is given to the description of the outcome but little is directed toward describing the nature of the program. For example, in the Kansas City Preventive Patrol Experiment (Kelling et al., 1974), there was no direct measure of the amount of patrol actually present in the three treatment areas. Accordingly, there was no objective way to determine how the conditions actually differed. It is thus important that a careful process evaluation accompany an impact evaluation to provide descriptive information on what happened during a study.
Process evaluations should include both qualitative and quantitative information to provide a full picture of the program. If the evaluation then finds a significant effect, it will be possible to clearly describe what produced it. Such description is essential if a program is to be replicated at other sites or implemented more broadly. If the evaluation does not find an effect (as in Kansas City), the researcher is able to examine whether this was the result of a theory failure or an implementation failure.

THE LIMITATIONS OF SINGLE STUDIES

It is not uncommon in criminal justice to draw broad policy conclusions from a single study conducted at one site. The outcomes of such a
study, however, may have more to do with the particular characteristics of the agency or personnel involved than with the strengths or weaknesses of the program itself. Note, for example, the variation Braga (2003) found in the effects of hot spots policing across five randomized control group studies. Similarly, a strong program impact in one jurisdiction may not carry over to others that have offenders or victims drawn from different ethnic communities or socioeconomic backgrounds (Berk, 1992; Sherman, 1992). This does not mean that single-site studies cannot be useful for drawing conclusions about program effects or developing policy, only that caution must be used to avoid overgeneralizing their significance.

Such circumstances highlight the importance of conducting multiple studies and integrating their findings so that meaningful conclusions can be drawn. The most common technique for integrating results from impact evaluation studies is meta-analysis or systematic review (Cooper, 1998). Meta-analysis allows the pooling of multiple studies in a specific area of interest into a single analysis in which each study is an independent observation. The main advantage of meta-analysis over traditional narrative reviews is that it yields an estimate of the average size of the intervention effect over a large number of studies while also allowing analysis of the sources of variation across studies in those effects (Cooper and Hedges, 1994; Lipsey and Wilson, 2001).

Another approach for overcoming the inherent weakness of single-site studies is replication research. In this case, studies are replicated at multiple sites within a broader program of study initiated by a funding agency. The Spouse Assault Replication Program (Garner, Fagan, and Maxwell, 1995) of the National Institute of Justice is an example of this approach.
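The pooling at the core of the meta-analytic approach described above can be sketched in a few lines. The effect estimates and standard errors below are hypothetical, and a full analysis would also test for and model between-study heterogeneity:

```python
# Sketch: fixed-effect (inverse-variance) pooling across studies.
# Each study's effect estimate is weighted by its precision, so large,
# precise studies count for more than small, noisy ones.
import math

studies = [  # (effect estimate, standard error) per study
    (0.30, 0.10),
    (0.10, 0.20),
    (0.25, 0.15),
]

def pooled_effect(studies):
    weights = [1.0 / se ** 2 for _, se in studies]
    pooled = sum(w * est for (est, _), w in zip(studies, weights)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se

est, se = pooled_effect(studies)
print(round(est, 3), round(se, 3))  # 0.257 0.077
```

The variation across studies matters as much as the average: when study-level effects differ by more than their standard errors can explain, a random-effects model and an analysis of moderators are called for.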
In that study, as in other replication studies, it has been difficult to combine investigations into a single statistical analysis (e.g., Petersilia and Turner, 1993), and it is common for replication studies to be discussed in ways similar to narrative reviews. A more promising approach, the multicenter clinical trial, is common in medical studies but rare in criminal justice evaluations (Fleiss, 1982; Stanley, Stjernsward, and Isley, 1981). In a multicenter clinical trial, a single study is conducted under very strict controls across a sample of sites. Weisburd and Taxman (2000), however, described the design of one such trial involving innovative drug treatments. In this case a series of centers worked together to develop a common set of treatments and common protocols for measuring outcomes. The multicenter approach enhances external validity by supporting inferences not only to the respondent samples at each site, but also to the more general population that the sites represent collectively.