As more funding becomes available for community programs designed to promote youth development, higher expectations are being placed on programs to demonstrate (and not just proclaim) that they do indeed promote the healthy development of youth. Well-established programs that draw on public funds are being adopted in cities and communities throughout the United States. Increasingly, these programs are expected to demonstrate that they actually make a difference in young people’s lives. It is precisely because these programs are being established in a wide variety of communities that it is important to know whether they make a difference and, if so, under what conditions and for whom. Moreover, given the recent call for significant investments of public resources, it would betray public trust not to document the steps taken to implement these programs and to provide evidence of their effectiveness.
How should one think about evaluation of community programs for youth in the future? What social indicators exist that help us understand community programs for youth? What else is needed in order to better understand and evaluate these programs?
Part III explores the various methods and tools available to evaluate these programs, including experimental, quasi-experimental, and nonexperimental methods. Each method involves different data collection techniques, and each affords a different degree of causal inference—that is, whether a particular variable or treatment actually causes changes in outcomes. Findings from evaluations using these methods were incorporated into Part II. We turn now to exploring evaluation methodologies in more detail, looking specifically at the role for evaluation (Chapter 7) and data collection (Chapter 8) for the future of these programs.
There are particular challenges inherent to evaluating community programs for youth. Many such programs are relatively new and are continually changing in response to growing interest and investment on the part of foundations and federal, state, and local policy makers. In addition, the elements of community programs for youth rarely remain stable and consistent over time, given that program staff are always trying to improve the services and the manner in which they are delivered. Moreover, some programs struggle to overcome barriers during the implementation phase—for example, to obtain licenses or permits, to acquire appropriate space or renovate a facility, or to recruit appropriate staff and program participants. As a result, the early implementation of a program may not follow its specified plan. Evaluation involves asking many questions and often requires an eclectic array of methods. Deciding what questions to ask and which methods to use in developing a comprehensive evaluation is important to the wide range of stakeholders in community programs for youth.
Generating New Information
This chapter explores the role of program evaluation in generating new information about community programs for youth. Evaluation and ongoing program study can provide important insights to inform program decisions. For example, evaluation can be used to ensure that programs are acceptable, accessible, developmentally appropriate, and culturally relevant according to the needs of the population being served. The desire to conduct high-quality evaluation can help program staff clarify their objectives and decide which types of evidence will be most useful in determining whether these objectives have been met. Ongoing program study and evaluation can also be used by program staff, program users, and funders to track program objectives; this is typically done by establishing a system for ongoing data collection that measures the extent to which various aspects of the programs are being delivered, how they are delivered, who is providing these services, and who is receiving them. In some circles, this is referred to as reflective practice. Such information can help program staff make changes to improve program effectiveness. And finally, program evaluation can test new or very well-developed program designs by assessing the immediate, observable results of the program—the outcomes and benefits associated with participation. Such summative evaluation can be done in conjunction with strong theory-based evaluation (see later discussion) or as a more preliminary assessment of the potential usefulness of novel programs and quite complex social experiments in which there is no well-specified theory of change.1 An example of the latter is the Moving to Opportunity program, in which poor families were randomly assigned to new housing in an entirely new neighborhood. In other words, program evaluation and study can help foster accountability, determine whether programs make a difference, and provide staff with the information they need to improve service delivery. They can generate new information about community programs for youth and the populations these programs serve and help stakeholders know how best to support the growth and development of programs.
Different types of evaluation are used according to the specific questions to be addressed by evaluation. In general, there are two different types of evaluations relevant to this report:
Process evaluation (formative evaluation) describes and assesses how a program operates, the services it delivers, and the activities it carries out and
Outcome evaluation (summative evaluation) identifies the results of a program’s efforts and considers what difference the program made for the young people who participated.
These two evaluative approaches can be thought of as a set of assessment options that build on one another, allowing program staff to increase their knowledge about the activities that they undertake as they incorporate more options or activities into their evaluation (see Box 7–1 for elaboration). They can also serve as the foundation from which programs justify the allocation of public and private resources.
Box 7–1 Process and Outcome Evaluation
Process evaluation helps explain how a program works and its strengths and weaknesses, often focusing on program implementation. It may, for example, document what actually transpires in a program and how closely it resembles the program’s goals. Process evaluation is also used to document changes in the overall program and the manner in which services are delivered over time. For example, a program administrator might observe that attendance at a particular activity targeted to both youth and their parents is low. After sitting down and talking with staff, the program administrator discovers that the activity has not been sufficiently advertised to the parents of the youth participating in the program. Moreover, the project team realizes that the times at which the activity is scheduled are inconvenient for parents. Increased and targeted outreach to parents and a change in the day and time of the activity result in increased attendance. Thus, process evaluation is an important aspect of any evaluation, because it can be used to monitor program activities and help program staff make decisions as needed.
Outcome evaluation asks whether a program is actually producing changes in the outcomes believed to be associated with the program’s design. For example, does participation in an after-school program designed to improve social and communication skills actually lead to increased engagement with peers and participation in program activities? Are the participants more likely to take on leadership roles and participate in planning program activities than they were before participating?
Process and outcome evaluations both rely on the collection of two types of data—qualitative and quantitative data (see Box 7–2 for elaboration). Quantitative data refer to numeric-based information, such as descriptive statistics that can be measured, compared, and analyzed. Qualitative data refer to attributes that have labels or names rather than numbers; they tend to rely on words and narrative and are commonly used to describe program services and characterize what people “say” about the programs.
Box 7–2 Quantitative and Qualitative Data
Quantitative data are commonly used to compare outcomes that may be associated with an intervention or service program for two different populations (i.e., those who received the service—or participated in the program—and those who did not) or for a single population at multiple time points (e.g., before, during, and after participation in the program). For example, assume that an organization wants to compare two different types of tutoring programs. To accomplish this goal, it gives a structured survey with numerically coded responses (such as a series of mathematical problems or a history or English knowledge test) to participants at the time they enter the program and again 3 and 6 months after the start of the program. The data collected by this survey (for example, scores on the tests of mathematical, historical, or English knowledge) are statistically analyzed to determine differences between the participants in each group.
Qualitative data are often derived from narratives and unstructured interviews or participant observations. A common misconception is that qualitative methods lack rigor and therefore are not scientific. In fact, qualitative methods can be just as scientific (meaning objective and empirical) as quantitative methods. They are the basis for much descriptive and classification work in both the social and natural sciences. They provide an opportunity to systematically examine organizations, groups, and individuals in an effort to extract meaning from situations; understand meaning ascribed to behaviors; clarify or further explore quantitative findings; understand how a service program operates; and determine whether a program or services can be adapted for use in other contexts and with other populations.
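The tutoring-program comparison described in Box 7–2 might be analyzed along the following lines. This is a hypothetical sketch: the scores are invented, and Welch's t-statistic is one plausible choice of test among several, not a method prescribed by this report.

```python
# Hypothetical sketch: comparing score gains for two tutoring programs
# (mirrors the Box 7-2 example; all data are invented for illustration).
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t-statistic for two independent samples."""
    va, vb = variance(a), variance(b)
    na, nb = len(a), len(b)
    return (mean(a) - mean(b)) / ((va / na + vb / nb) ** 0.5)

# Math-test scores at program entry and 6 months later, one list per program.
entry_a  = [52, 48, 61, 55, 49, 58]
six_mo_a = [66, 59, 70, 68, 57, 71]
entry_b  = [50, 54, 47, 60, 53, 51]
six_mo_b = [55, 60, 50, 64, 58, 54]

# Each participant's gain is the 6-month score minus the entry score.
gains_a = [post - pre for pre, post in zip(entry_a, six_mo_a)]
gains_b = [post - pre for pre, post in zip(entry_b, six_mo_b)]

print(f"mean gain, program A: {mean(gains_a):.1f}")
print(f"mean gain, program B: {mean(gains_b):.1f}")
print(f"Welch t-statistic:    {welch_t(gains_a, gains_b):.2f}")
```

In a real evaluation the t-statistic would be compared against the appropriate t-distribution to obtain a p-value; the sketch stops at the descriptive comparison.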
EVALUATING COMMUNITY PROGRAMS FOR YOUTH
In our review of studies and evaluations that have been conducted on community programs serving youth, we found that a wide range of evaluation methods and study designs are being used, including experimental, quasi-experimental, and nonexperimental methods (see Box 7–3). Part II provided examples of community programs for youth using experimental or quasi-experimental methods, as well as a range of other nonexperimental methods of study, including interviews, focus
Box 7–3 Evaluation Design Methodologies
Experimental design involves the random assignment of individuals to either a treatment group (in this case, participation in the program being assessed) or a control group (a group that does not receive the treatment). Many believe that experimental design provides some of the strongest, clearest evidence in evaluation research. This design also affords the highest degree of causal inference, since the random assignment of individuals to an intervention condition limits the opportunity for bias in estimates of the treatment’s effectiveness.
Quasi-experimental design has all the elements of an experiment, except that subjects are not randomly assigned to groups. In this case, the group being compared with the individuals receiving the treatment (participating in the program) is referred to as the comparison group rather than the control group, and this method relies on naturally occurring variation in exposure to treatment. Evaluations using this design cannot be relied on to yield unbiased estimates of the effects of interventions because individuals are not assigned randomly. Although quasi-experimental study designs can provide evidence that a causal relationship exists between participation in the intervention and the outcome, the magnitude of the effect and the causal attribution are more difficult to determine.
Nonexperimental design does not involve either random assignment or the use of control or comparison groups. Nonexperimental designs rely more heavily on qualitative data. These designs gather information through such methods as interviews, observations, and focus groups in an effort to learn more about the individuals receiving the treatment (participating in the program) or the effects of the treatment on these individuals. This type of research often consists of detailed histories of participant experiences with an intervention. Although they may contain a wealth of information, nonexperimental studies cannot provide a strong basis for estimating the size of an effect or for unequivocally testing causal hypotheses because they are unable to control for such factors as maturation, self-selection, attrition, or the interaction of such influences on program outcomes. They are, however, particularly useful for generating new hypotheses, for developing classification systems, and for gaining insight into people’s understandings.
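The random assignment that defines an experimental design (Box 7–3) can be sketched in a few lines. This is a purely illustrative sketch: the applicant names, group sizes, and seed are invented, and real evaluations typically randomize with stratification and documented procedures.

```python
# Hypothetical sketch of random assignment for an experimental design:
# each applicant has an equal chance of landing in the treatment group
# (program participation) or the control group.
import random

def randomize(applicants, seed=None):
    """Randomly split applicants into treatment and control groups."""
    rng = random.Random(seed)          # seeded for a reproducible example
    pool = list(applicants)
    rng.shuffle(pool)                  # random order removes selection bias
    half = len(pool) // 2
    return pool[:half], pool[half:]    # (treatment, control)

applicants = [f"youth_{i}" for i in range(20)]
treatment, control = randomize(applicants, seed=42)
print(len(treatment), len(control))
```

Because assignment depends only on chance, any preexisting differences between the two groups are, in expectation, balanced, which is what licenses the causal inference described in the box.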
groups, ethnographic studies, and case studies. It also summarized some findings from studies using both nonexperimental and experimental methods.
There are differing opinions among service practitioners, researchers, policy makers, and funders about the most appropriate and useful methods for the evaluation of community programs for youth (Catalano et al., 1999; McLaughlin, 2000; Kirby, 2001; Connell et al., 1995; Gambone, 1998). Not surprisingly, there was even some disagreement among committee members about the standards for evaluation of community programs for youth. Through consideration of our review of various programs, the basic science of evaluation, and a set of experimental evaluations, quasi-experimental evaluations, and nonexperimental studies of community programs for youth, the committee agreed that no specific evaluation method is well suited to address every important question. Rather, comprehensive evaluation requires asking and answering many questions using a number of different evaluation models. What is most important is to agree to, and rely on, a set of standards that help determine the conditions under which different evaluation methods should be employed and to evaluate programs using the greatest rigor possible given the circumstances of the program being evaluated. At the end of this chapter, we present a set of important questions to be addressed in order to achieve comprehensive evaluation, whether by experiments or by other methods.
QUESTIONS ASKED IN COMPREHENSIVE EVALUATION
To fully realize the overall effectiveness and impact of a program on its participants, a comprehensive evaluation should be conducted. A comprehensive evaluation addresses six fundamental questions:
Is the theory of the program that is being evaluated explicit and plausible?
How well has the program theory been implemented in the sites studied?
In general, is the program effective and, in particular, is it effective with specific subpopulations of young people?
Whether it is or is not effective, why is this the case?
What is the value of the program?
What recommendations about action should be made?
These questions have a logical ordering. For instance, answers to the middle questions depend on high-quality answers to the first two. To give an example, if one is not sure that a program has been well implemented, what good does it do to ask whether it is effective? The appropriate questions to answer in any specific evaluation depend in large part on the state of previous research on a program and hence often on the program’s maturity. The more advanced these matters are, the more likely it is that one can go on to answer the later questions in the sequence above. While answering all six questions may be the ultimate goal, it is important to note that it is difficult to answer all questions well in any one study. Consequently, evaluations should not routinely be expected to do so. However, with multiple evaluation studies, one would hope and even expect that all of these questions will be addressed. Answers to all of these questions may also be obtained by synthesizing the findings from many evaluations conducted with the same organization, program, or element. They may also be answered by using various evaluation methods. We now take a brief look at these questions, realizing that they are interdependent.
Is the Theory of the Program Plausible and Explicit?
Good causal analysis is facilitated by a clear, substantive theory that explicitly states the processes by which change will occur. Ideally, this theory should be the driving force in program development and should guide decisions about what to measure and how. The theory should explicitly state the components that are thought to be necessary for the expected effects to occur (i.e., the specific aspects of the programs—such as good adult role models—that account for the programs’ effects on such outcomes as increasing self-confidence). It should also detail the various characteristics of the program, the youth, and the community that are likely to influence just how effective the program is with specific individuals (i.e., the moderators of the program’s effectiveness, such as ethnicity or disability status). These components must then be measured to evaluate whether they are in place in the form and quantity expected.
At a more abstract level, the theory must also be plausible, taking into account current knowledge in the relevant basic sciences, the nature of the target population, and the setting in which the program intervention takes place. We provided an initial review of the existing literatures in Chapters 2, 3, and 4. Obviously, a program needs no further evaluation if this stage of the analysis shows that its most fundamental assumptions are at variance with well-supported current substantive theory or with the level of human and financial resources that can be made available to support it. Once specified, these causal models can be used for two purposes: (1) to design the assessment of the quality of program implementation and (2) when all the data are in, to judge how plausible it is to infer that the observed results are due to processes built into the program theory. Several of the evaluations we looked at provided good models of such an analysis. One is the evaluation of Girls Inc.’s Preventing Adolescent Pregnancy Project (e.g., Postrado et al., 1997).
Finding a well-specified model underlying either program design or program evaluation is unusual. Few of the evaluations reviewed and discussed in this report had one (see Chapter 6 for examples). Development of such theoretically grounded program-specific models probably requires a prolonged and genuine collaboration among basic researchers, applied researchers, program developers, practitioners, and program evaluators. We highly recommend that more funds be directed to support such ongoing collaborations. Information in Chapters 2, 3, 4, and 5 provides a rudimentary basis for developing such theoretically grounded program models. Complex generic models are needed that involve theorizing at the community, program, cultural, and individual levels, because these will help determine the classes of variables to be included in both program and outcome evaluations and in both ongoing reflective practice and program study. Gambone, Connell, and their colleagues developed one such model (Gambone, 1998; Connell et al., 2000). Another formed the basis for Public/Private Ventures’ new project around community change for youth development (Sipe et al., 1998). Yet others were the basis of the Midwestern Prevention Project and the Communities that Care initiative (described in more detail in Chapter 5). But even these are quite general models. Models that are specific to the program under analysis are needed. Cook and his colleagues developed one such model for their evaluation of the Comer School Intervention (Cook et al., 1999; Anson et al., 1991; Cook et al., 2000). This work is discussed in more detail later in the chapter.
How Well Was the Program Implemented?
Part of a program’s operational theory deals with the treatment components and dosage necessary to achieve the expected outcomes. An empirical description and analysis of the implementation of the components of a program are therefore necessary to determine: (1) whether participants received the intended treatment and (2) how variation in the treatment actually received might be related to outcomes. However, too many outcome evaluations still contain sparse or nonexistent information about particular program elements. Without such a description, it is impossible to know, for instance, whether a lack of effectiveness might be due to poor program theory, poor implementation of what might otherwise be an effective treatment, or even poor evaluation that failed to detect true effects. If evaluators find that participants did not receive the planned treatment and the expected effects did not occur, then valuable information has been gained. It is obvious that the program is not likely to be effective in its current form, that redesign to improve implementation quality is a high priority, and that the theory behind the program has not been tested in the evaluation, since quality implementation is a precondition for a strong test of the theory. Without measures of implementation, the program can be evaluated in terms of its effectiveness, but the theory behind the program cannot be evaluated.
Obviously, it is impossible to achieve perfect implementation in the real world of social programming, but there is much to learn from empirical details about implementation, even when the expected effects do not occur. We found few examples of even systematic attempts at implementation assessment (one such example is the follow-up evaluations of Big Brothers Big Sisters). Cook and colleagues’ evaluations of the Comer School Intervention provide a good example of careful implementation assessment (Cook et al., 1999; Anson et al., 1991; Cook et al., 2000). We found a few good examples of attempts to measure exposure; these included the ongoing evaluations of the Teen Outreach Program by Allen and colleagues (Allen and Philliber, 1998) and the ongoing evaluations of Girls Inc.’s Adolescent Pregnancy Prevention Project (Nicholson and Postrado, 1992; Postrado et al., 1997).
Programs in the early stages of development should focus their evaluation activities on implementation rather than outcome analysis. This is the period when the program itself is likely to change most rapidly as the original conception requires some accommodation to newly learned realities of place, time, and people. This process of growth should be carefully studied in its own right for the information it provides to a basic understanding of social systems and human development. Evaluation focused only on effectiveness is most appropriate when the program has matured, the theory of change has crystallized, and implementation has been studied enough to know that the program on the ground is not
some distorted caricature of the program intended. In this case, one can reach reasonable conclusions about effectiveness since one has confidence that the program itself was well implemented and the outcome measures are indeed theoretically related to the “treatment.” Both significant and null effects can inform conclusions about program effectiveness.
In determining how well a program was implemented, it is important not to forget that the comparison group is never a “no-treatment” group. It is essential to describe what happens in the comparison groups in terms of their opportunities, constraints, and exposure to positive developmental activities. Individuals not included in the experimental treatment may nonetheless gain access to, or be exposed to, similar (even if not identical) activities through other local organizations or through the influence of their friends who are in the treatment group. It is essential to describe these experiences, for few programs exist in a void, and many young people have potential access to more than one program with overlapping services. All experimental evaluations assess the consequences of exposure not to the experimental treatment in isolation, but in terms of the contrast between the treatment and the comparison groups. When the comparison group experiences events similar to those of the treatment group, we can hardly expect to find treatment effects; any effects found would have to be due to the name of the organization rather than to the specific activities undertaken there. In this case, a finding of nonsignificant effects (null effects) should not be interpreted as an indication that the program is ineffective. Because such careful assessment of exposure to treatment and treatment-like conditions is rare, and because some individuals in the treatment condition may not actually receive it, first-order estimates of treatment effects are conservative, in that one is rarely sure of the true magnitude of the differential exposure of the treatment and control groups to the program. Nonetheless, these estimates do provide information to policy makers regarding the value added of putting the treatment program into a community.2
A process evaluation describing implementation quality can be carried out using a wide variety of both qualitative and quantitative data collection techniques. Among the qualitative methods are ethnographic data collection on a time-sampling basis and open-ended interviews with young people and with service practitioners and managers. Ethnographic studies allow researchers to examine unexpected events in depth, to understand important mediating processes not part of the original theory, and to describe how the program evolves over time. Such methods require a substantial investment of time and money, but in our judgment they are necessary for collecting rich descriptive data that allow the researchers to understand the program better and to communicate these understandings more vividly. Of course, the same kinds of data need to be collected from comparison group members whenever possible, adding to the time and dollar commitment involved.
Among the quantitative methods for describing implementation are closed-ended interviews or questionnaires administered to youth in each group and to service practitioners. Good quantitative analysis requires a clear program theory and measures that can be used over time to determine the extent to which the program is moving toward sound implementation. When done well, these quantitative analyses can provide a good assessment of how well an intervention has been implemented in the way it was formulated at the time data collection began. Some questionnaire items can also ask about changes in order to document the dynamic, changing nature of some programs.
It is important to remember that in experimental program evaluations, it is highly desirable—if not necessary in theory—that the assessment of implementation quality be identical for the treatment and control groups (the same is actually also true in other forms of evaluation). A difficulty here is that the control group members may be in a single alternative program or in none at all. Moreover, if implementation scores are to be attached to individual young people to describe the variation in treatment they actually receive, then the desirability of assessing implementation for all young people rather than a sample of them or for the setting in the aggregate increases. The greater the anticipated variability in individual exposure patterns, the greater the need for respondent-specific measures of implementation. These can be expensive and disruptive, although means do exist to deal with each of these concerns.
An important aspect of implementation is dropping out of a program. Dosage usually involves some measure of the length of program exposure. Indeed, program practitioners are reluctant to have their programs evaluated by criteria that count those who have had only limited exposure to program activities. This is understandable; consequently, at a minimum, exposure lengths need documenting, since retaining young people in the program is something all programs strive for, and success in this area speaks to the developmental appropriateness of the content and the skills of the staff. If it turns out that there are group differences in the length of exposure to program activities in a randomized experiment comparing two different programs, for example, then this information can be used to adjust estimates of the program effect. It is important that information on differences in exposure be considered in relation to variation in changes in important developmental outcomes. Since practitioners certainly want to see analyses that evaluate their program at its strongest (even if not in its most typical state), some subanalyses assessing impacts for the subset of young people who attend a program for a long period of time and are exposed to activities that meet the state of the art in youth programming are important.
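The exposure subanalysis just described can be sketched in code. This is a hypothetical illustration: the participant records and the 26-week cutoff are invented, and the high-exposure contrast is descriptive only, since, unlike the intention-to-treat estimate, it is not protected by the original randomization.

```python
# Hypothetical sketch: contrasting an intention-to-treat (ITT) estimate with
# a descriptive high-exposure subgroup estimate. Records are invented;
# 'weeks' is length of program exposure, 'gain' is change in an outcome.
from statistics import mean

treatment = [
    {"weeks": 40, "gain": 9}, {"weeks": 35, "gain": 8},
    {"weeks": 6,  "gain": 2}, {"weeks": 30, "gain": 7},
    {"weeks": 4,  "gain": 1}, {"weeks": 38, "gain": 10},
]
control_gains = [3, 2, 4, 1, 3, 2]

# ITT: compare everyone assigned to treatment, regardless of actual exposure.
itt_effect = mean(r["gain"] for r in treatment) - mean(control_gains)

# Descriptive subgroup: only those with at least 26 weeks of exposure.
high = [r["gain"] for r in treatment if r["weeks"] >= 26]
subgroup_effect = mean(high) - mean(control_gains)

print(f"ITT effect estimate:    {itt_effect:.2f}")
print(f"High-exposure contrast: {subgroup_effect:.2f}")
```

Reporting both numbers addresses the practitioners' concern about low-exposure participants diluting the estimate, while keeping the randomization-protected ITT figure on the record.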
Is the Program Effective?
Various evaluation methods can be used to answer this question, including randomized experiments, quasi-experiments, interrupted treatment designs, theory-based evaluation, and qualitative case studies.
This effectiveness question can be rephrased in the following way: Would any observed changes in the youth exposed to the activities under evaluation have occurred without the program? In other words, would the changes have occurred because of temporal maturation alone, because of selection differences between the kinds of young people exposed to the two different sets of experiences being contrasted, because of statistical regression, or because of being tested on two different occasions and learning from the first one what to say or write on the second? Although attributing change to the treatment by ruling out the effects of alternative change-inducing forces is central to science and important to funders and most taxpayers, it seems to be a lower priority for many program practitioners. Documenting desired change during the course of program exposure is not the same as attributing the change to the program rather than to one of the many other forces that get people to change spontaneously.
The best way to answer the effectiveness question is through an evaluation that is experimental in design. However, despite the fact that experimental designs with random assignment allow the highest level of confidence in evaluating planned program effects (Shadish et al., 2001), true experimental designs with random assignment are not always the most appropriate design; these designs are most useful when a causal
question is being asked and not all evaluations are designed to address a causal question. (Common noncausal questions include: How well was the program implemented? What is the cost-benefit ratio for the program?) Randomized experiments are also not always practical, although they have turned out to be feasible in far more contexts than was thought to be the case even a decade ago (Gueron, 2001).
Nonetheless, they are often difficult to implement and program staff are often reluctant to participate in experimental evaluations for a variety of reasons—ranging from the difficulty and inconvenience often imposed by this method to very serious concerns about the ethics of denying an opportunity or service to a large portion of the adolescents needing it. Even so, experiments are the method of choice for evaluating the effectiveness of programs in communities, and program staff need to be part of the ongoing debates about the feasibility of random assignment, raising their objections and carefully considering the responses made to these objections. What is not desirable is that some veto be pronounced without serious consideration having been given to a strong design. In this connection, it is important for those who fund evaluation to be explicit in favoring random assignment experimental evaluations when the question of cause and effect is being asked. Such favoring exists in most disciplines today, but there is mixed support for this method in the emerging area of community programs for youth. And although the ethics of denying service to needy adolescents is a very serious concern, there are often more applicants for the program than can be served. In this case, random selection either into the program or onto a wait list can be implemented in a manner that is fair to all potential participants.
Strong quasi-experimental designs are sometimes a more realistic approach to assessing effectiveness. Shadish, Cook, and Campbell (2001) discuss a large number of quasi-experimental designs that vary in their strength for inferring a causal relationship between programs and change in young people. At a minimum, a strong quasi-experimental design entails two things: (1) extensive knowledge of young people's behavior and attitudes prior to or at the onset of treatment exposure, including pretest information on the very same measures that will be used for the posttest outcome assessment, and (2) comparison groups that are deliberately and carefully selected to be minimally different from the program groups. The latter is attained through careful matching procedures on reliable variables that are as highly correlated with the major outcomes as possible. Even stronger designs are outlined in Shadish, Cook, and Campbell (2001).
Done properly, both experimental and quasi-experimental methods provide quite valid information about program effectiveness. Internal validity is best addressed by the random assignment of subjects to control and experimental groups. But without it, pretest measures can be used to make sure the groups are similar on the strongest correlates of the major outcomes. Then statistical tests can be used to determine whether the two groups remain similar over time (that is, that those who drop out of the study from the two groups are similar to each other, leaving the remaining members of the two groups similar to each other on the pretest measures). Analyzing the experimental-control group contrast to assess whether there are differences between the groups on valid and theoretically appropriate outcome measures provides sound evidence of the effectiveness of the program.
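The pretest similarity check described above can be sketched in a few lines. This is a minimal illustration with invented scores; the helper `standardized_mean_difference` computes one common balance metric and is our own illustrative choice, not a procedure named in the text:

```python
from statistics import mean, stdev

def standardized_mean_difference(program, comparison):
    """Balance metric for one pretest measure: the difference in group means
    divided by the pooled standard deviation. Values near zero indicate the
    groups are similar on that measure; |SMD| < 0.1 is a common rule of thumb."""
    pooled_sd = ((stdev(program) ** 2 + stdev(comparison) ** 2) / 2) ** 0.5
    return (mean(program) - mean(comparison)) / pooled_sd

# Hypothetical pretest scores (e.g., a school-engagement scale) for the
# program group and a matched comparison group.
program_pretest = [3.1, 2.8, 3.4, 3.0, 2.9, 3.2]
comparison_pretest = [3.0, 2.9, 3.3, 3.1, 2.7, 3.2]

smd = standardized_mean_difference(program_pretest, comparison_pretest)
print(f"standardized mean difference: {smd:.3f}")
```

The same check can be repeated on the pretest scores of those who remain in the study at each wave, which is one way to monitor the differential attrition problem mentioned above.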
However, doing rigorous experimental and quasi-experimental evaluation studies is very difficult for community programs for youth, whose very nature makes randomized trial designs hard to implement. Consequently, stakeholders should be sure they need this level of evaluation before asking for it. Also, as discussed previously, the best programs for youth are dynamic and evolving in nature. In other words, they are continually experimenting with ways to improve and are better characterized as learning systems than as tightly specified programs. Such programs are difficult to evaluate with a randomized trial design. Box (1958) has provided an alternative model of evaluation that is better suited to this type of situation: the "evolutionary evaluation method." Essentially, Box outlines a method for doing mini-randomized trials within an organization to test the effectiveness of new activities and program modifications. As far as we know, this method has not been used in community programs for youth, but it would be a useful tool for organizations as they make changes in their array of programs.
Interrupted Time-Series Designs
Interrupted time-series is another methodology that can be used for evaluating programs aimed at youth in a community (Biglan et al., 2000). The method requires a series of observations prior to entering a program and then at least one observation afterward, although more are desirable. These observations can be made on individual young people or on some collective, such as those attending the program at earlier time points; in the latter case, however, the object of evaluation has to be some change within a program rather than the program itself. Statisticians argue for about 100 time points, but that is usually not possible. This is why Cook and Campbell (1979) and Shadish et al. (2001) argue in favor of abbreviated time-series, in which the deviation from a past trend can be observed even if the nature of the correlated error in the series has to be assumed rather than directly measured. At issue is whether there is a shift in the abbreviated series' mean or trend following entry into a program or, when some new practice within a program is being evaluated, following the introduction of that practice. The value of this approach is that it can rule out alternative interpretations of the program effect that are based on maturational changes in the young people or on statistical regression, which is common when a program is begun or modified because the situation in a community has suddenly worsened. The major problem, though, is that events occurring simultaneously with the new program or practice may also be responsible for any changes observed. To counter this alternative explanation, it is often useful to introduce a control time-series from a nearby location where the new practice under evaluation was not implemented. Shadish et al. (2001) provide a long list of variants on the abbreviated time-series analysis that can be used in specific kinds of circumstances.
Yet another evaluation possibility is offered by regression-discontinuity designs (Shadish et al., 2001). These have been repeatedly reinvented over the past 35 years in one social science discipline after another. Their key element is assignment to a program based on a quantitative score and nothing else. Thus, if young people can be assessed in terms of need or merit or their turn in the queue to enter a program, and if program participation depends on this score and nothing else, then an unbiased inference can be made about the program’s effects. If there is a main effect of the treatment, then there will be a discontinuity in the intercept of the regression line at the score that determines treatment eligibility; and if there is an interaction with the assignment score (or some correlate thereof), then the regression slopes will differ on each side of the cutoff. It seems counterintuitive at first that this design should result in unbiased inference. After all, assignment to treatment depends on the very selection processes that the randomized experiment was designed to rule out. Yet both designs have a key characteristic in common—that the assignment to treatments is completely known. This is
the key design characteristic necessary for producing an unbiased statistical inference. With the randomized experiment, assignment is known to depend only on some equivalent to the coin toss, while in the regression-discontinuity design, it depends only on the score on the quantitative assignment variable.
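The regression-discontinuity logic can be illustrated with a small simulation; all numbers, the cutoff, and the true effect size below are invented for illustration:

```python
import random

# Youth scoring at or above a need cutoff receive the program. Fitting
# separate regression lines on each side of the cutoff and comparing them
# *at the cutoff* estimates the program effect.
random.seed(1)
CUTOFF, TRUE_EFFECT = 50.0, 5.0

def outcome(score, treated):
    # Outcome rises with the need score; treatment shifts the intercept.
    return 0.4 * score + (TRUE_EFFECT if treated else 0.0) + random.gauss(0, 1)

data = [(s, outcome(s, s >= CUTOFF)) for s in range(20, 81)]

def fit_line(points):
    xs, ys = zip(*points)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return b, my - b * mx

b0, a0 = fit_line([p for p in data if p[0] < CUTOFF])   # comparison side
b1, a1 = fit_line([p for p in data if p[0] >= CUTOFF])  # program side
effect = (b1 * CUTOFF + a1) - (b0 * CUTOFF + a0)        # jump at the cutoff
print(f"estimated program effect at the cutoff: {effect:.2f}")
```

Because assignment depends only on the observed score, the jump at the cutoff recovers an unbiased estimate of the effect even though the treated and untreated groups differ systematically in need.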
Neither of these evaluation possibilities has been used much for examining youth programs, and neither will be easy to implement. Yet there are circumstances in which there is a long previous time-series of information at either the individual or the community level. We recommend in the next chapter that communities gather information about the well-being of their youth on a much more regular basis. If such information were systematically available in more communities, using the interrupted time-series design to assess program effectiveness at the community level would become much easier. There are also likely to be other circumstances in which it is ethical and politically acceptable to assign to services only those most in need, those most meritorious, or those who most press for program entry. Often this occurs when there is more demand for the program than spaces available. Of course, having more demand than supply also means that a randomized experiment is feasible, and a randomized experimental evaluation is superior to the regression-discontinuity design, if only because its statistical power is greater. If evaluation practice is to be improved in the youth development area, there will have to be greater concern for random assignment, regression-discontinuity, interrupted time-series, and quasi-experiments that have strong rather than weak designs. At a minimum, this means designs with pretest measures on the same variables as the outcomes, matched control groups, replications of the treatment, and statistical analyses that take account of modern developments in statistics, especially propensity score analyses (Rosenbaum, 1992) and recent versions of instrumental variables that emphasize local control groups, pretests, direct observation of the determinants of program participation, and sensitivity analyses (Heckman et al., 1996).
Some evaluators have suggested an alternative method for inferring effectiveness—theory-based evaluation. These evaluation theorists believe that theory-based evaluation can answer the question of whether the program is effective.
Theory-based evaluation acknowledges the importance of substantive theory, quantitative assessment, and causal modeling, but it does not require experimental or even quasi-experimental design. Instead it focuses on causal modeling derived from a well-specified theory of the processes that take place between program inputs and individual change. If the causal modeling analyses suggest that the obtained data are consistent with what the program theory predicts, then it is presumed that the theory is valid and success of the program has been demonstrated. If time does not permit assessing all the postulated causal links, information on the quality of initial program implementation nonetheless will be gathered because implementation variables are usually the first constructs in the causal model of the program. If the first steps in the theory are proceeding as predicted, the evaluators can then recommend that further evaluations be conducted when sufficient time has passed for the proposed mediating mechanisms to have their full effect on the proposed outcomes. This should prevent any inclinations toward premature termination of programs, even though it does not demonstrate that the ultimate outcomes have been reached.
The analysis of substantive theory is unquestionably necessary for high-quality evaluation; without it, analyses of implementation and causal processes cannot be carried out. However, the key question is whether a theory-based model can substitute for experiments rather than be built into them. It does not follow that long-term effects will come about just because the proposed mediators have been put into place. Without control groups and sophisticated controls for selection, it is impossible to conclude with any reasonable degree of confidence that the program, rather than some co-occurring set of events or circumstances, is responsible for the changes observed. Theory-based experimental approaches provide more confidence that it is the program itself that accounts for the effects, and they also provide strong clues as to whether the program will continue to be effective in other sites and settings that can recreate the demonstrably effective causal processes. We support the use of program theory in evaluation, but as an adjunct to experiments rather than as a substitute.
Qualitative Case Studies
Some argue that qualitative case studies are sufficient for producing usable causal conclusions about a program's effects. These are studies that focus on gaining in-depth knowledge of a program through interviews with various individuals involved, on-site observation, analysis of program records, and knowledge of the literature on programs like the one under review. Such studies involve little or no quantitative data collection and hence little statistical manipulation of the data. Moreover, although some case studies include multiple program sites, both for purposes of generalization and for comparison of different kinds of programs, case studies often concentrate on only a few sites, sometimes only one, because resources are usually too limited to collect in-depth information on many sites.
Case studies, done singly or at a sample of sites, are useful for describing and analyzing a program's theory, for studying program implementation, for surfacing possible unintended side effects, and for helping explain why the program had the effects it did. They are also very useful at the stage of theory development. Some programs are thought to work: sometimes this conclusion is based on randomized treatment designs of omnibus programs; at other times, it is based on more subjective criteria, such as user satisfaction or continued evidence of high performance by participants on criteria valued in the community. High-quality case studies can help one understand more about these programs. In turn, this information can be used to design new programs and then to evaluate their effectiveness using the more quantitative experimental designs discussed earlier in this chapter.
Qualitative case studies can also help reduce some of the uncertainty about whether a program has had specific effects, since they often rule out some of the possible spurious reasons for any changes noted in an outcome. Finally, qualitative case studies often provide exactly the kinds of information that are useful to policy makers, journalists, and the public. These studies provide the kind of rich detail about what it means to youth and their families to participate in particular programs. Of course, such information can be misused, particularly if it is not gathered with rigorous methods; such misuse of information is also possible using quantitative experimental methods. But when done with scientific rigor, particularly if done in conjunction with rigorous experimental and quantitative quasi-experimental studies, qualitative information can provide very important insights into the effectiveness of youth programs.
A separate issue is whether such studies reduce enough uncertainty about cause to be useful. The answer is complex and depends in part on how this information is to be used. In the committee’s view, such information can often be useful to program practitioners who want informa-
tion that will help them improve what is going on at their site. Case studies are probably a better source of information than having no information about change and its causes, and program personnel want any edge they can get to improve what they do. Leavened by knowledge from the existing relevant literature, the data collected on site, and the staff’s other sources of program knowledge, the results of a case study can help local staff improve what they do.
However, when the purpose of the evaluation is to estimate the effectiveness of a program in helping participants, we have doubts about the utility of case studies. There are three reasons for this. First, it is often difficult to assess how much each participant has changed between entry into the program and some later point. Second, and perhaps most important, there is no counterfactual against which to assess how much the participants would have changed had they not been in the program. And third, conclusions about effectiveness always involve a high degree of generalization (e.g., across persons, service practitioners, and program inputs), and many qualitative researchers are reluctant to make such generalizations. They prefer to detail the many known factors that make some difference to an outcome, and this is not the same as noting what average effect the program has or how its effectiveness varies by just a few carefully chosen factors.
Identifying the Most Appropriate Method
Experimental evaluation methods are often considered the gold standard of program evaluation and are recommended by many social scientists as the best method for assessing whether a program influences developmental outcomes in young people. However, many question the feasibility, cost, and time intensiveness of such methods for evaluating community programs for youth. In order to generate new, important information about community programs for youth, the committee recommends that the kind of comprehensive experimental evaluation discussed in this chapter be used under certain circumstances (see also footnote 1):
The object of study is a program component that repeatedly occurs across many of the organizations currently providing community services to youth;
An established national organization provides the program being evaluated through a system of local affiliates; and
Theoretically sound ideas for a new demonstration program or project emerge, and pilot work indicates that these ideas can be implemented in other contexts.
Comprehensive experimental evaluation is useful when the focus of evaluation is on elements that can be found in many organizations providing services to youth: how recruitment should be done and continued participation supported; how youth and parent involvement should be supported; how youth's maturity and growing expertise should be recognized and incorporated into programming; how staff members should be recruited and trained; how coordination with schools should be structured; how recreational and instructional activities should be balanced; how mentoring should be carried out; how service learning should be structured and supported; and how doing homework should be supported. Typically, these are only components of any single organization's program, but they are extremely important because they are common issues in organizations across the country. In our view, such elements are best examined by way of experimental research in which a sample of organizations is assigned to different ways of solving the problem identified. Although such work has the flavor of basic research on organizational development, it is central to an effectiveness-based practice of community youth programming.
Equally critical to an effectiveness-based practice approach is knowledge of the individual- and community-level factors that influence the effectiveness of these practices for different groups of individuals or communities. Consequently, it is also important that experimental methods be used to assess the replicability of such practices in different communities and with different populations of youth. Particular attention here needs to be paid to such individual and community differences as culture, age and maturity, sex, disability status, social class, educational needs, and other available community resources.
Comprehensive experimental evaluations are also called for in two other contexts. The first is when the target of evaluation is a national organization that has affiliates in many locations across the United States. The best model of this is the evaluation completed on Big Brothers Big Sisters to assess the effects of membership and of participation across the affiliates included in the sampling design. Many of these national organizations have been providing services to youth for many years and carry a disproportionate burden of current service provision. As a result, even when programs are still developing at the margins, many have mature
program designs. Since the total effect of these organizations is amplified across their affiliates, experimental evaluation with random assignment helps illuminate how effective these programs are.
The final context for comprehensive experimental evaluation is when some bold new idea for a new kind of service surfaces and critical examination shows that the substantive theory behind the idea is reasonable and that it is indeed likely to be able to be implemented. This situation is often called a demonstration project. Such demonstrations provide the substantive new ideas out of which the next generation of superior services is likely to emerge. As such, they deserve to be taken very seriously and to be evaluated by rigorous experiments.
Programs that meet the following criteria should be studied on a regular, ongoing basis with a variety of either nonexperimental methods or more focused experimental, quasi-experimental, and interrupted time-series designs, such as those advocated by Box (1958):
An organization, program, project, or program element has not sufficiently matured in terms of its philosophy and implementation;
The evaluation has to be conducted by the staff of the program under evaluation;
The major questions of interest pertain to the quality of the program theory, implementation of that theory, or the nature of its participants, staff, or surrounding context;
The program is quite broad, involving multiple agencies in the same community; and
The program or organization is interested in reflective practice and continuing improvement.
If Effective, Why?
An explanation of the reasons why a program is effective is important because it identifies the processes that are thought to be present for effectiveness to occur (Cook, 2000). This knowledge is crucial for replication of program effects at new sites because of the uniqueness inherent in delivering the program to new populations and in new settings. Causal or explanatory knowledge not only identifies the critical components of program effectiveness, but also specifies whether these components are moderator variables (variables that change the relation between an intervention and an outcome; common moderator variables include all types
of individual difference constructs, such as sex, ethnic group, disability status, age, and social class) or mediator variables (variables that mediate the impact of an intervention on specific outcome variables; common mediator variables are the many personal and social assets discussed in Chapter 3, which are often hypothesized to mediate the impact of program features on adolescent outcomes, such as school achievement, avoidance of problem behaviors, and conditions such as very early pregnancy). Supported by a clear understanding of the causal processes underlying program effectiveness, practitioners at new sites can decide how these processes can best be implemented with their unique target population and their unique community characteristics.
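The moderator/mediator distinction can be made concrete with a toy simulation; every variable, coefficient, and effect size below is invented for illustration only:

```python
import random

# Moderator: the program effect differs by subgroup (here, older vs. younger
# youth). Mediator: the program raises an intermediate "social asset," which
# in turn drives the outcome.
random.seed(0)

def simulate(n=4000):
    rows = []
    for _ in range(n):
        program = random.random() < 0.5
        older = random.random() < 0.5                 # candidate moderator
        asset = (1.0 if program else 0.0) + random.gauss(0, 1)  # mediator
        # Outcome works through the asset; the asset matters more for older
        # youth, producing a larger program effect in that subgroup.
        outcome = (2.0 if older else 1.0) * asset + random.gauss(0, 1)
        rows.append((program, older, outcome))
    return rows

def mean_outcome(rows, program, older):
    vals = [o for p, a, o in rows if p == program and a == older]
    return sum(vals) / len(vals)

rows = simulate()
effect_older = mean_outcome(rows, True, True) - mean_outcome(rows, False, True)
effect_younger = mean_outcome(rows, True, False) - mean_outcome(rows, False, False)
print(f"program effect, older youth:   {effect_older:.2f}")
print(f"program effect, younger youth: {effect_younger:.2f}")
```

The subgroup contrast (an interaction) is the signature of moderation; demonstrating mediation would additionally require showing that the program changed the asset and that the asset, in turn, predicts the outcome.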
Mixed methods are the most appropriate way to answer the question of why a program is effective. Theory-based evaluation is especially appropriate here and depends on the following steps (Cook, in press):
Clearly stating the theory a program is following in order to bring about change. This theory should explicitly detail the program constructs of both mediator and moderator relations that are supposed to occur if the intended program intervention is to impact major target outcomes. Chapters 2 and 3 can serve as an initial basis for developing elaborated theories of change.
Collecting both qualitative and quantitative data over time to measure all of the constructs specified in the program’s theory of change.
Analyzing the data to assess the extent to which the predicted relations among the treatment and the outcome variables have actually occurred in the predicted time sequence. If the data collection is limited to only part of the postulated causal chain, then only part of the model can be tested. The goal, however, should be to test the complete program theory.
A qualitative approach to theory-based evaluation collects and synthesizes data on why an effect came about; through this process, this approach provides the basis to derive subsequent theories of why the change occurred. The qualitative data are used to rule out as many alternative theories as possible. The theory is revised until it explains as much of the phenomenon as possible.
Both quantitative and qualitative implementation data can also tell us a great deal about why programs fail. In addition, these studies make it clear how the programs are nested into larger social systems that need
to be taken into account. When adequate supports are not available in these larger social systems, it is unlikely that specific programs will be able to be implemented well and sustained over time.
If Effective, How Valuable?
If a youth program is found effective, a comprehensive evaluation can then ask: Is it more valuable than other opportunities that could be pursued with the resources devoted to the program? Or, less comprehensively, is it more valuable than other programs that pursue the same objective? The techniques of benefit-cost analysis and cost-effectiveness analysis can offer partial but informative answers to these questions.
The fundamental idea of benefit-cost analysis is straightforward: comprehensively identify and measure the benefits and costs of a program, including those that arise in the longer term, after youth leave the program, as well as those occurring while they participate. If the benefits exceed the costs, the program improves economic efficiency—the value of the output exceeds the cost of producing it—and makes society better off. If the costs exceed the benefits, society would be better off devoting the scarce resources used to run the program to other programs with the same goal that do pass a benefit-cost test, or to other purposes.
Choices among competing uses of scarce public and nonprofit resources inherently embody judgments about relative benefits and costs. Benefit-cost analysis seeks to make the basis of such choices explicit so that difficult trade-offs can be better weighed. At the same time, benefit-cost analysis neither can nor should be the sole determinant of funding decisions. Aside from the limitations of any specific study, this technique cannot take into account moral, ethical, or political factors that are crucial in determining youth program policy and funding.
Any benefit-cost analysis must consider several key issues. What counts as a benefit? A cost? How can one measure their monetary value? If a benefit or cost is not measurable in monetary terms, how can it enter the analysis? How can one extrapolate benefits or costs beyond the point when a youth leaves a program and beyond any follow-up period during which impact data are gathered? The costs of youth programs mostly occur at the outset, while the benefits may be realized many years later. How should benefits and costs at different times be valued to reflect the fact that a dollar of benefit received in the far future is worth less than a dollar received in the near future, and that both are worth less than a dollar of cost incurred in the present? How can one assess benefits and costs to youth who participate in the
program, to taxpayers or other program funders, and to society as a whole? Persons who bear the costs of a program may well differ from those who share in the benefits. How can one incorporate these distributional impacts into the analysis? An enormous literature has arisen to address these issues. There are several excellent texts on the subject (e.g., Boardman et al., 1996; Zerbe and Dively, 1997).
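The discounting issue raised above can be sketched with invented numbers: suppose a program costs $1,000 per youth today and is assumed to yield $150 per year in benefits for 10 years. Whether benefits exceed costs in present-value terms then depends on the discount rate chosen:

```python
def net_present_value(cost_now, annual_benefit, years, rate):
    """Present value of a stream of future annual benefits, minus the
    upfront cost. A positive NPV means benefits exceed costs."""
    pv_benefits = sum(annual_benefit / (1 + rate) ** t
                      for t in range(1, years + 1))
    return pv_benefits - cost_now

for rate in (0.03, 0.10):
    npv = net_present_value(1000, 150, 10, rate)
    verdict = "passes" if npv > 0 else "fails"
    print(f"discount rate {rate:.0%}: NPV = ${npv:,.0f} ({verdict} the test)")
```

At a 3 percent rate this hypothetical program passes the benefit-cost test, while at 10 percent it fails, which is why careful analyses report how sensitive their conclusions are to the discount rate.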
If the principal benefit expected from a youth program cannot be given monetary values, cost-effectiveness analysis can be an alternative to benefit-cost analysis (Boardman et al., 1996). Suppose, for example, that the primary goal is to increase volunteer activity in community groups and that other possible program impacts are of little import to decision makers. In such a case, programs might be compared in terms of the number of volunteer hours they inspire per dollar of cost. Decision makers will want to fund the program that produces the largest increase in hours per dollar spent.
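The volunteer-hours comparison above can be sketched with invented figures; the program names and numbers are hypothetical:

```python
# When a single outcome matters, programs can be ranked by outcome per dollar.
programs = {
    "Program A": {"cost": 40_000, "volunteer_hours": 10_000},
    "Program B": {"cost": 55_000, "volunteer_hours": 12_100},
}
ratios = {name: p["volunteer_hours"] / p["cost"] for name, p in programs.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.3f} volunteer hours per dollar")
best = max(ratios, key=ratios.get)
print(f"most cost-effective on this single goal: {best}")
```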
Focusing on one goal is a strength in that it obviates the need to express the value of the outcome in monetary terms. Yet when interventions have multiple goals and no one goal has clear priority, cost-effectiveness data may not offer much guidance. If one youth program increases voluntary activity by 20 percent and reduces drug use by 15 percent, while an alternative, equally costly program produces an increase of 12 percent and a reduction of 20 percent, which is better? When there are multiple types of benefits, none of which dominates, and when some can be cast in monetary terms, a benefit-cost analysis that considers both monetary and nonmonetary benefits will usually provide more useful information than a cost-effectiveness analysis.
Systematic benefit-cost analysis has hardly been applied to youth programs, except for those likely to reduce juvenile crime (Aos et al., 2001). While application of this methodology to youth development programs is complex, it is no more so than in other areas of social policy, in which it has made significant contributions to research and policy analysis (e.g., health and mental health, early childhood education, job training, welfare-to-work programs). To advance youth program evaluation in this direction will require more rigorous evaluations with adequate follow-up periods and suitable data on a broad set of impacts.
Analysts and practitioners may be concerned that benefit-cost analysis will lead decision makers to focus too narrowly on financial values and downplay or ignore other important program impacts that cannot be translated into financial terms. However, a careful analysis will discuss nonmonetary benefits and emphasize that a complete assess-
ment of programs when important social values are at stake, such as in the area of youth development, must weigh such benefits along with the monetary ones. Analyses that fail to do so can be criticized for presenting an incomplete picture.
As with other evaluation methods, any benefit-cost analysis has limitations. It can be questioned because its results rest on judgments about which impacts to quantify and various other assumptions needed to conduct an analysis. Time and resource constraints prevent investigation of all possible benefits and costs. Some effects may be inherently unquantifiable or impossible to assess in financial terms yet considered crucial to a program’s success or political viability. Nonetheless, when carefully done with attention to the findings’ sensitivity to different assumptions, benefit-cost analysis can improve the basis on which youth development policy decisions rest.
In this chapter, we reviewed fundamentals of evaluation and important questions for the development of a comprehensive evaluation strategy. Several conclusions emerge from this discussion.
First, there are many different questions that can be asked about a program. A priority for program practitioners, policy makers, program evaluators, and others studying programs is to determine the most important questions and the most useful methods for evaluating each program. It is very difficult to understand every aspect of a program in a single evaluation study. Like other forms of research, evaluation is cumulative.
The committee identified six fundamental questions that should be considered in comprehensive evaluation:
Is the theory of the program that is being evaluated explicit and plausible?
How well has the program theory been implemented in the sites studied?
In general, is the program effective and, in particular, is it effective with specific subpopulations of young people?
Whether it is or is not effective, why is this the case?
What is the value of the program?
What recommendations about action should be made?
While it is difficult to answer all six questions well in one study,
multiple studies and evaluations could be expected to address all of these questions. Comprehensive evaluation requires asking and answering many of these questions through various methods. Opinions differ among program stakeholders (e.g., service practitioners, researchers, policy makers, and funders) about the most appropriate and useful methods for the evaluation of community programs for youth. No specific evaluation method is well suited to address every important question. And while there is tension between different approaches, the committee agrees that there are circumstances appropriate for the use of each of these methods. The method used depends primarily on the program's maturity and the question being asked. Comprehensive evaluations are rare, and they are probably most warranted for mature programs of wide interest.
The committee concluded that studying program effectiveness should be a regular part of all programs. At the same time, not all programs require the most extensive comprehensive experimental evaluation outlined in this chapter. In order to generate the kind of information about community programs for youth needed to justify large-scale expenditures on programs and to further fundamental understanding of the role of community programs in youth development, comprehensive experimental program evaluations should be used when:
the object of study is a program component that repeatedly occurs across many of the organizations currently providing community services to youth;
an established national organization provides the program being evaluated through many local affiliates; and
theoretically sound ideas for a new demonstration program or project emerge, and pilot work indicates that these ideas can be implemented in other contexts.
Such evaluations need to pay special attention to the individual- and community-level factors that influence the effectiveness of various practices and programs with particular individuals and particular communities.
The committee also discussed the need for more ongoing collaborative teams of practitioners, policy makers, and researchers/theoreticians in program design and evaluation. We conclude from case study materials on high-quality comprehensive evaluation efforts that the odds of
putting together a successful high-quality comprehensive evaluation are increased if there is an ongoing collaboration between researchers, policy makers, and practitioners. Yet such collaborations are hard to create and maintain.
When experiments are not called for, a variety of nonexperimental methods, as well as more focused experimental and quasi-experimental studies, can be used to understand and assess these types of community programs for youth. Such studies help program planners and program staff build internal knowledge and skills, and they can highlight theoretical issues about the developmental qualities of programs. Systematic program study of this kind should be a regular part of program operation.
Comprehensive evaluation is dependent on the availability, accessibility, and quality of both data about the population of young people who participate in these programs and instruments to track aspects of youth development at the community and program levels. The next chapter explores social indicators and data instruments to support these needs.