Approaches for accurately evaluating programs and drawing valid causal inferences about them are at the heart of the challenge of assessing the costs and benefits of early childhood interventions. Jens Ludwig and David Deming examined two aspects of the methodological challenges of designing evaluations: (1) drawing causal inferences from studies whose designs deviate from the ideal and (2) identifying individual or group effects in analyses that include multiple outcomes and multiple groups.
MAKING CAUSAL INFERENCES
Before one can assign dollar values to a program’s benefits and costs, one must first make sound estimates of those benefits and costs in the natural units in which they are normally measured, Ludwig explained. Doing so requires evidence of causal relationships, ideally collected through randomized experiments—although, as Karoly had already noted, this is not always possible. Ludwig discussed some of the methodological challenges that arise when randomized experiments deviate from the ideal design, as they often do in the real world. He also discussed alternatives for estimating causal relationships that can be used when randomized experiments are not feasible.
An example of some of these real-world challenges can be observed in a recent experimental study of 383 oversubscribed Head Start centers, which began in 2002 (Puma et al., 2005; see Box 2-1 for information about the Head Start Program).

The Head Start Program

Head Start is a federally funded school readiness program serving low-income families with young children. Created in 1965, the program focuses on preparing disadvantaged 3- and 4-year-olds for school by providing them with early education and providing their families with support in health, nutrition, and parenting. The services are supported with federal funds and delivered through locally based centers. Studies of outcomes for children who have received Head Start services show benefits that include improved performance on cognitive and academic achievement tests; increased earnings, employment, and family stability; and decreases in use of welfare and involvement with crime.

SOURCE: National Head Start Association (2009).

Ludwig noted that this study was designed to be nationally representative, with children who applied to Head Start but were not accepted serving as controls. However, as with most randomized trials, real-world complications have affected the progress of the study. Some of the participants dropped out of the study, and others may not have complied with all of its conditions—not responding to survey questions, for example. Ludwig pointed out that although the response rates in the Head Start study have been good, particularly considering that the program population is very disadvantaged, it is important to ask whether the level of attrition is sufficient to raise cautions about the causal inferences the study was designed to support.
There are several ways to approach that question. One would be to compare the baseline characteristics of the treatment and control groups. However, reassurance that the groups were basically similar would not provide a complete answer. Each baseline characteristic is estimated with its own confidence interval (a measure of the degree of confidence one can have in the value identified, based on sample size and other factors), and some characteristics will matter more than others in explaining differences in outcomes across groups. To address this concern, Ludwig explained, one might use regression adjustment to examine how the treatment-effect estimates change when one does and does not control for observable baseline characteristics.
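This check can be sketched with simulated data: under clean random assignment, the estimated treatment effect should barely move when baseline controls are added or dropped, so a large gap between the two estimates would be a warning sign. The sketch below is illustrative only; the data, effect size, and covariate are all invented, not taken from the Head Start study.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical baseline covariate (e.g., a standardized family-income score).
income = rng.normal(0, 1, n)

# Random assignment: treatment is independent of baseline characteristics.
treat = rng.integers(0, 2, n)

# Simulated outcome with a true treatment effect of 0.30.
outcome = 0.30 * treat + 0.50 * income + rng.normal(0, 1, n)

def ols_treatment_effect(y, columns):
    """OLS of y on an intercept plus the given columns; returns the
    coefficient on the first column (the treatment indicator)."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

unadjusted = ols_treatment_effect(outcome, [treat])
adjusted = ols_treatment_effect(outcome, [treat, income])

# Under clean randomization the two estimates should be close; a large
# gap would signal imbalance (e.g., from differential attrition).
print(f"unadjusted: {unadjusted:.3f}, adjusted: {adjusted:.3f}")
```

Both estimates land near the true effect of 0.30 here; in a study with attrition, the size of the gap between them is the diagnostic of interest.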
A further complication, however, is that some of the attrition in participation could be the result of factors that change over time, after the baseline characteristics are identified. A strategy for addressing that concern is to consider what statisticians call worst-case bounds, a way of highlighting the possible effects of systematic patterns in the missing data on the inferences one might draw from the results. To do this analysis, one first examines the best-case scenario: that all the children with missing outcomes in the treatment group would have had the best possible outcomes, and that all those with missing outcomes in the control group would have had the worst possible outcomes. By then examining the opposite assumption—that the missing members of the treatment group do as poorly as possible and those of the control group do as well as possible—one can see the full range of estimates that the missing data could support.
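The bounding exercise is easiest to see with a binary outcome, where the best and worst possible values are simply 1 and 0. The sketch below uses made-up data; attriters are marked with `nan` and imputed in the most and least favorable directions.

```python
import numpy as np

def worst_case_bounds(y_treat, y_control):
    """Worst-case bounds on the treatment effect for a binary outcome
    (1 = good, 0 = poor), where np.nan marks attriters whose outcomes
    are missing."""
    def mean_imputed(y, fill):
        return np.where(np.isnan(y), fill, y).mean()
    # Upper bound: missing treated children do as well as possible (1),
    # missing controls as poorly as possible (0).
    upper = mean_imputed(y_treat, 1.0) - mean_imputed(y_control, 0.0)
    # Lower bound: the opposite imputation.
    lower = mean_imputed(y_treat, 0.0) - mean_imputed(y_control, 1.0)
    return lower, upper

# Hypothetical data for ten children per arm.
y_t = np.array([1, 1, 0, 1, np.nan, 1, 0, np.nan, 1, 1.0])
y_c = np.array([0, 1, 0, np.nan, 0, 1, 0, 0, 1, 0.0])

lower, upper = worst_case_bounds(y_t, y_c)
print(f"effect bounded between {lower:.2f} and {upper:.2f}")
```

If the interval between the bounds still excludes zero, the qualitative conclusion survives even the most pessimistic assumptions about the attriters.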
A related concern is that not everyone selected into the treatment group will choose to participate in the Head Start Program from the beginning. The cleanest solution to that problem is to examine outcomes for everyone randomly assigned to the treatment group, regardless of whether they participate in Head Start—the “intent-to-treat” group. This approach will make it possible to identify the effect of being offered Head Start, but it will lead to an underestimate of the effects of actually participating in Head Start (because some children assigned to the treatment group do not participate). This point is often lost in policy discussions of the Head Start study results, Ludwig observed.
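When no one in the control group gains access to the program, the two quantities are related in a simple way: the effect of actually participating is the intent-to-treat effect scaled up by the take-up rate (the Bloom adjustment). A sketch with hypothetical numbers:

```python
# Bloom adjustment sketch. With no control-group crossover, the effect of
# treatment on the treated (TOT) equals the intent-to-treat effect (ITT)
# divided by the take-up rate. All numbers below are hypothetical.

itt_effect = 0.15      # estimated effect of being OFFERED the program
take_up_rate = 0.80    # share of the assigned group that actually enrolled

tot_effect = itt_effect / take_up_rate  # effect of actually participating

print(f"ITT = {itt_effect:.4f}, TOT = {tot_effect:.4f}")
```

The gap between the two numbers grows as take-up falls, which is why quoting the intent-to-treat estimate as the effect of "attending Head Start" understates the participation effect.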
It is also important to compare “apples” to “apples” when comparing impacts across programs or when doing benefit-cost analysis. In such programs as the Perry Preschool Project, for example, almost everyone assigned to the treatment group participated, so one would need to compare its effects with the effects of actually participating in Head Start (the “effects of treatment on the treated”) to obtain a useful comparison. Similarly, one would not want to use the intent-to-treat effect to assess the costs of actually participating in Head Start. The fact that many children in the control group are likely to participate in some other program (rather than receive no treatment at all) further complicates the effort to carry out benefit-cost analysis.
Participation rates vary across programs and studies, and it is important to consider this point when results are compared, so that the comparisons are valid. Moreover, Ludwig emphasized, if the value of the benefits has been analyzed in terms of the entire group selected for treatment (the intent-to-treat group) but the cost estimates are based on the actual costs of enrolling a child in the program, the benefit-cost analysis will be skewed. The fact that the characteristics of those who do and do not participate may vary, in both the control and the treatment groups, also complicates the analysis.
Ludwig also explored the question of what alternatives to randomized experiments exist in a world in which true randomization is seldom possible. He noted that discussion of this issue often takes on an “almost theological flavor,” yet it is possible in many cases to figure out the extent of selection bias in studies that are not randomized. He cited a study in which researchers compared the results of two methods: they conducted a rigorous randomized trial of an intervention (a job training program) and then, separately, used nonexperimental data and estimation methods to evaluate the same intervention (LaLonde, 1986). Others have used this method in different contexts and, like LaLonde, have found substantial differences between the experimental and nonexperimental results (e.g., Dehejia and Wahba, 2002). This approach has also shown that the selection bias is likely to be application-specific—that is, it is likely to depend on the selection process for the particular program and on the quality of the data available for that program. Ludwig said that using the approach suggested by LaLonde to analyze the experimental data from the federal government’s recent Head Start study would provide valuable information about the potential biases that may affect nonexperimental estimates in the early childhood education area.
One particularly promising nonexperimental approach to estimating effects is based on the idea that “nature does not make jumps”: absent the program, outcomes should vary smoothly with the variable used to determine eligibility, so a sharp break at the eligibility threshold is likely to indicate a program effect. This approach, called regression discontinuity, has become increasingly common in the evaluation of early childhood interventions. Ludwig explained that studies using the LaLonde method suggest that regression discontinuity comes close to replicating the results that a true experimental study would yield. To illustrate, Ludwig used data from the early years of Head Start, which he and a colleague had analyzed (Ludwig and Miller, 2007). They noted that the counties that received early Head Start grants were those with the highest countywide poverty rates, and that no other programs at the time were addressing young children’s health risks in those or other poor counties. Thus, it is reasonable to assume that, apart from Head Start, children’s health outcomes should vary smoothly with respect to the baseline poverty rate. When they examined child mortality data from the period, they found that counties’ child mortality rates increased along with their poverty rates up to the threshold used in awarding the Head Start grants. The counties that received the grants, just above that threshold, then saw much lower child mortality rates than the counties just below it—an effect that clearly suggests that Head Start helped to reduce child mortality.
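The logic of a regression discontinuity estimate can be sketched with simulated county data: fit separate lines on either side of the funding threshold and read off the jump at the cutoff. Everything below (the threshold value, effect size, bandwidth, and noise level) is invented for illustration and is not from the Ludwig and Miller analysis.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000

# Hypothetical running variable: county poverty rate (percent).
poverty = rng.uniform(20, 80, n)
cutoff = 59.2  # illustrative threshold; counties above it receive grants
treated = (poverty >= cutoff).astype(float)

# Simulated mortality: rises smoothly with poverty, with a drop of 2.0
# in treated counties (the "jump" an RD design tries to detect).
mortality = 3.0 + 0.05 * poverty - 2.0 * treated + rng.normal(0, 0.5, n)

# Local linear regression within a bandwidth around the cutoff,
# allowing different slopes on each side of the threshold.
bw = 10.0
near = np.abs(poverty - cutoff) <= bw
x = poverty[near] - cutoff
d = treated[near]
X = np.column_stack([np.ones(x.size), d, x, d * x])
beta, *_ = np.linalg.lstsq(X, mortality[near], rcond=None)

# beta[1] is the estimated discontinuity at the cutoff.
print(f"estimated discontinuity: {beta[1]:.2f}")
```

Because the simulated drop at the cutoff is 2.0, the estimated discontinuity comes out near -2.0; in a real application the bandwidth choice and smoothness assumption would both need scrutiny.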
This approach has been used in numerous studies to estimate the effects of universal prekindergarten programs. Ludwig suggested that, for optimal results, researchers should use “experimental thinking” in designing an evaluation, even when randomization is not possible. To illustrate this type of thought experiment, Ludwig described data from a study of a pre-K program in Tulsa, Oklahoma (Gormley et al., 2005). The researchers compared the impacts for children whose dates of birth were close but fell on either side of a cutoff for enrollment in the pre-K program. The authors assumed that the process through which families chose to participate in pre-K was similar for children in both groups, but in fact the two groups of families made their decisions in two different years and thus faced different choices about alternatives. The pre-K program also may have changed from one year to the next. Thus, he explained, post facto analysis of the data yielded notably different results depending on which samples were examined. A potentially easy fix for this concern would be to collect data on all children with dates of birth around the eligibility cutoff, which would be analogous to the sort of intent-to-treat analysis that is common in work with true randomized experimental designs.
Ludwig closed with the observation that program evaluation in early childhood education has come a long way since the early days of Head Start. The program began in 1965, and by 1966 the first study suggesting that it did not work had been published. That study was a simple cross-sectional regression comparing participants and nonparticipants. In the 1990s, Currie and Thomas (1995) pushed the field forward significantly by using sibling comparisons to control for differences in family background, and a 2002 study was the first nationally representative study of the program (Garces, Thomas, and Currie, 2002). Ludwig believes that further methodological refinements—as well as further analysis of existing Head Start data held by the U.S. Department of Health and Human Services, which workshop participants noted has been difficult to obtain—would be valuable.
EXAMINING MULTIPLE INFERENCES
Early childhood interventions are intended to have impacts in a number of domains—education, earnings, crime, and health, for example. Treatment effects also vary for subgroups that differ by gender, race, socioeconomic status, and other factors. Thus, common sense demands that studies examine multiple factors at a time, but the more factors that are included in analyses, the greater the chance for error—particularly a false positive, or Type I error. David Deming focused his presentation on multiple inference adjustments, which are strategies for accurately identifying individual effects in the context of an analysis that covers multiple outcomes and multiple groups.
There are two general approaches to this problem, Deming explained. The first is to combine various outcomes into a summary index, which reduces the number of statistical tests that are part of the experiment. To illustrate, Deming used a study by Anderson (2008) of data for three prominent early childhood interventions: the Carolina Abecedarian Project, the Perry Preschool Project, and the Early Training Project (see Box 2-2). Anderson grouped together outcomes that are evident at different life stages (preteen, teen, and adult years), and he did the same for different categories of outcomes (e.g., employment and earnings, physical health). This way of grouping the data, by time of life, made it possible for him to standardize and compare the results across a number of studies.
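An index of this general kind can be sketched as follows: each outcome is standardized against the control group's mean and standard deviation, and the standardized outcomes are averaged into a single score per child. Anderson's actual index weights outcomes by their inverse covariance matrix; equal weights are used here for simplicity, and all data are simulated.

```python
import numpy as np

def summary_index(outcomes, control_mask):
    """Combine outcome columns into one summary index by standardizing
    each column to the control group's mean and SD, then averaging
    (equal weights; a simplification of Anderson's 2008 index)."""
    z = np.empty_like(outcomes, dtype=float)
    for j in range(outcomes.shape[1]):
        col = outcomes[:, j]
        mu = col[control_mask].mean()
        sd = col[control_mask].std(ddof=1)
        z[:, j] = (col - mu) / sd
    return z.mean(axis=1)

# Hypothetical data: rows = children, columns = outcomes
# (e.g., a test score, earnings, an employment indicator).
rng = np.random.default_rng(3)
n = 200
control = np.arange(n) < 100
Y = rng.normal(0, 1, (n, 3))
Y[~control] += 0.4  # simulated treatment effect on every outcome

index = summary_index(Y, control)
effect = index[~control].mean() - index[control].mean()
print(f"index treatment effect (in control-group SDs): {effect:.2f}")
```

The single test on the index replaces three separate tests, and its effect is expressed in control-group standard deviations, which is what makes results comparable across studies.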
The second approach, which can be combined with the first, is to adjust the p-values (the probability of obtaining results at least as extreme as those observed if there were truly no effect) to correct for the fact that the likelihood of a false positive increases with the number of factors being analyzed. This can be done using the Bonferroni method or a method called free step-down resampling; the purpose of the latter is to account for the dependence structure in outcomes. For example, Deming explained, two outcomes such as high school graduation and college attendance are likely to be correlated; if they are treated as independent, some information is lost and the adjustment becomes more conservative than it needs to be. Anderson (2008) used an approach that standardized each variable to have a mean of 0 and a standard deviation of 1 and then put them together on a single scale. The resulting effects are fairly large, Deming explained.
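The Bonferroni correction itself is a one-line computation, shown here with hypothetical p-values. Because it treats every test as independent, it is the most conservative of the adjustments mentioned above, which is precisely the motivation for resampling-based alternatives when outcomes are correlated.

```python
# Hypothetical p-values from tests of five outcomes in one study.
p_values = [0.012, 0.030, 0.240, 0.004, 0.650]
m = len(p_values)

# Bonferroni: multiply each p-value by the number of tests, capping at 1.
adjusted = [min(1.0, p * m) for p in p_values]

for raw, adj in zip(p_values, adjusted):
    print(f"raw p = {raw:.3f}  ->  adjusted p = {adj:.3f}")
```

With five tests, only the smallest raw p-value (0.004) remains below 0.05 after adjustment, illustrating how multiple-inference corrections raise the bar for declaring any single effect significant.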
Making an adjustment of this sort can yield a big difference in the outcome of an analysis, but the results depend on which outcomes are covered in data collection as well as the decision about which outcomes to include in the summary index. Even more complex are the decisions about what data to include if one is attempting to use this procedure in analyzing results across several studies—so it is important to identify precise standards for selecting the studies to include, rather than simply using those that are best known, for example.
Deming closed with the point that the academic questions about the pros and cons of different statistical procedures do not always line up well with important policy questions. Different analytical approaches may yield different results, as happened, for example, with two benefit-cost studies of the Perry Preschool Project: one found larger effects for girls, and the other found larger effects for boys. But, Deming suggested, the reason for the discrepancy lay in how the two studies balanced interest in the certainty that the effects occurred at all against interest in the magnitude of the possible effects.
His view is that while both are important, in the end, policy makers may need to be comfortable with some degree of uncertainty. If the data are adequate to demonstrate that benefits accrue from a particular program, even though it is not clear whether they will yield $100,000 or $400,000 in value, then the benefit alone may be sufficient to support proceeding with a relatively low-cost program. Even though there are many ways of clustering the data, he suggested, the potential costs and benefits provide clues to the most sensible way to conduct the statistical analysis, by indicating the priority that should be assigned to various questions. The possible differences in outcomes for 3- and 4-year-olds may be outweighed by the outcomes that are evident when data for these two groups of children are lumped together. Thus, it might be worse to fail to adopt an intervention that could significantly affect crime rates than to risk wasting a relatively small amount of money on a program that does not turn out to be effective. Participants reinforced this view. One commented that standards for Type I errors can be very high in studies of early childhood interventions, wondering “why are we so afraid that we might find an effect? We are giving pretty broad latitude to the possibility that there are meaningful effects that aren’t passing the statistical tests.”

Three Early Childhood Interventions

The Carolina Abecedarian Project

The Carolina Abecedarian Project was a research study of the potential benefits of providing educational interventions to low-income children in a child care setting. The researchers identified a group of children at high risk of impaired cognitive development and randomly assigned them to receive (or not) an intensive preschool intervention between 1972 and 1977. The preschool intervention focused on social, emotional, and cognitive development, with a particular focus on language, in a full-time, year-round program. The children were followed until they reached age 21, and those enrolled in the preschool intervention showed lasting gains in IQ and mathematics and reading achievement.

SOURCES: Masse and Barnett (2002); FPG Child Development Institute (2009a).

The Perry Preschool Project

The Perry Preschool Project was a study of the effects of high-quality care and education on low-income 3- and 4-year-olds conducted by the HighScope Educational Research Foundation. This four-decade study began in 1962 with the identification of 123 children living in poverty in Ypsilanti, Michigan, half of whom were randomly assigned to receive an intensive preschool intervention that included home visits and full-time care and education (there were intentional departures from true randomization, so this was not technically a randomized controlled trial). Followed until they reached age 40, the children demonstrated numerous lasting economic and other benefits of the treatment, including higher scores on achievement and other tests, higher high school graduation rates, higher employment rates and earnings, and lower rates of involvement in crime.

SOURCE: HighScope Educational Research Foundation (2009).

The Early Training Project

Begun in the 1960s, the Early Training Project was a study of the effects of an intervention designed to improve the educational achievement of disadvantaged children. The children were randomly selected to participate in a 10-week, part-day preschool program during the summer and to receive weekly home visits throughout the year. The random assignment evaluation showed gains for program participants in IQ, vocabulary, and reading, although some of the benefits appeared to fade over time.

SOURCE: Gray and Klaus (1970).