4
Evaluation Methods and Issues

Chapter 3 identified three types of questions: monitoring the well-being of the low-income population, tracking and documenting what types of programs states and localities have actually implemented, and formally evaluating the effects of welfare reform relative to a counterfactual. In this chapter we consider only the last of these types of questions. The proper methods for conducting monitoring studies and for determining what policies have actually been implemented are primarily data collection issues; there is no evaluation methodology component to these questions. They are discussed in Chapter 5 in connection with data issues.

This chapter has five sections. In the first, we provide an overview of evaluation methodologies (discussed in more detail in our interim report [National Research Council, 1999] and in evaluation texts). The second section discusses the relative advantages and disadvantages of alternative evaluation methods in relation to each of the different evaluation questions of interest identified in Chapter 3. The third part of the chapter discusses several specific evaluation methodology issues in more detail: the reliability of nonexperimental evaluation methods, statistical power in nonexperimental methods, generalizability, process and qualitative research methods to complement formal evaluation analyses, and the importance of welfare dynamics for evaluation. The fourth part assesses the evaluation projects currently under way (discussed in Chapter 2) in light of the findings that have been presented on the different evaluation methods. The final part of the chapter briefly considers ways in which federal and state agencies can improve evaluations of welfare reform.

We note that recommendations of appropriate evaluation methodologies are sometimes influenced by data availability, for the two are necessarily intertwined.



Chapter 5 presents our major discussion of data issues and data needs; it follows this chapter because data needs should be dictated by what is needed for evaluation. However, a discussion of the strengths and weaknesses of evaluation methods is inevitably influenced by the types of data currently available, likely to be available, or remotely possible to collect; in that context, data issues do arise in this chapter.

OVERVIEW OF EVALUATION METHODS

Formal evaluation studies are those that attempt to estimate the “effect” of a policy change, or the impact of the change on the outcomes of interest. By common usage of the word “effect,” this implies that it must be determined what would have happened to those outcomes if the policy change had not occurred. Thus, a formal evaluation study requires the estimation of two quantities: the outcomes that have actually occurred following a policy change, and those that would have occurred if the policy had not changed. The latter is called the “counterfactual.” The basic difficulty in all evaluation studies is that the counterfactual is not naturally or directly observed—it is impossible to know with certainty what would have happened if a policy change had not occurred.[1] All evaluation methodologies attempt, implicitly or explicitly, in one way or another, to estimate those counterfactual outcomes. In experimental methods, the counterfactual outcomes are estimated by means of a control group to which individuals have been randomly assigned. In nonexperimental methods, they are estimated by means of a comparison group: a group of individuals who were not randomly assigned but who are considered to be similar to those who received the policy.

The different types of policy alternatives of interest to different audiences all fit within the counterfactual conceptual framework. Comparing PRWORA in its entirety to its precursor, AFDC, constitutes one pair of policy alternatives, for example, that we concluded would be of interest to many observers. Comparing PRWORA in its present form to a modified PRWORA that might result if its components were altered or improved in some way constitutes another pair of alternatives in which many policy makers and others are interested. Sometimes three alternatives are considered, such as when the goal is to compare (say) two alternatives (Policy A and Policy B) to current policy. Most of the general issues we discuss for different evaluation methods are the same regardless of which of these policy comparisons is of interest.

[1] A recent discussion of the counterfactual approach can be found in Dawid (2000), with several commentaries on this article in the June 2000 issue of the Journal of the American Statistical Association.
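The counterfactual idea can be stated compactly in potential-outcomes notation; the notation here is illustrative and is not used in the report itself. For a family or area i, the effect of a policy change is

\[
\Delta_i = Y_i(1) - Y_i(0),
\]

where Y_i(1) is the outcome observed after the policy change and Y_i(0) is the counterfactual outcome that would have occurred without it. Only one of the two terms is ever observed for any given unit, so each of the methods discussed below substitutes an estimate of the missing term: a randomly assigned control group in experiments, or a constructed comparison group in nonexperimental designs.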

Experimental Methods

Randomized experiments have a long history in welfare reform evaluation and have produced some of the most influential results in past reform eras. The strength of the experimental method is its high degree of credibility: randomization ensures that those who experience the policy change (the experimental group) are similar, in all important ways, to those who do not experience it (the control group), except for the difference in treatment (the policy) itself. In evaluation terminology, well-designed and well-conducted experiments have strong “internal validity” because they have considerable credibility in generating correct estimates of the true effect of the policy tested in the location and on the population of individuals enrolled in the experiment. The experimental method is also influential because it is simple and easy for policy makers to understand.[2]

Despite these strengths, the experimental method has weaknesses as well (see Burtless [1995] and Heckman and Smith [1995] for discussions of these issues). A common weakness is that the results of the experiment may not generalize to types of individuals other than those enrolled in the experiment, to different areas with different economic and programmatic environments, or to policies that differ slightly from those tested in the experiment. In evaluation terminology this is the “external validity” problem. The severity of this problem can be reduced if a large number of experiments are conducted in multiple sites, on different populations, and with different policy features, but the expense of doing so is generally prohibitive.

A related problem is that experiments are ill suited to estimating the effects of large-scale policy changes that are intended to change the entire culture of a welfare system. If the program is tested on only a small group of individuals in a few cities, the culture will not be affected. However, if the program is enacted nationwide with only a small group of individuals still subject to the old program as a control group, the cultural effects will occur, but they will also affect the control group. The experimental method is also not well positioned to estimate so-called “entry effects,” effects that occur because a policy change affects the likelihood of becoming a welfare recipient in the first place. This problem may occur because most welfare experiments draw their experimental and control samples from welfare recipients and not from individuals who are not currently receiving welfare but who may later do so.

[2] Of course, experiments can be conducted badly: by incorrectly conducting the randomization, erroneously assigning treatment status, allowing some control group members to receive the treatment, or through high nonresponse rates or missing data that bias the results, among other implementation problems. See Gordon et al. (1996) for a study that found that many of the pre-PRWORA waiver experiments suffered from inadequate sample sizes, crossovers, contamination, and control group exposure, for example.

A somewhat related problem with experiments is that they usually take a relatively long time to design and implement, so that the policy change tested in the experiment may not be of interest to policy makers by the time the results are completed. Finally, experiments often have practical difficulties when they are conducted in real-world environments with ongoing programs and when they require the cooperation and effort of agencies engaged in running current programs.

Despite these weaknesses, the strengths of experiments for answering some types of questions cannot be overemphasized. Even if they may not be completely generalizable and even if they do not always capture all the relevant effects of the program, they provide more credible evidence than other methods for the effects of the programs in one location and on one population. In a policy environment where little credible evaluation research is available, even a small number of experimental results can contribute a great deal to knowledge.

Nonexperimental Methods

Nonexperimental methods are more diverse and heterogeneous than experimental methods, which is one of the reasons that there is often confusion about their nature and value. In all cases, however, nonexperimental methods require that the outcomes experienced by a group of individuals after a policy change be compared with the outcomes that occur for some other group—the comparison group that did not experience the policy change. A key difference between experimental and nonexperimental methods is that an experiment implements a particular policy or program and therefore ensures that it is exactly the one of interest (although, as noted, the time lag in obtaining results may significantly reduce this advantage, and, as in nonexperimental studies, the exact policy of interest may not actually be implemented as intended). Nonexperimental methods are necessarily more passive—they can only estimate the effects of programs and policy changes that have actually been implemented, which may not be those of greatest interest. This approach can be advantageous, however, if a wide variety of policy changes have been implemented in different areas, in different environments, and at different times, because a wider range of policies can be studied and generalization is easier. In evaluation terms, nonexperimental evaluations, if they make use of this range, have a greater potential for external validity than do experiments. This potential for external validity must be balanced against the weaker internal validity of nonexperimental methods—that is, the risk that the comparison group is not comparable to the group receiving the policy or program, so the correct effects are not estimated.

There are several generic types of nonexperimental evaluations used in welfare evaluations. Perhaps the most traditional is a “cross-area” comparison, which compares outcomes of similar individuals in different geographical areas where different types of policies have been implemented and attributes differences in their outcomes to the differences in policy.

A variation on this approach, which is still essentially cross-area in nature, follows individuals over time in different areas where policy is changing in different ways and observes the outcomes across areas. All these methods can be formulated as econometric models in which individual differences in characteristics are controlled for statistically. This approach is not available if all areas are affected by the same policy changes at the same time. In addition, even when there is cross-area variation, there is some danger that not all relevant differences in states’ outcomes, either at a given time or over time, are controlled for; omitted state differences may be correlated with policy choices, either by chance or by design.[3]

Another, cruder evaluation method is a pure time-series analysis—also called an interrupted time series or before-and-after method—which examines the pattern of outcomes for a group of individuals before and after a policy change. For this method, the “comparison group” is simply the population prior to the policy change. This approach can be implemented either with aggregate data or with micro data—that is, data at the level of the individual or family. In the latter case, the data follow the individuals or families over time before and after a policy change to see how outcomes change. These are among the weakest nonexperimental methods because outcomes change over time for many reasons other than the policy change (for example, changes in the economy and in other policies) that are difficult to control for fully. Outcomes may also change for a given cohort of individuals simply because those individuals age.

However, the cohort comparison method (or simply the use of aggregate data, which implicitly uses different cohorts) circumvents this problem by examining a population at the same age at each point in time. The cohort comparison method examines the outcomes over time of multiple groups of individuals (cohorts) who experience different policies because policy is changing over time.[4] If the analysis is conducted in only one area, or in the nation as a whole, the method is essentially a time series; it differs from pure time-series analysis only inasmuch as the cohorts are assumed to be alike in other respects because they are of the same age or are on welfare at the same time. The cohort comparison method can be combined with the cross-area method by comparing changes for different cohorts in different areas where policy has been changing, leading to a cross-area cohort comparison method.

[3] The latter is a case of what is known as “policy endogeneity” and occurs when different policies are chosen by different states on the basis of the populations and their outcomes in the state—the same outcomes that are examined to assess the effects of policies.

[4] Cohort comparison methods were used in the evaluation of the 1981 AFDC reforms (Research Triangle Institute, 1983; U.S. General Accounting Office, 1984); both studies examined the exit rates and outcomes of a pre-1981 cohort and a post-1981 cohort.
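As a rough illustration of how such cross-area (and cross-area cohort) comparisons are typically formalized as econometric models, consider the following sketch; the notation is ours and only indicative, not a specification taken from any of the studies discussed here. An outcome Y for individual i in state s at time t might be modeled as

\[
Y_{ist} = \alpha + \beta P_{st} + X_{ist}'\gamma + \mu_s + \tau_t + \varepsilon_{ist},
\]

where P_{st} measures the policy in force in state s at time t, X_{ist} contains the individual characteristics that are controlled for statistically, and \mu_s and \tau_t absorb fixed state and time differences. The coefficient \beta is the estimated policy effect, and it is credible only if the remaining state-by-time differences in outcomes are unrelated to policy choices, which is exactly the policy-endogeneity concern raised in footnote 3.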

An issue in the cohort comparison method when applied to welfare reform concerns how the cohorts should be defined. If two cohorts are drawn from the welfare rolls at different times—say, one cohort before the legislation and one after—there is a danger that the two cohorts are noncomparable. Noncomparability can arise if (for example) the caseload is falling and those still receiving welfare after the legislation goes into effect are different—for example, more disadvantaged—than those in the first cohort. The exit rate of the second, more disadvantaged cohort is likely to be lower than that of the first cohort. This lower exit rate is not due to a policy change, but rather to the fact that the cohort of welfare recipients has itself changed. This difference can make it difficult to distinguish “true” effects of the legislation on exit rates—that is, whether it really does cause a given recipient to leave welfare sooner than she would have otherwise—from spurious “selection” effects, which arise if the exit rate in the second cohort differs from the first solely because of differences in the make-up of the caseloads.

Another set of nonexperimental methods enjoying some popularity is “difference-in-difference” methods. This method compares the evolution of outcomes over time for different individuals in the same area where a single policy change has occurred, but where some individuals are in a position to be affected by the change while others are not (Meyer, 1995). Those assumed not to be affected by the policy change constitute the comparison group. In most implementations of the method in welfare reform evaluations, the comparison group is chosen to be a group of individuals ineligible for welfare, or at least ineligible for the policy change in question. Common comparison groups are single women without children, married women with or without children, and men, groups that are mostly ineligible for AFDC or TANF. Sometimes single mothers who are more educated and hence of higher income are used as a comparison group for low-income single mothers because the former group is generally ineligible for welfare. The key assumption in the method is that the evolution of outcomes of the group affected by the policy change (e.g., single mothers) would have been the same as that of the comparison group in the absence of the policy change. The major threat to the credibility of this method is that the two groups are sufficiently different in their observed and unobserved characteristics (although observed characteristics can be controlled for) that these differences, and not the policy difference, account for the differences in outcomes.
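A minimal sketch of the difference-in-difference calculation may make the logic concrete; the symbols are illustrative and not drawn from the report. Let \bar{Y} denote mean outcomes for the affected group A (e.g., single mothers) and the comparison group C (e.g., women without children), measured before (t = 0) and after (t = 1) the policy change. The estimated effect is

\[
\hat{\delta}_{DD} = \left(\bar{Y}_{A,1} - \bar{Y}_{A,0}\right) - \left(\bar{Y}_{C,1} - \bar{Y}_{C,0}\right).
\]

The comparison group’s change over time stands in for the change the affected group would have experienced without the policy. The estimate is therefore only as good as the assumption stated above: that the two groups’ outcomes would have evolved in parallel in the absence of the policy change.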

Another nonexperimental evaluation method that is quite similar to the difference-in-difference method, but that is implemented quite differently and actually predates it, is the method of matching.[5] In this method, comparisons are made within given areas between those who are directly affected by a new reform and a comparison group of individuals (or sometimes populations) who are, for one reason or another, not directly affected. Although in principle the types of individuals used to construct a comparison group could be quite similar to those just mentioned for the difference-in-difference method, in practice the method of matching follows the exact opposite strategy of seeking a comparison group that is as similar in observed characteristics to the affected group as possible. Typically, the group will be drawn from the population of eligibles (usually those not participating in the program) rather than from the population of ineligibles, as in the difference-in-difference method. The two groups are matched on observable characteristics (age, education, earnings and welfare history, geographic location, etc.) to eliminate differences resulting from those factors. Like the difference-in-difference method, this method can be implemented in a single area with a single policy change and does not require cross-area or over-time variation in policy in order to estimate effects. Also like the difference-in-difference method, the major threat to the matching method is that there are unmeasured characteristics that differ between the two groups and are related to the reason that one group was subjected to the policy and the other was not. Because there is no policy variation per se—all individuals reside in the same area, under a single policy—comparison groups have to be constructed from individuals who are, for example, not on welfare, or who are on welfare but are exempted from the new reform by reason of some characteristic they possess (e.g., very young children). Learning whether the members of the comparison group are really comparable to those who were made subject to the new policy—in the sense of having the same outcomes as they would have had in the absence of the policy—is difficult.

The major disadvantage of nonexperimental methods is that it is difficult to assess the degree of bias in the estimates of a policy’s or a program’s effects because of threats to internal validity from the choice of a comparison group. This problem has been given extensive attention in the research literature on nonexperimental evaluation methods. The most convincing approach is simply to conduct formal sensitivity analyses that reveal how different degrees of bias that are thought to be present, on a priori grounds or on the basis of other information, affect the estimates of program effects.[6] The magnitude of the effect of a policy change is also important because any given amount of bias is less likely to affect the sign (positive or negative) and policy importance of the estimate if the magnitude is large. This truism underlies the common supposition that nonexperimental methods have greater credibility in cases in which a large effect of the program under study is expected and less credibility in cases in which a small effect is likely, for in the latter case it is more likely that any bias in the estimate will swamp the true effect.

[5] Although matching has a long history in program evaluation, a variant that has received attention more recently is the propensity score, which bases the match only on the predicted probability of participation. See Rosenbaum and Rubin (1983) for the initial article and Hahn (1998) and Heckman et al. (1997, 1998) for recent contributions in the econometrics literature on this method.

[6] Some practitioners have proposed that program effects could be estimated using more than one of the available nonexperimental methods and then compared across methods. The presumption is that if similar estimates are obtained across each method, then the estimates are credible. Unfortunately, there is no scientific basis for this approach because the threats to internal validity for each method are different.
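To make the propensity-score variant mentioned in footnote 5 concrete (again a sketch in our own notation, not a formulation from the report): let D_i indicate whether family i is subject to the reform and X_i denote its observed characteristics. The propensity score is the conditional probability of being subject to the reform,

\[
e(x) = \Pr(D_i = 1 \mid X_i = x),
\]

and each affected case is matched to unaffected cases with similar values of e(X_i), with the effect estimated as the average outcome difference within matched sets. As the text emphasizes, this recovers the true effect only under the untestable assumption that, conditional on the observed characteristics, no unmeasured differences remain that are related both to exposure to the policy and to outcomes.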

Despite the threats to internal validity in all nonexperimental methods, they can be very useful when carefully implemented. Pure time-series methods, the crudest of the nonexperimental approaches, are useful as a descriptive piece of evaluation showing whether, given the other changes over time that can be controlled for, a policy change is correlated with a deviation from the trend in outcomes. Cross-area methods have credibility when the groups of individuals examined in each area are strongly affected by policy, when the policies in each area are adequately measured, and when there is a reasonable judgment that the existence of different policy changes is not correlated with outcomes. Difference-in-difference methods have some credibility, particularly for large systemwide changes. Only matching methods suffer from an inherent inability to judge credibility, because they depend on untestable assumptions about unobserved characteristics (we discuss some methods for testing the validity of the matching method below). For all nonexperimental methods, credibility is increased if the expected magnitude of the effects is large.

The major advantage of nonexperimental methods is that they have greater generalizability, across a great diversity of areas and population groups, than experiments. These methods can also be used to capture entry effects. Nonexperimental methods are, therefore, a necessary part of welfare reform evaluation.

Process Analysis and Qualitative Methods

Implementation and process analyses collect information on the implementation of policy changes; how those changes are operationalized within agencies, often at the local level; what kinds of services actually get delivered and how they get delivered; and, sometimes, how clients perceive the services. They can be used in conjunction with either experimental or nonexperimental analyses, although analysts disagree about their role in formal evaluations. At one level, they can be seen merely as providing a more accurate description of what is being evaluated: in evaluation language, they provide a more precise description of the policy treatment. This is, indeed, the way many process and implementation evaluations are used.[7]

A more ambitious role for process and implementation analyses is to assess the effects of experimentally varied or nonexperimentally observed differences in policy implementations. For example, one could have a randomized trial in which the actual policy or program treatment offered is the same for both experimental and control groups, but for which the implementation differs. Or, in a nonexperimental analysis that correlates program variation with outcomes across a number of areas, measures of implementation might be used to characterize each area’s program, in addition to the formal program descriptions.

[7] See Corbett and Lennon (forthcoming).

The effects of implementation differences could then be estimated. This more ambitious goal has not been attempted in any systematic way in welfare reform evaluations thus far, partly because of the difficulty of constructing measures of implementation that are comparable across areas (see below).[8]

Implementation and process analysis can also play an important informal role in interpreting the estimated effects from analysis of the official treatment. Implementation and process analysis can reveal features of a program that have been carried out successfully—in the way the program designers intended and expected them to be carried out—and it can also reveal features that are not carried out successfully. If difficulties or failures of implementation are found, these can be used to consider why the estimated effect of a program was larger or smaller than anticipated or even why the program had no apparent effect. This interpretative, hypothesis-generating function of implementation and process analysis can be quite valuable when formal effect estimates come out differently than expected.[9]

More broadly, qualitative methods can be used not only as a method of collecting information for process and implementation analyses, but also to study the behavior of individuals and families. Qualitative methods may involve collecting data through focus groups, semi-structured interviews (sometimes longitudinally with the same individuals or families over time), open-ended questions in surveys, and ethnographic observations of individuals (Newman, 2001). A process study may also use one of these methods to collect data on how caseworkers are implementing a particular policy.

Qualitative methods used to collect information on individuals and families can serve multiple roles in evaluation settings, in a different dimension than process and implementation analysis. To some extent, such data may simply provide a better measure of outcomes than formal survey or administrative data measurements because they provide much more in-depth information on how individuals and families are affected. In principle, it is possible that formal evaluations of different programs could yield similar estimates of outcomes but that quite different outcomes would be found with the qualitative analysis. This provides a valuable insight into how seemingly similar program effects are not the same. In addition, like implementation and process analysis, qualitative data can also provide insights into the precise mechanism by which policy affects individuals’ lives (or, perhaps more commonly, fails to affect those lives).

[8] Although this approach has not been formalized, many process evaluations interpret their results in a causal way; that is, they argue implicitly that the outcomes generated by the program would have been different if implementation had been different. However, without the types of formal comparisons described here, such inferences do not have a strong basis.

[9] In addition, the discoveries concerning the reasons for program outcomes obtained in this way can lead to new program innovations, which can subsequently be tested with evaluations that again contain qualitative components, leading to a cycle of evaluation and discovery.

Just as implementation and process analyses can discover what works or does not in the delivery of program services, qualitative data can reveal the mechanisms and processes by which program services or offers of services are translated or incorporated into the lives of individual families. Formal survey and administrative data outcomes typically are too crude to ascertain the details of that mechanism. By studying the complexity of individual experiences, qualitative data can illuminate both how successful programs achieved their successes and why some programs have been unsuccessful or have had unexpected outcomes. This information can then be used to design improved programs or programs that are differently configured so as to avoid the undesirable outcomes. (In Chapter 5, we discuss how qualitative and ethnographic studies can also be used to enhance survey data collection.)

EVALUATION METHODS FOR THE QUESTIONS OF INTEREST

In Chapter 3 we delineated three formal evaluation questions of interest: What are the overall effects of structural welfare reform? What are the effects of individual, broad components of a welfare reform? What are the effects of alternative detailed strategies of welfare reform within each of the broad components? In this section, we discuss the evaluation methods that are appropriate to answer each of these questions. We begin with a key conclusion.

Conclusion 4.1 Different questions of interest require different evaluation methods. Many questions are best addressed through the use of multiple methods. No single evaluation method can effectively and credibly address all the questions of interest for the evaluation of welfare reform.

Table 4–1 gives a summary of how this conclusion plays out for the questions of interest and evaluation methods available.

TABLE 4–1 Alternative Evaluation Methodologies for Different Questions of Interest

Experimental methods
  Overall effects: Poorly suited. Problems: contamination of control group; macro and feedback effects; entry effects; generalizability from only a few areas.
  Effects of individual broad components: Moderately well suited. Need to be complemented with nonexperimental analyses for entry effects and generalizability.
  Effects of detailed strategies: Well suited. Need to be complemented with nonexperimental analyses for generalizability and, possibly, entry effects.

Nonexperimental methods
  Overall effects: Moderately well suited. Time-series modeling and comparison group designs using ineligibles are the most promising. Problems: lack of cross-area program variation; data limitations.
  Effects of individual broad components: Moderately well suited. Cross-area comparison designs, followed over time, are the most promising. Problems: lack of cross-area program variation; measurement of policies; data limitations.
  Effects of detailed strategies: Poorly suited. Within-area matching designs may be the most appropriate, followed by cross-area comparison designs. Problems: extreme data limitations and lack of statistical power; uncertainty of matching reliability.

Estimating the Overall Effects of Structural Welfare Reform

Estimating the overall effects of structural welfare reform of the type that has occurred in the 1990s—that is, a reform that bundles together a number of significant changes in the program whose joint impact is to change the basic nature of the welfare program(s) involved—is perhaps the most challenging question for evaluators. Structural reform affects the entire programmatic environment, from the top policy level to the way that local welfare offices operate.

In a structural reform, families and individuals in low-income communities (both those on and off welfare) change their expectations about welfare programs; the level of community and neighborhood resources is affected; governments involved in the program (federal, state, and local) alter their spending and taxation levels and the types of services they offer; and other agencies and private organizations that serve the low-income population change, often restructuring themselves to meet new demands for their services. In such a changed environment, neither experimental methods nor most traditional nonexperimental methods can provide reliable estimates of what would have happened to individuals and families in the absence of the reform. As noted previously in our discussion of the drawbacks to experimentation when cultural effects are part of the outcome, a control group in a randomized experiment chosen just prior to the initiation of the reform will almost surely be affected by the broad effects created by the reform, so its outcomes no longer represent those that would have occurred in the absence of reform. This makes experimental comparisons subject to unknown bias. Nonexperimental methods that rely on cross-area variation are also

receipt) was composed of each of these three types and that these compositions were fairly consistent across most definitions of long-termers, short-termers, and cyclers.

A somewhat unexpected finding is that the degree of total welfare dependence—measured as the total time a woman spends on welfare over a long nineteen-year period, 1979–1997—was greater for cyclers than for long-termers. This is contrary to expectations for the ordering of recipients. The greater number of spells experienced by cyclers led them to a longer total time on welfare than long-termers; some long-termers had one or two long spells but then left welfare for the rest of the period. This finding raises the issue of how to define long-termers, cyclers, and short-termers; it is discussed more thoroughly by Moffitt (2001).[24] In examining the characteristics of the three groups, Moffitt found, as expected, that short-termers had the highest earnings when off welfare and the highest levels of education. However, again rather surprisingly, he found that cyclers had about the same levels of education as long-termers but lower levels of earnings off welfare. Thus, cyclers in his analysis seem to be the most disadvantaged of the three experience groups.

Stevens used Maryland administrative data and decomposed the AFDC-TANF caseload from 1985 to 1998 into the three experience groups (using all AFDC cases opened and closed sometime during this period). He disaggregated the data into four separate birth cohorts, each observed for a 10-year period within this time interval. Using definitions similar to those of Moffitt, Stevens found that almost 50 percent of the Maryland caseload were short-termers (a higher fraction than Moffitt found), about a third were long-termers, and the smallest group (about 20 percent) were cyclers. Moreover, he found that the fraction of short-termers had fallen slightly over time and the fraction of cyclers had risen, perhaps as a result of welfare reform or changes in the economy. When he examined earnings off welfare for the three experience groups, he found the expected ordering—highest earnings for short-termers, lowest for long-termers, and in between for cyclers, at least for the majority black population. However, for the white population, he found a changing ordering over time, beginning with the expected ordering (as for blacks), but, by the last cohort, cyclers had lower off-welfare earnings than long-termers. This interesting result, combined with the increase in the number of cyclers, suggests that many of the more disadvantaged women on welfare have become cyclers, again possibly as a result of welfare reform or changes in the economy.

[24] For example, one could define both long-termers and short-termers not by the number of spells and their lengths, but simply by the level of total time on welfare, defining long-termers as those with long total time on and short-termers as those with little time on. How cyclers would then be defined is unclear. If they are defined as those with a lot of time on and a large number of spells, long-termers would have to be restricted to those with a small number of spells. Moffitt argued that the common-sense idea of a cycler is based not on total time on welfare but simply on the number of spells, and that the degree of total time on should be an outcome measure from the definition rather than part of the definition.

Ver Ploeg analyzed the welfare leaver data used in one of the well-known Wisconsin leaver studies (Cancian et al., 2000). The sample included all those who received AFDC in July 1995, some of whom subsequently left welfare and some of whom stayed on welfare. AFDC and earnings records for the years 1989–1995 were used to classify leavers and stayers by their pre-1995 welfare experience, using definitions similar to those of Moffitt and Stevens. Ver Ploeg found that about 50 percent were long-termers, about 35 percent were short-termers, and the residual (about 15 percent) were cyclers. These percentages are different from those of either Moffitt or Stevens and suggest that there is considerable diversity across states in the composition of the welfare caseload.

Ver Ploeg also found that long-termers were much more likely to be stayers than leavers, short-termers were much more likely to be leavers, and cyclers were somewhat more likely to be leavers than stayers. When looking at other characteristics, Ver Ploeg found that those who were more welfare dependent—with longer spells and less work experience—were considerably less likely to leave the rolls subsequent to 1995 than those with less welfare dependency—shorter spells and more work experience. When looking at the differential wage and employment outcomes of leavers, she found, perhaps surprisingly, that differences across the three welfare experience groups were quite modest: virtually all three groups had employment rates of 55–65 percent and approximately the same level of earnings. However, there were much stronger and more marked differences in leaver outcomes by the level of past work experience. Ver Ploeg also defined a “high barrier” group of initial recipients who had low levels of education, weak employment histories, and high levels of welfare dependency, and she found that they had much lower leaving rates and much worse outcomes after leaving than others.

Overall, although Ver Ploeg did not find as strong a correlation of earnings off welfare with welfare-experience groups as Moffitt and Stevens found, she found high levels of heterogeneity between welfare stayers and leavers and between different types of leavers. This substantiates many of the points made in our interim report (National Research Council, 1999) about the need to differentiate leavers into different subgroups and to make comparisons within such groups across states, rather than comparing overall averages.

The lessons of these three studies for the importance of welfare dynamics for welfare reform are many. First and foremost, these studies show that the welfare caseload is extremely heterogeneous with respect to welfare experience, employment history, and other key variables. They further suggest that this heterogeneity is quite different across states, which could lead to differences in outcomes as a result of that heterogeneity.[25]

[25] As individual states continue to modify their programs to meet the needs of their populations, the heterogeneity of the caseload across states is likely to increase.

Second, two of the studies show that heterogeneity in welfare experience is strongly correlated with employment and earnings off welfare and that heterogeneity in general is highly correlated with those variables. Third, the Ver Ploeg study demonstrated, specifically for a leaver study, the importance of disaggregating the caseload by heterogeneity measures and showed how variable leaver outcomes are for different groups. These studies are the first in the welfare reform literature to focus on these issues, and they show the value of disaggregation along these dimensions. The panel recommends that this perspective be incorporated into more welfare reform studies, both within and across states, in future research and evaluation.

Recommendation 4.5 A welfare dynamics perspective should be incorporated into more welfare reform studies, including leaver studies. In general, more disaggregation by levels of heterogeneity among leavers and stayers is needed, given the importance of disaggregation for outcomes on and off welfare.

ASSESSMENT OF CURRENT EVALUATION EFFORTS

The scope, volume, and diversity of existing studies on welfare reform described in Chapter 2 are impressive. However, a large fraction of those studies, if not the majority, are not concerned with formal outcome evaluation. Many are concerned with monitoring the well-being of the low-income population or segments of it and are not aimed at estimating any of the effects or outcomes discussed in this chapter. The National Survey of America’s Families, the Devolution and Urban Change Study, the Three-City Study, and many of the studies using census and other data sets to track the progress of the low-income population are not intended to formally evaluate the effects of welfare reform; instead, their primary purpose is monitoring different welfare-affected groups.[26] Although some have evaluation components, neither these studies nor the many excellent implementation studies of welfare reform mentioned in Chapter 2 are reviewed here, for their goal is not formal evaluation.

There are only three major types of existing projects whose primary goal is formal evaluation: studies of welfare leavers, randomized experiments, and caseload and other econometric studies. Even the first of these—leaver studies—is included only for discussion purposes, for most analysts agree that they are not intended as formal evaluations, at least as presently conducted.

Leaver Studies

The most common type of welfare reform study is the welfare leaver study, which examines the outcomes of a group of welfare recipients who have left the welfare rolls in the postreform era.

[26] Some of these studies, like the Urban Change Study and the Three-City Study and, in certain uses, the National Survey of America’s Families, have evaluation components. However, the major contribution of these studies to date is in their monitoring function.

Taken as an evaluation methodology rather than a monitoring method, the question such studies aim to answer is that of the overall effects of structural welfare reform, rather than the effect of any individual component or detailed strategy. For this purpose, these studies are weak and do not deserve the emphasis that they have received in the discussion of the effects of welfare reform. Aside from problems with the underlying data (see Chapter 5 and National Research Council [1999]), leaver studies suffer from a narrowness of focus, lack of cross-state comparability, and, most obviously, lack of a comparison group.

The narrowness of focus results from examining only a subset of the population affected by reform, generally ignoring stayers as well as divertees, rejected applicants, and discouraged nonapplicants. ASPE has begun to address the latter problem by funding projects to study applicants and diversion, but these efforts have yet to produce results and need to be strengthened and reinforced. The lack of cross-state comparability in the way leavers are defined (who is classified as a leaver and who is not) and in how outcomes are measured is a major barrier to comparing effects across areas and to correlating those effects with policy differences. The grantees that ASPE funded to conduct new leaver studies have made some decisions on uniformity of definition. While ASPE deserves credit for this, it falls short of what is needed, for there remain many differences in composition across states. Finally, the lack of a comparison group makes the results of leaver studies difficult to interpret because it is not known whether their outcomes are any different from those of welfare leavers prior to welfare reform. This problem has also begun to be addressed by ASPE as part of its encouragement of multiple cohort designs. However, few states have embraced this method and thus few results are available.

Constructing a comparison group for current leavers from past cohorts of leavers is more difficult than it may appear. Most of the multiple cohort studies discussed in Chapter 2 compare early post-PRWORA leavers to later post-PRWORA leavers, but what is needed is a comparison of post-PRWORA leavers to pre-PRWORA leavers. In addition, using pre-PRWORA leavers is problematic if a statewide welfare waiver was in place prior to 1996, for in that case even cohorts leaving AFDC just prior to PRWORA may have been affected by welfare reform. Another problem in the existing multiple cohort studies is that the question to be answered with such cohorts is not clearly defined. Most multiple cohort studies take any evidence of changing outcomes for leavers over time—such as lower employment rates—as an indication that more women with low skills are leaving the rolls over time. However, this interpretation ignores the original purpose of multiple cohort designs, which is to estimate the effect of a policy change on the outcomes that a given recipient or type of recipient would have. Differences in leaver outcomes could reflect either changes in the characteristics of those who leave welfare or the true effects of a change in policy. None of the cohort studies conducted thus far attempt to separate these alternatives, nor do many even acknowledge that it is an issue.

Finally, even with a correct cohort definition, more than one pre-PRWORA cohort is needed. Estimating the effects of the unemployment rate and other policy developments requires several cohorts over time.

Conclusion 4.7 Studies of the outcomes of welfare leavers contribute only one part of the story of welfare reform and, as an evaluation method, have been disproportionately emphasized relative to other methods. Studies that compare current leavers to those who left welfare prior to welfare reform and studies of divertees, applicants, and nonapplicant eligibles need more emphasis.

Recommendation 4.6 More methodological research is needed to assess and improve the credibility of the multiple cohort method of evaluating the overall effects of welfare reform. This research needs to study the best method to control for the time-series effects of other policies and the economic environment and how many cohorts are enough to do this.

Randomized Experiments

The number of randomized welfare experiments has declined since the passage of PRWORA. Most of the experiments in the early 1990s were the result of requirements by DHHS that any state granted a waiver from federal AFDC regulations was obligated to conduct an evaluation of that waiver, usually a randomized experiment. Since PRWORA, the federal government has lost its authority to mandate experiments; as the task of evaluation has moved to the states, there have been fewer experiments. To a considerable degree, this decline has been a natural result of the recognition that experimentation is not particularly appropriate for assessing the overall impact of a state’s new welfare program. Another reason for the decline of experiments has been a lack of interest among many state policy makers in using the old AFDC program as a counterfactual, for their general belief is that a return to the AFDC program is unlikely. Still another reason for the decline in experimentation is that most states have been doing considerable work in developing new programs in the post-PRWORA environment and have not faced a sufficiently settled and stable policy environment to consider experimentation.

There are a number of experiments ongoing from the pre-PRWORA waiver phase of experimentation and even one experiment that was initiated after the 1988 Family Support Act to evaluate the JOBS program. These experiments are of mixed usefulness for a number of reasons. In many cases the policy environment has changed. In addition, the many systemwide changes that have occurred over the 5 years since PRWORA was passed have unquestionably had spillover effects into the control groups, whose members are now unlikely to have outcomes that are the same as would have occurred if welfare reform had not taken place.

This problem is particularly acute when the reform in question is implemented statewide and the control group on the old AFDC program is only a small group of recipients in the context of a statewide altered programmatic environment.

The experiments that have been undertaken over the past decade have generally been aimed at estimating the overall effects of a bundle of separate welfare reforms, including work requirements, sanctions, time limits, and other provisions, all enacted and tested simultaneously. With rare exceptions, there have been no experiments that have isolated individual broad components or detailed strategies, varying each while holding all the other features of welfare reform fixed.[27] Although similar policy bundles have often been tested in more than one site, there has been no attempt to coordinate those bundles in a way that would permit isolation of broad components or detailed strategies (i.e., with two sites differing in only one respect). Thus, although it would be advantageous to examine the effects of broad components, experiments have not been designed to do so.

Although the recent experiments on welfare reform therefore have many problems of usefulness and validity, there is considerable scope for new experimentation on alternative detailed strategies and, to some extent, broad components. As noted above, experiments have their greatest advantage in this role because incremental change is of most interest and the overall welfare environment would be more settled. As states continue to study the issues of what works and for whom, experiments should play an increasingly prominent role in evaluation efforts. ACF is planning experiments on alternative employment retention strategies, a good example of tests of detailed strategies. Such experiments need to be supplemented by nonexperimental data collection in order to reach a complete picture of reform effects and to provide adequate generalizability. Experiments are also, in principle, still one of the better methods to test the effects of individual broad components—time limits, work requirements, sanctions, and other provisions—while holding other components fixed. Whether they should be used to do so depends on the degree of policy interest in those components. When they are appropriate, well designed and conducted, and have an adequate sample size, experiments offer uniquely strong evidence.

Recommendation 4.7 Experimental methods are underused in current designs of new welfare policy evaluations and should be employed in future studies evaluating different detailed reform strategies and different individual broad components.

[27] Some of the waivers did include evaluations of broad components of reform, for example, randomizing clients into a labor force attachment group or a human capital development group.

One of the obstacles to using experimental methods is that evaluation is now predominantly in the hands of state welfare administrators, who do not have a great deal of experience in designing experiments or knowledge of the operational implications of conducting them. Historically, state welfare agencies have not conducted much evaluation of their programs (although there are notable exceptions, often involving partnerships of state administrators with universities). Most evaluation has been initiated at the federal level and conducted by national research organizations. The devolution of legal authority for program design embodied in PRWORA was accompanied, for the most part, by a devolution of program evaluation. Consequently, the lack of experienced evaluation personnel at the state level is a significant barrier to the use of experimentation, and to evaluation in general, on welfare reform.

Much welfare evaluation expertise still remains in academia or with federal agencies, particularly ASPE and ACF. It is therefore natural for those federal agencies to continue to play an active role in sponsoring experiments at the state level and to promote such activities. In the absence of a strong federal presence, the lack of experienced personnel at the state level will result in many lost opportunities for fruitful experimentation. The federal government also has a role in assisting in the design of an overall coherent strategy of controlled variation across different states. A study within one state can be a substantial benefit to that state and can contribute to the overall pool of knowledge about programs and their effects; a cross-state experimental evaluation program with comparable studies can go further in yielding generalizable findings and can subsequently benefit all states.

Recommendation 4.8 The federal government should take a proactive role in sponsoring experiments at the state and local levels and should encourage planned variation and cross-state comparability to yield the maximum general knowledge.

Caseload and Other Econometric Models

A number of caseload and other econometric models have been used in evaluating welfare reform, as described in Chapter 2. All of them aim to estimate the overall effects of welfare reform, and a few attempt to estimate the effects of individual broad components as well. Perhaps the most distinguishing feature of these studies is that they are the only welfare reform studies that have attempted to control for economic conditions and to isolate the effects of welfare reform from those conditions.

While ambitious and deserving of investigation, these modeling efforts have thus far yielded a mixed record of success. They have produced some interesting findings and, in fact, the only findings on the overall effect of PRWORA controlling for the business cycle. However, there are significant problems with the studies that cast doubt on their validity. One problem is that the majority of the studies have used cross-state and over-time variation in pre-PRWORA programs to estimate the effects of welfare reform.

studies have used cross-state and over-time variation in pre-PRWORA programs to estimate the effects of welfare reform (a minimal sketch of this type of specification appears after Conclusion 4.8 below). Although the pre-PRWORA waiver programs are of significant interest in and of themselves, in most cases they cannot yield reliable information about PRWORA: PRWORA accelerated and greatly strengthened most of the provisions that were in waiver plans and also created overall structural changes in the welfare system. In addition, the sample sizes available in the major data sets are only barely capable of capturing welfare reform effects of the size to be expected. Thus, data limitations significantly reduce the value of these studies.

A handful of studies have been conducted using comparison-group designs with ineligibles and differences-in-differences methods (see Chapter 2). These studies have yielded significant and interesting results on overall effects. However, the validity of these comparison groups (see the discussion of this method above) has not received sufficient examination, leaving the results from these studies in a state of considerable uncertainty.

Despite the problems with the existing caseload and other econometric models for estimating overall effects, they have yielded reasonably credible estimates, because they have found significant effects on the outcomes that would be expected and because the magnitude of the expected effect is large. Thus, there is a reasonable chance that any biases that may exist are outweighed by the size of the effects.

The record of the econometric studies in estimating the effects of individual broad components is considerably worse. These studies have used pre-PRWORA cross-area variation in those components, in some cases, and post-PRWORA variation in a smaller set of policies (namely, those that vary cross-sectionally post-PRWORA). The results in these studies for the effects of components are highly variable in magnitude, sign, and significance, and they are generally not robust to specification changes. The results often do not accord with sensible expectations, an indication of underlying misspecification. It is quite probable that poorly measured policies at the broad component level, combined with sample size problems, have produced this result.

Conclusion 4.8 Caseload and other econometric models have produced a mixed set of results, partly because of data limitations and partly because of an inherent lack of policy variability. They have done somewhat better at producing ballpark estimates of the overall effects of welfare reform than at producing estimates of the effects of individual broad components.
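To fix ideas, the following minimal sketch shows the general shape of such a caseload specification: a state-year panel regression of the log caseload on a reform indicator, an economic-conditions control, and state and year fixed effects, which is one common way of exploiting cross-state and over-time variation. The panel is simulated, and every state, adoption date, coefficient, and variable name is an assumption made for this illustration; it does not reproduce the specification of any particular study reviewed here.

    # Minimal two-way fixed-effects (difference-in-differences) caseload sketch.
    # The panel is simulated; states, adoption years, and coefficients are
    # assumptions for illustration only.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)
    rows = []
    for s in range(20):                              # 20 hypothetical states
        adopt_year = int(rng.integers(1993, 1998))   # year the state adopts reform (assumed)
        for year in range(1990, 2000):
            reform = int(year >= adopt_year)         # 1 after the state's reform, 0 before
            unemp = rng.normal(6.0, 1.5)             # state-year unemployment rate (simulated)
            log_caseload = (0.05 * s                 # state fixed effect
                            - 0.02 * (year - 1990)   # common time trend
                            + 0.08 * unemp           # business-cycle effect
                            - 0.15 * reform          # "true" reform effect in this simulation
                            + rng.normal(0.0, 0.05)) # idiosyncratic noise
            rows.append(dict(state=s, year=year, reform=reform,
                             unemp=unemp, log_caseload=log_caseload))
    panel = pd.DataFrame(rows)

    # Reform effect identified from cross-state and over-time variation,
    # controlling for economic conditions and state and year fixed effects.
    fit = smf.ols("log_caseload ~ reform + unemp + C(state) + C(year)",
                  data=panel).fit(cov_type="cluster",
                                  cov_kwds={"groups": panel["state"]})
    print(fit.params["reform"], fit.bse["reform"])

The sketch only illustrates the structure of the estimator; the problems noted above, notably limited policy variation, small samples, and poorly measured components, would afflict this specification just as they do the published studies.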
Summary

Despite the large number of studies that have been and are being conducted on welfare reform, the record on evaluation of the three major questions we have put forth is not impressive. For the overall effect of PRWORA, a number of econometric models provide approximate estimates of effects on individual outcomes and state caseloads, but these studies are weakened by data limitations and lack of policy variability. Experimental tests of the overall effects of PRWORA have also been conducted, but they have many limitations. There have also been econometric estimates of the effects of individual broad components of welfare reform, but these have more serious problems than those estimating the overall effects; thus far, the results are not very reliable and lack credibility. There have been no experimental tests of the effects of adding or subtracting broad components. For detailed strategies, there have been almost no formal evaluations that isolate one strategy from all others (holding the others fixed) to determine the effect of the isolated strategy alone. There have been a few experiments on detailed strategies (e.g., the NEWWS demonstration), but these have the problem of control group contamination.

NEXT STEPS

For a mature society like the United States, with over 40 years of experience in evaluating social welfare programs, the record of accomplishment to date for such a major piece of social legislation is not sufficient. There are several evaluation studies in process that can help address some of these gaps (see Chapter 2). The multiple-cohort leaver studies already funded by ASPE, and its other studies of stayers and divertees, rejected applicants, and discouraged nonapplicants, will be valuable additions. The experiments planned by ACF will begin the process of testing alternative detailed strategies. Nevertheless, major new evaluation efforts are needed at the federal and state levels if the questions of interest for welfare reform research identified in Chapter 3 are to be addressed.

The current set of evaluation efforts is an uncoordinated collection of disparate studies without any overall coherence. Consequently, there are major gaps in the evaluation structure. Some private foundations have attempted to coordinate evaluation studies, but this is a role that should be played by DHHS, because it is the agency with responsibility for program operations and access to details about the program and related programs, and because it is the entity most likely to have a long-term commitment to evaluation of the program. Setting forth a clear and carefully considered agenda for the questions to be asked and the evaluation methods that should be brought to bear on each of the questions would go a long way toward ensuring that the necessary analysis is conducted. A leadership role in this area is needed.

Recommendation 4.8 The federal government, taking all agencies as a whole, has produced and funded a great deal of valuable monitoring research and a much smaller volume of evaluation research. A greater effort to produce a comprehensive evaluation framework
for social welfare programs that considers the major questions of interest and the evaluation methods appropriate for each is needed. A comprehensive framework for evaluation should be developed and used to guide the evaluation efforts under way by private and other public evaluation organizations. This should be an ongoing effort as new issues emerge, and it is a responsibility that should be taken on by ASPE in the U.S. Department of Health and Human Services.

In addition, the annual report to Congress recommended in Chapter 3 should include both a discussion of the important questions of welfare reform we have outlined and a presentation of the alternative evaluation methods that are currently being used to study these questions, including studies funded by ASPE as well as by others. The report should discuss the relative mix of experimental and nonexperimental methods being used and should present the agency’s views on whether the appropriate balance and mix is being achieved, in light of the relative strengths and weaknesses of each evaluation method. It should discuss which nonexperimental methods are being used and whether there is an appropriate balance among them. It should also relate ASPE’s own research agenda on evaluation methods to the overall landscape of evaluation and should present what ASPE sees as its own role in supporting good evaluation methods.

Recommendation 4.9 In its annual report to Congress, ASPE should review the existing landscape of evaluation methods, assess whether the appropriate balance of experimental and different nonexperimental methods is being achieved, and explain how evaluation methodology fits into its own research agenda.

At the state level, the capacity to conduct evaluations, both experimental and nonexperimental, is very weak. This situation must be addressed if better and more appropriately focused and directed evaluations are to take place. Here we recommend again that the federal government exert a leadership role in assisting states. In fact, both ASPE and ACF already devote some portion of their personnel and resources to such assistance, for example, through the welfare reform research and welfare outcomes conferences they have hosted for the past 3 years. But much more capacity-building effort is needed.

Conclusion 4.9 The panel finds that states’ capacity and resources to conduct evaluations of their own welfare reform programs are often below the level needed for such an important change in policy.

Recommendation 4.10 The panel recommends that the U.S. Department of Health and Human Services continue and expand its efforts to build capacity for conducting high-quality program evaluations at the state level through the provision of technical assistance,
the convening of research conferences, the promotion of exchanges of technical assistance among the states, and other capacity-building mechanisms.

Finally, given the decentralized nature of the evaluation of PRWORA, the disparate methods that have been and will be used, and the diversity of approaches to evaluation that have been taken, a major attempt to synthesize findings will be needed. Reconciling conflicting findings, combining experimental and nonexperimental results when appropriate, weighing the results of different studies in light of their strengths and weaknesses (one simple weighting scheme is sketched following Recommendation 4.11 below), combining quantitative and qualitative data, drawing lessons from monitoring studies as well as evaluation studies, and identifying and filling gaps in knowledge in order to arrive at a comprehensive, best-guess judgment on the different effects of welfare reform will be a challenging task. But it is a necessary one. Once again, we recommend that the federal government, whose interests, as charged by the electorate, are those of the nation as a whole, take a leadership role in this regard and fulfill the synthesizing function. ASPE, as the policy evaluation and development arm of DHHS, is the most appropriate agency to fill this role.

Recommendation 4.11 The panel recommends that ASPE be the primary agency responsible for synthesizing findings from studies of the consequences of changes in welfare programs.
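As one simplified illustration of weighting study results by their precision, the sketch below pools several hypothetical impact estimates using an inverse-variance average. The estimates, standard errors, and the fixed-effect pooling formula itself are assumptions made for illustration only; the panel does not prescribe any particular synthesis formula, and a genuine synthesis would also have to weigh design quality, comparability of outcomes and populations, and qualitative evidence.

    # Inverse-variance (precision-weighted) pooling of hypothetical study results.
    # All numbers are invented for illustration; they are not estimates from the
    # welfare reform literature discussed in this chapter.
    import numpy as np

    estimates = np.array([0.04, 0.07, 0.02, 0.05])       # hypothetical impact estimates
    std_errors = np.array([0.015, 0.030, 0.010, 0.020])  # hypothetical standard errors

    weights = 1.0 / std_errors**2                        # more precise studies receive more weight
    pooled = np.sum(weights * estimates) / np.sum(weights)
    pooled_se = np.sqrt(1.0 / np.sum(weights))

    print(f"Precision-weighted estimate: {pooled:.3f} (s.e. {pooled_se:.3f})")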